Open iskim517 opened 1 year ago
This is a nice example of how solving one problem introduces another. The change was made to reduce stack usage, for the benefit of environments with limited stack sizes (e.g. applications with many threads). If you are seeing 4MB heap usage, this means that some matches would, in the past, have had to use the heap (the stack allocation being too small), but they would have freed the heap afterwards. In the new arrangement, the heap is retained in order to reduce the number of malloc/free calls.
The HEAD code, in the last few days, has been updated to add a new function called pcre2_get_match_data_heapframes_size which returns the size of the heap that is being retained. This should allow you to do selective rather than aggressive pcre2_match_data_free, and is intended to help with your situation.
Adding a flag to use the stack sometimes would be extremely messy, and I suspect it would be very little different to making use of the new pcre2_get_match_data_heapframes_size.
this means that some matches would, in the past, have had to use the heap (the stack allocation being too small)
Actually it isn't always the case, I think. It depends on how we use pcre2. For example, here is how libselinux
works.
libselinux
has a list of pcre2_match_data
.libselinux
searches for a matching pattern, libselinux
loops through the list and calls pcre2_match
.pcre2_match_data
is freed at once.So here, if the patterns are small, the previous version of pcre2 won't allocate any heaps. Currently the heap will always be used regardless of each pattern's size. The sum of such heaps may be bigger than expected.
Adding a flag to use the stack sometimes would be extremely messy
Of course.
I suspect it would be very little different to making use of the new pcre2_get_match_data_heapframes_size.
Then one question: should we always use pcre2_match_data_free
which frees all data? Can't we just shrink the heap frames size? I think it can at least halve the heap alloc/free count.
You mean have a function that calls realloc()? Something like pcre2_match_data_realloc_heapframes()? I suppose that would save freeing and allocating the match data block itself. It could realloc() down to the minimum size, which is 28K. Unfortunately, the custom memory allocation API for PCRE2 allows only for malloc/free, not realloc, so it would have to be implemented as free() followed by malloc() (at least when custom memory management is in use).
Unfortunately, the custom memory allocation API for PCRE2 allows only for malloc/free, not realloc, so it would have to be implemented as free() followed by malloc() (at least when custom memory management is in use).
Or we could add a "realloc" to it, which will come handy here and in a few other places. It will also help systems that have less memory (or address space) to keep going even when they need bigger memory blocks, as now we have to keep the old block around until memory is copied to the new one to free it.
So here, if the patterns are small, the previous version of pcre2 won't allocate any heaps. Currently the heap will always be used regardless of each pattern's size. The sum of such heaps may be bigger than expected.
One important thing to consider is that "heap" doesn't need to be always allocated, PCRE2 provides a custom memory manager, so you could use that to allocate memory from a pool in the stack instead up to a limit.
For an example of how that works, take a look at the Apache httpd code.
Adding a custom realloc() facility would need a new function such as pcre2_general_context_create_with_realloc, but I suppose it would not be too much upheaval. It would increase the size of a number of data blocks by a pointer (probably 8 bytes in most cases). I suppose the obvious specification is "If realloc() is not supplied, use free+malloc". Note, however, that in the case we are discussing, that is, reducing the heapframes vector that is remembered in match data, there is no need to remember the contents of the memory. So perhaps just using free+malloc would be good enough. On the other hand, having realloc() available when increasing the size of the heapframes vector could be useful.
Yes, sorry to hijack the conversation, as you pointed out is not very relevant to the issue presented here which IMHO could be better served with a different solution (like the one I proposed and used by Apache) or the selective freeing that is allowed by the new API.
Not sure where the 10 x framesize or 20K minimum came from, but it would seem that there are users that require normally much less, and for them there will always be increased VMM sizes compared with the pre 10.41 library. One way to soften the blow would be to make the minimum smaller through a configuration instead of shrinking it after the fact IMHO.
It would be easy, I think, to provide pcre2_match_data_reset_heapframes() which just frees them, and it could have an argument giving a size -- only free if larger. Also, we could provide a "threshold" setting (like heap limit) which would trigger automatic freeing after a match if the frames got too big. As for the 10 x framesize and 20K minimum, the latter was, I think the size of the previous stack vector, and 10x came off the top of my head as probably enough for many straightforward matches (perhaps I looked at the tests).
The problem with doing that (which BTW I proposed and discarded in #190) was that there is no practical difference between doing a reset or recreating the match_data anyway.
Also, it wouldn't help the case presented here, since the problem is not that the frames had grown too much, but that now they are being kept allocated while before the use of that memory was hidden and discarded when the stack was destroyed, except for the matches that were bigger than what could fit in the stack.
There is a (possibly small) difference: if you just free the heap frames, then only one malloc is needed to re-create them. If you free the whole match data that's two frees and two mallocs. However, I think I prefer my "threshold" suggestion above -- which amounts to "only keep the heapframes vector if it isn't too big". Setting the threshold to zero would be the same as a reset.
Setting the threshold to zero would be the same as a reset.
but that would just force another malloc for the next use of the match_data, making the performance worst (as compared with pcre < 10.41).
the interesting thing about the new heapframes is that it actually improves performance and reduces memory usage if used correctly, which in this case means:
after these changes, the overall maximum memory usage will be only bounded to the concurrency instead to the number of expressions evaluated and the process will be faster and more lean.
My "threshold" suggestion is just an automatic way of implementing your last point - though no need to recreate because that will happen automatically at the next match.
FWIW went ahead and implemented a proposal to fix libselinux based on the first two points and the results look encouraging even if the code is not finished and my setup is not ideal.
Running heaptrack restorecon -vRn /usr
with the different versions of the code in Debian 12 shows:
HEAD (includes workaround but not enabled) + system pcre2:
total runtime: 43.16s.
calls to allocation functions: 469370 (10876/s)
temporary memory allocations: 215204 (4986/s)
peak heap memory consumption: 10.11M
peak RSS (including heaptrack overhead): 22.62M
total memory leaked: 1.81K
proposal + system pcre2:
total runtime: 32.14s.
calls to allocation functions: 463711 (14428/s)
temporary memory allocations: 215203 (6696/s)
peak heap memory consumption: 9.27M
peak RSS (including heaptrack overhead): 21.77M
total memory leaked: 1.90K
proposal + new pcre2:
total runtime: 29.02s.
calls to allocation functions: 464810 (16015/s)
temporary memory allocations: 215191 (7414/s)
peak heap memory consumption: 2.47M
peak RSS (including heaptrack overhead): 7.36M
total memory leaked: 22.40K
There might be a bug somewhere (maybe in heaptrack, because ASAN doesn't see it) as it seems we leaked a whole heapframe, but otherwise the improvements are significant (using new pcre2):
HEAD
total runtime: 39.96s.
calls to allocation functions: 536789 (13433/s)
temporary memory allocations: 214446 (5366/s)
peak heap memory consumption: 59.57M
peak RSS (including heaptrack overhead): 20.59M
total memory leaked: 1.91K
HEAD (workaround enabled)
total runtime: 226.04s.
calls to allocation functions: 509162297 (2252521/s)
temporary memory allocations: 254421578 (1125555/s)
peak heap memory consumption: 2.63M
peak RSS (including heaptrack overhead): 7.79M
total memory leaked: 1.91K
Well, if there's nothing for me to do, I am very happy to do it.
Has this issue gone away?
Hi,
We observed that this refactoring commit (Refactor match_data() to always use the heap instead of having an initial frames vector on the stack) causes about 4 MB increase of peak memory usage on Android device, mainly due to libselinux. So we tried aggressively doing pcre2_match_data_free(), but that of course massively impacted runtime on some allocators like jemalloc.
Can pcre2 add a flag, so we can choose to use stack as before?