PCRE2Project / pcre2


Refactoring of match_data may cause a regression in peak memory usage #194

Open iskim517 opened 1 year ago

iskim517 commented 1 year ago

Hi,

We observed that this refactoring commit (Refactor match_data() to always use the heap instead of having an initial frames vector on the stack) causes about a 4 MB increase in peak memory usage on Android devices, mainly due to libselinux. So we tried calling pcre2_match_data_free() aggressively, but that of course massively impacted runtime with some allocators such as jemalloc.

Can pcre2 add a flag so we can choose to use the stack as before?

PhilipHazel commented 1 year ago

This is a nice example of how solving one problem introduces another. The change was made to reduce stack usage, for the benefit of environments with limited stack sizes (e.g. applications with many threads). If you are seeing 4 MB of heap usage, this means that some matches would, in the past, have had to use the heap (the stack allocation being too small), but they would have freed it afterwards. In the new arrangement, the heap is retained in order to reduce the number of malloc/free calls.

The HEAD code, in the last few days, has been updated to add a new function called pcre2_get_match_data_heapframes_size which returns the size of the heap that is being retained. This should allow you to do selective rather than aggressive pcre2_match_data_free, and is intended to help with your situation.

Adding a flag to use the stack sometimes would be extremely messy, and I suspect it would be very little different from making use of the new pcre2_get_match_data_heapframes_size.

iskim517 commented 1 year ago

> this means that some matches would, in the past, have had to use the heap (the stack allocation being too small)

Actually, that isn't always the case, I think. It depends on how we use pcre2. For example, here is how libselinux works.

So here, if the patterns are small, the previous version of pcre2 wouldn't allocate any heap memory. Currently the heap is always used regardless of each pattern's size, and the sum of all those retained heap blocks may be bigger than expected.

> Adding a flag to use the stack sometimes would be extremely messy

Of course.

> I suspect it would be very little different to making use of the new pcre2_get_match_data_heapframes_size.

Then one question: should we always use pcre2_match_data_free, which frees all the data? Couldn't we just shrink the heapframes size? I think that could at least halve the heap alloc/free count.

PhilipHazel commented 1 year ago

You mean have a function that calls realloc()? Something like pcre2_match_data_realloc_heapframes()? I suppose that would save freeing and allocating the match data block itself. It could realloc() down to the minimum size, which is 28K. Unfortunately, the custom memory allocation API for PCRE2 allows only for malloc/free, not realloc, so it would have to be implemented as free() followed by malloc() (at least when custom memory management is in use).

carenas commented 1 year ago

> Unfortunately, the custom memory allocation API for PCRE2 allows only for malloc/free, not realloc, so it would have to be implemented as free() followed by malloc() (at least when custom memory management is in use).

Or we could add a "realloc" to it, which would come in handy here and in a few other places. It would also help systems that have less memory (or address space) keep going even when they need bigger memory blocks, since currently we have to keep the old block around until its contents are copied to the new one before freeing it.

carenas commented 1 year ago

> So here, if the patterns are small, the previous version of pcre2 won't allocate any heaps. Currently the heap will always be used regardless of each pattern's size. The sum of such heaps may be bigger than expected.

One important thing to consider is that the "heap" doesn't always need to be allocated: PCRE2 provides a custom memory management hook, so you could use it to allocate memory from a pool on the stack instead, up to a limit.

For an example of how that works, take a look at the Apache httpd code.

PhilipHazel commented 1 year ago

Adding a custom realloc() facility would need a new function such as pcre2_general_context_create_with_realloc, but I suppose it would not be too much upheaval. It would increase the size of a number of data blocks by a pointer (probably 8 bytes in most cases). I suppose the obvious specification is "If realloc() is not supplied, use free+malloc". Note, however, that in the case we are discussing, that is, reducing the heapframes vector that is remembered in match data, there is no need to remember the contents of the memory. So perhaps just using free+malloc would be good enough. On the other hand, having realloc() available when increasing the size of the heapframes vector could be useful.

carenas commented 1 year ago

Yes, sorry to hijack the conversation; as you pointed out, it is not very relevant to the issue presented here, which IMHO could be better served by a different solution (like the one I proposed, used by Apache) or by the selective freeing that the new API allows.

Not sure where the 10 x framesize or 20K minimum came from, but it seems there are users that normally require much less, and for them there will always be an increased VM size compared with the pre-10.41 library. One way to soften the blow would be to make the minimum smaller through a configuration option, instead of shrinking it after the fact, IMHO.

PhilipHazel commented 1 year ago

It would be easy, I think, to provide pcre2_match_data_reset_heapframes(), which just frees them, and it could take a size argument: only free if larger. We could also provide a "threshold" setting (like the heap limit) that would trigger automatic freeing after a match if the frames got too big. As for the 10 x framesize and the 20K minimum: the latter was, I think, the size of the previous stack vector, and 10x came off the top of my head as probably enough for many straightforward matches (perhaps I looked at the tests).

carenas commented 1 year ago

The problem with doing that (which, BTW, I proposed and discarded in #190) is that there is no practical difference between doing a reset and recreating the match_data anyway.

Also, it wouldn't help the case presented here, since the problem is not that the frames have grown too much, but that they are now kept allocated, whereas before that memory use was hidden and discarded when the stack was unwound, except for matches that were bigger than what could fit on the stack.

PhilipHazel commented 1 year ago

There is a (possibly small) difference: if you just free the heap frames, then only one malloc is needed to re-create them. If you free the whole match data that's two frees and two mallocs. However, I think I prefer my "threshold" suggestion above -- which amounts to "only keep the heapframes vector if it isn't too big". Setting the threshold to zero would be the same as a reset.

carenas commented 1 year ago

> Setting the threshold to zero would be the same as a reset.

But that would just force another malloc on the next use of the match_data, making performance worse (compared with pcre2 < 10.41).

The interesting thing about the new heapframes is that they actually improve performance and reduce memory usage if used correctly, which in this case means:

After these changes, the overall maximum memory usage will be bounded by the concurrency instead of by the number of expressions evaluated, and the process will be faster and leaner.

PhilipHazel commented 1 year ago

My "threshold" suggestion is just an automatic way of implementing your last point, though there is no need to recreate the match data, because that happens automatically at the next match.

carenas commented 1 year ago

FWIW, I went ahead and implemented a proposal to fix libselinux based on the first two points, and the results look encouraging even though the code is not finished and my setup is not ideal.

Running `heaptrack restorecon -vRn /usr` with the different versions of the code on Debian 12 shows:

HEAD (includes workaround but not enabled) + system pcre2:

```
total runtime: 43.16s.
calls to allocation functions: 469370 (10876/s)
temporary memory allocations: 215204 (4986/s)
peak heap memory consumption: 10.11M
peak RSS (including heaptrack overhead): 22.62M
total memory leaked: 1.81K
```

proposal + system pcre2:

```
total runtime: 32.14s.
calls to allocation functions: 463711 (14428/s)
temporary memory allocations: 215203 (6696/s)
peak heap memory consumption: 9.27M
peak RSS (including heaptrack overhead): 21.77M
total memory leaked: 1.90K
```

proposal + new pcre2:

```
total runtime: 29.02s.
calls to allocation functions: 464810 (16015/s)
temporary memory allocations: 215191 (7414/s)
peak heap memory consumption: 2.47M
peak RSS (including heaptrack overhead): 7.36M
total memory leaked: 22.40K
```

There might be a bug somewhere (maybe in heaptrack, because ASAN doesn't see it), as it seems we leaked a whole heapframe, but otherwise the improvements are significant (using the new pcre2):

HEAD

```
total runtime: 39.96s.
calls to allocation functions: 536789 (13433/s)
temporary memory allocations: 214446 (5366/s)
peak heap memory consumption: 59.57M
peak RSS (including heaptrack overhead): 20.59M
total memory leaked: 1.91K
```

HEAD (workaround enabled)

```
total runtime: 226.04s.
calls to allocation functions: 509162297 (2252521/s)
temporary memory allocations: 254421578 (1125555/s)
peak heap memory consumption: 2.63M
peak RSS (including heaptrack overhead): 7.79M
total memory leaked: 1.91K
```

PhilipHazel commented 1 year ago

Well, if there's nothing for me to do, I am very happy to do it.

iskim517 commented 1 year ago

I tried the proposal and the result looks promising!

PhilipHazel commented 2 months ago

Has this issue gone away?