pcre2_regexec is not thread-safe

ChezBunch commented 1 year ago

Traditionally regexec is thread-safe, meaning once a given regex is compiled into a regex_t, it can be used simultaneously in different threads.

But pcre2_regexec is not thread-safe: since regex_t contains a pcre2_match_data and that object is modified during matching, two threads cannot execute the same regex_t at the same time.

We were bitten hard by this limitation in a heavily threaded program (various regexs failing unexpectedly and even some SIGSEGV), since we did not expect such a behavior:

The pcre2posix documentation is silent on the matter
The libc version used previously was thread-safe
pcre2_regexec expecting a pointer to a const regex_t * seems to imply that the POSIX API is thread-safe regarding to a compiled regex
The native API is thread-safe regarding to the matching

Should the POSIX layer be modified to be thread-safe? Should the pcre2posix documentation be updated?

In the meantime, we modified our pcre2_regexec version to allocate and free a pcre2_match_data object on each invocation, but this means some performances degradation. Another option would be to embed a mutex in the regex_t object.

PhilipHazel commented 1 year ago

Oh dear. The POSIX wrapper was always an "added extra" for PCRE, trying to map the POSIX API onto PCRE's native one. I suppose (I can't remember) that attaching a match_data object to the compiled regex seemed like a good way of saving a malloc/free for each match, and it was probably me that didn't think about multithreading. I do not like the idea of using a mutex because all the PCRE2 code should be simple Standard C for use in many environments. Allocating and freeing match_data obviously works, and I cannot see any other way of making this thread safe, My inclination is to document the fact that it is not thread safe, and encourage multithreading applications to use the native API, but what does anybody else think?

carenas commented 1 year ago

any other way of making this thread safe

something I recommended before for multithreaded use was to allocate the match data as a thread local and reuse it on each thread. The native API makes that easy to do because the match data is a parameter for the match function, so maybe providing a way to give that to the "POSIX" API might help as well.

I might be wrong, but AFAIK regex_t is an opaque type so that might be feasible without changing the API, although it won't be backward compatible and would be awkward to use, since the match data needs to match the number of pairs that the expression needs, or at least have more available that what the matching function will use.

use the native API,

isn't this problem also present in there though, because of the heapframes for the interpreter since 10.41 if the match_data is shared?

I think another "gotcha" that is also undocumented is the fact that our regoff_t is an int (probably because it was what GNU did, and because pcre1 also used that for its offset), pcre2 uses PCRE2_SIZE (aka size_t) which would be a better fit for POSIX 1003.1-2008, but we have no way to change that as of now, and therefore the wrapper is not really compatible with that version of the standard AFAIK.

PhilipHazel commented 12 months ago

The documentation now states that the POSIX functions are not thread-safe, so I am closing this issue.

carenas commented 12 months ago

While I agree this is "closed", I think the current "solution" is suboptimal.

I assume there might be some people out there that would rather have this interface to be really POSIX compatible (including of course being thread safe) to use it, or NOT having the option to use it, since apparently "assuming" it should be POSIX compatible just because of the name seems to be common.

PhilipHazel commented 12 months ago

I think I am beginning to regret having introduced this wrapper.

carenas commented 12 months ago

regret

maybe for pcre3 we can remove it, and make it an independent library that uses PCRE, then it can be made to be C11 or higher and therefore making it POSIX compatible would be very simple.

Indeed, it could even had some backward compatibility layer all the way to C89, GNU89 or POSIX if needed.

I am sure someone might even argue it is a fun way to learn some rust ;)

MatthewVernon commented 12 months ago

maybe for pcre3

Pre-emptively, can I strongly suggest never releasing anything called pcre3 and skipping straight to pcre4?

For historical reasons relating to sonames, old-pcre ended up being called pcre3 in Debian (and thus all Debian's derivatives, including Ubuntu), so re-using that name would cause no end of confusion.

carenas commented 12 months ago

in Debian

wasn't Debian migrating every single PCRE package to use PCRE2?, hopefully by the time we prepare the first PCRE3 release that should be completed, and therefore no risk for confusion.

BTW, something that might be also a good idea for that version, might be adding version numbers to the functions and so the "need" of add a version number to the library name could be avoided alltogether.

curious though why "3" was selected for PCRE1 and "2" for PCRE2, that might be the biggest source of confusion IMHO.

PhilipHazel commented 12 months ago

One thing is certain: it won't be me that puts out pcre3.

MatthewVernon commented 12 months ago

in Debian

wasn't Debian migrating every single PCRE package to use PCRE2?, hopefully by the time we prepare the first PCRE3 release that should be completed, and therefore no risk for confusion.

We are, but it's proving quite a slow process :( And also, pcre3 will remain in the archives until at least the last Debian release it was in is older than oldoldstable (i.e. quite a long time!)

curious though why "3" was selected for PCRE1 and "2" for PCRE2, that might be the biggest source of confusion IMHO.

Because the then package maintainer was putting the soname in the package name (like we still do for e.g. libc6), and we'd got to 3 before pcre2 was released.

PCRE2Project / pcre2

pcre2_regexec is not thread-safe #318