Closed carenas closed 1 year ago
I had also thought of something similar, though instead of a new field match_data_foreign, I had thought of inventing a new option for regcomp. Actually, my thought wasn't very similar because I was going to have malloc/free for each match when the option was set (or not set, depending on the default). Your approach has better performance, but it requires the calling program to #include pcre2.h and invoke some native functions -- if someone is going to do that they might as well call the native compile and match, mightn't they? I agree that the documentation needs updating whether or not the code is changes.
I had thought of inventing a new option for regcomp
maybe it will make sense to do both, and that way avoid the extra flags I was adding to regex_t.
one advantage of doing this is that it can be also generalized for other use cases, like for example allowing JIT which is also not threading safe.
they might as well call the native compile and match
agree, and this is also why I am not sure if it is worth the trouble, but guess we don't know why they chose to use the POSIX api to begin with, considering it is more limited and doesn't even allow for using JIT.
so my assumption is that they probably already have a codebase that uses the POSIX API, and had found that performance is terrible (macOS TRE based POSIX API, is an example[1]) so they are using this as a stop gap until they can fully migrate to the native API, and in that case adding the native API header and using a few native calls (under some ifdef so they can still use the system provided POSIX API as a fallback) might not be an issue.
[1] https://lore.kernel.org/git/dw26ovefqhtvpdxdrj6rlzmosi5wg2avhsurh452jku3vf2ox3@fjkysuygd4ej/
I have a solution that could make pcre2posix thread-safe while preserving performance in the long run:
After a while, each thread will have a pcre2_match_data large enough to accommodate all executed regexs.
Sure, but as I pointed out in the proposed documentation change, you might want to monitor each threads' match data heap usage in the long run and recreate it if they are inefficiently sized.
The best solution would be to use the native API (in a similar way) instead IMHO, and you can even improve performance by using JIT; @ChezBunch why that is not an option?, and if you are stuck with the POSIX API, is there anything else we can add to it that would help?
We'll use the native API, so that's not an issue for us. But I think the pcre2posix layer is presented as a drop-in replacement to the system API while it's not because of the thread-safe issue. I would prefer others not to suffer like we suffered while debugging this issue in an heavily multi-threaded program. Regarding heap size, it depends on the number of compiled regexs and the number of threads. It would use way less memory than now if you have a lot of compiled regex and few threads. Of course, it's a trade-off, it could be worse for a few regexs and a lot of threads.
pcre2posix layer is presented as a drop-in replacement
can you help us identify where this misrepresentation is?, the documentation only says it is "POSIX-style" because the name of the functions are similar and they could be aliased, but it also says that the regular expression syntax used is PCRE2 and all the functions are doing is calling the native API for you.
I allways though it was funny we even decided to have error codes that won't match what the system API uses, and obviously our regcomp
. knows of a lot more flags than POSIX, some of them suspiciously PCRE2 looking.
There is even a whole paragraph that starts with "PCRE2 never intended to be a POSIX engine", and as I mentioned earlier, we are not even POSIX compliant because of the offset sizes. Frankly, the threading issue, while unfortunate and also a regression since 10.41, is just another bug for this rarely used set of functions.
Alright, just document the functions as thread-unsafe then.
Oh, and the "misconception", as you call it, comes from this: https://github.com/PCRE2Project/pcre2/blob/master/src/pcre2posix.h#L168 It's not "similar" names that "could be aliased", you explicitly provided those aliases.
you explicitly provided those aliases
that is a problematic "feature" which I think had to be added for backward compatibility with the old PCRE, so that might be more difficult to change than documentation.
@PhilipHazel: I think that it will be better to close this PR and maybe use some of the documentation suggestions in a new one, it might be worth doing your "fix" though which is also less intrusive.
let me know if you would want me to take a shot at it, but either way your input is likely to be needed with the documentation part.
we might want to add the user suggestion to the "TODO" list, and might replace or be an option on top of yours, but obviously will require threads (which are only available in C11, and are not very widely implemented yet).
I regret my incompetence in not realizing the thread implications of the change in 10.41 that caused this. Ironically, the reason for making the underlying change in the native API is listed as this: "9. Removed the use of an initial backtracking frames vector on the system stack in pcre2_match() so that it now always uses the heap. (In a multi-thread environment with very small stacks there had been an issue.) " That is, this threading issue was provoked by a different threading issue.
The original addition of regcomp and regexec wrappers to PCRE1 was at somebody's request, to make it easier for programs that already used that interface to switch over to PCRE. I have been persuaded over the years to add (as @carenas says) some non-POSIX flags. Perhaps I should have been more resistant.
I will work on the documentation to point out prominently that the interface now is not thread safe. Perhaps that is all that is needed.
I still don't understand both your remarks on the non thread-safety introduced in 10.41. We were previously using 10.39 version, and it was not thread-safe either. The pcre2_match_data was already embedded in the regex_t object and shared among threads. Since it's modified while the matching is performed, I can't see how it could be thread-safe. Anyway, the 10.39 version did not crash as badly as the 10.42, it just returned returned erroneous matching results (or bad offset in rm_so/rm_eo).
Interesting, I was incorrect then attributing the thread safety issues to the the change introduced with 10.41. If nothing, it might had been a net positive, since it made the thread safety problems obvious because it will cause crashes.
When looking at the logs (which made me found another bug, addressed in #333), there were comments about making the code thread safe, but guess it never was and so the thread safety bug is not even a regression.
Indeed, I realized later that the reference to 10.41 was a red herring. In PCRE2, the POSIX interface has never been thread safe. From the first release, 10.00, it has always retained a match data block. I am now somewhat amazed that it has taken nearly 9 years (10.00 was released in January 2015) for this to have caused a problem. However, I think this fact means that putting big warnings all over the documentation is all that now needs to be done.
An attempt to improve the situation in #318, although incomplete and not sure if worth the trouble.
Posted as a draft for discussion, and maybe as a basis for the documentation changes that will be needed eitherway.