PCRE2Project / pcre2

PCRE2 development is now based here.

multi threading #522

Open jmbnyc opened 1 month ago

jmbnyc commented 1 month ago

How are you supposed to allow a compiled regex pattern to be matched from multiple threads concurrently?

If I have a JIT stack that is thread local but the pcre2_match_context is effectively tied to the pattern, then I believe calling pcre2_jit_stack_assign will cause problems, because it modifies the context.

Can you advise the best way to have a compiled pattern concurrently matched without locking?

jmbnyc commented 1 month ago

I think I figured this out by reading the code. All params to the JIT must be thread local: the match data, the match context, and the JIT stack that is attached to the match context. Please confirm. It would be nice if the MT section made this extremely clear. It seems to make sense now, but it was not clear before I read the code and observed how the match context was being used (as a conduit for some params).

carenas commented 1 month ago

> All params to the jit must be thread local

For your setup, the pcre2_jit_compile() call itself should be done on each thread as explained in the documentation.

Additionally, if you use a custom thread stack then that needs to be assigned to each thread independently as explained in the JIT documentation.

The match_data (which could be reused by multiple serial calls to pcre2_match() in the same thread) cannot be shared between threads.

ltrzesniewski commented 1 month ago

> The pcre2_jit_compile() call itself must be done on each thread as explained in the documentation.

Are you sure? That's not how I understand the documentation: it says to lock the pcre2_code* before calling pcre2_jit_compile, or essentially make sure it's JIT compiled only once.

carenas commented 1 month ago

> Are you sure?

I was thinking that the fact the information is spread all around is confusing, especially as the interpreter is lock free and thread safe, and will only need a mutex when the patterns are compiled on demand.

JIT uses a mutex internally (although that is configurable as well) for its memory allocation of executable code, and uses a stack at match time that can't be shared between threads. It also creates non-PIC code, so pcre2_jit_compile() will need to be called again on a pcre2_code that was created with pcre2_code_copy().

Maybe we need a pcre2thread.3 man page.

FWIW, most of the time you CAN safely call pcre2_jit_compile() on the same pcre2_code more than once, and indeed the documentation encourages you to do so. You could also call pcre2_jit_compile() only once, as long as you make sure that each thread uses a different JIT stack at match time, which needs to be done implicitly and is more tricky to get right.

zherczeg commented 1 month ago

I am sorry if it is confusing. You need to compile the JIT code only once, with pcre2_jit_compile(). That call is not thread safe, but you can do it right after the normal compilation to avoid parallel compilation.

As for matching, your second comment is also correct. You can run the code in parallel in any number of threads, but each thread needs its own match data and JIT stack. I don't think you need a unique match context.

The "CONTROLLING THE JIT STACK" section here gives you more info: https://www.pcre.org/current/doc/html/pcre2jit.html

jmbnyc commented 1 month ago

My issue was that I was doing

:pcre2_jit_stack_assign(_pMatchContext, nullptr, pJitStack)

just before calling the match function. In the above, my code had _pMatchContext as an instance variable associated with a pcre2_code object, while pJitStack is thread local.

This causes a problem if another thread uses the regex object and does the same thing concurrently (_pMatchContext would be the same, but pJitStack would be a different thread local).

Thus, the fix was to have a thread-local match context and assign the JIT stack to it once, when the thread-local context is created.