VectorCamp / vectorscan

A portable fork of the high-performance regular expression matching library
https://www.vectorcamp.gr/project/vectorscan/

Allow to cancel hs_scan*() #139

Open rschu1ze opened 1 year ago

rschu1ze commented 1 year ago

We (ClickHouse) recently encountered some patterns that are extremely expensive to evaluate with Vectorscan/Hyperscan, for example bounded repeats such as "x{n,m}" (which are also documented as being expensive). As a mitigation, we now check patterns on a best-effort basis and reject those that will likely be expensive.
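To illustrate what such a best-effort check might look like, here is a minimal sketch in C. It is not ClickHouse's actual check: the function name `has_large_bounded_repeat` and the threshold `MAX_REPEAT_BOUND` are made up for this example, and the scan only looks for `{n}`, `{n,}` and `{n,m}` with a large bound.

```c
#include <ctype.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical threshold, chosen for illustration only. */
#define MAX_REPEAT_BOUND 1000UL

/* Returns true if `pattern` contains a bounded repeat {n}, {n,} or {n,m}
 * whose largest bound exceeds MAX_REPEAT_BOUND.
 * Best effort: escaped characters are skipped, but character classes
 * such as [{] are not understood. */
static bool has_large_bounded_repeat(const char *pattern)
{
    for (const char *p = pattern; *p != '\0'; ++p) {
        if (*p == '\\' && p[1] != '\0') {
            ++p;                /* skip the escaped character */
            continue;
        }
        if (*p != '{')
            continue;

        const char *q = p + 1;
        if (!isdigit((unsigned char)*q))
            continue;           /* not a repeat, e.g. a literal brace */

        char *end;
        unsigned long lo = strtoul(q, &end, 10);
        unsigned long hi = lo;
        if (*end == ',') {
            q = end + 1;
            if (isdigit((unsigned char)*q))
                hi = strtoul(q, &end, 10);   /* {n,m} */
            else
                end = (char *)q;             /* {n,}: no upper bound given */
        }
        if (*end != '}')
            continue;           /* malformed, leave it to the compiler */

        if (lo > MAX_REPEAT_BOUND || hi > MAX_REPEAT_BOUND)
            return true;
        p = end;                /* continue after the closing brace */
    }
    return false;
}
```

A caller would run this before handing the pattern to the compiler and reject the pattern when it returns true. Because the check is purely syntactic, it can both miss expensive patterns and reject harmless ones.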

A better solution would be to either

EDIT: Just noticed that it is pattern compilation, i.e. hs_compile_multi(), that becomes slow, not the scan. A callback for canceling hs_compile_*() would be great.

(*) ClickHouse actually only uses block mode, not streaming or vector modes.

markos commented 1 year ago

Hi @rschu1ze, we can provide the second method, but it will go into the next version. This one (5.4.9) needs to be released asap; it's already overdue.

rschu1ze commented 1 year ago

That would be awesome, thanks :)

markos commented 1 year ago

We need to release 5.4.10 asap, so this is moved to the next version. However, it will not take that long, as we have increased our resources on this project.

markos commented 9 months ago

@rschu1ze we will begin development of this feature now. As explained in the Readme, due to the recent closed-sourcing of the original hyperscan project for versions >5.4, we will continue to keep compatibility with version 5.4, but we will not pursue compatibility with later IPL hyperscan versions. This is actually a good thing for us, as it allows us to extend functionality without needing to chase the original project anymore.

Now, with regard to this problem, we intend to add a few more hs_scan_*_extended() functions that can do things that the original API does not provide, but without changing the original API.

We will start with adding another periodic callback function as you called it, with a user provided period. Is there anything else that you would like to add in this, now that we're still in the design phase?
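For illustration only, the idea of a periodic callback with a user-provided period could be sketched as below. Nothing here is Vectorscan API: `periodic_cb`, `scan_with_periodic_callback`, `SCAN_OK` and `SCAN_CANCELLED` are hypothetical names, the period is assumed to be measured in bytes (it could just as well be time), and the loop is a stand-in for the real matching engine.

```c
#include <stddef.h>

/* Hypothetical callback type: return non-zero to request cancellation. */
typedef int (*periodic_cb)(size_t bytes_scanned, void *ctx);

enum { SCAN_OK = 0, SCAN_CANCELLED = -1 };   /* hypothetical result codes */

/* Stand-in for a scan loop. It walks the input in chunks of `period` bytes
 * and calls `on_tick` after each chunk; a non-zero return stops the scan.
 * `scanned_out` (optional) receives the number of bytes processed. */
static int scan_with_periodic_callback(const char *data, size_t len,
                                       size_t period, periodic_cb on_tick,
                                       void *ctx, size_t *scanned_out)
{
    size_t pos = 0;
    int rc = SCAN_OK;

    if (period == 0)
        period = len ? len : 1;   /* period 0: one tick at the end */

    while (pos < len) {
        size_t chunk = (len - pos < period) ? len - pos : period;
        (void)data;               /* real matching over data[pos..pos+chunk) goes here */
        pos += chunk;
        if (on_tick && on_tick(pos, ctx) != 0) {
            rc = SCAN_CANCELLED;
            break;
        }
    }
    if (scanned_out)
        *scanned_out = pos;
    return rc;
}

/* Example callback: cancel once a byte budget stored in ctx is exceeded. */
static int cancel_after_budget(size_t bytes_scanned, void *ctx)
{
    const size_t *budget = ctx;
    return bytes_scanned > *budget;
}
```

The existing match callback can already stop a scan by returning non-zero, but it only fires when a pattern matches. A periodic callback would also give the caller a chance to cancel while an expensive scan produces no matches at all.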

rschu1ze commented 8 months ago

@markos Sorry for not checking back earlier.

New functions hs_scan_*_extended() would be fine for us (and I understand your motivation of not breaking existing use cases). But we would also be fine with extending hs_scan_*() itself, e.g. in a new API-incompatible major version.

> We will start with adding another periodic callback function as you called it, with a user provided period.

Sounds good, looking forward to this. The only addition I would have is that pattern compilation is also prone to ReDoS attacks, meaning that a similar mechanism in hs_compile_*() would be helpful.