darvid / python-hyperscan

🐍 A CPython extension for the Hyperscan regular expression matching library.
https://python-hyperscan.readthedocs.io/en/latest/
MIT License

Segmentation Fault error when number of patterns reaches ~520000 #25

Closed dominusmi closed 3 years ago

dominusmi commented 3 years ago

Hello, first of all, thank you for making this library. It has been very useful!

I seem to have found a segmentation fault bug. The database compile function crashes once the number of patterns provided reaches somewhere between 522000 and 523000 (on my machine at least). The threshold is the same on every run and does not seem to depend on the complexity of the patterns. Note that this does not happen when using the C implementation directly, and it is not a RAM / out-of-memory problem.
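
The range above could be narrowed to an exact count with a bisection, running each attempt in a child process so that a segfault does not take down the search itself. This is only a sketch: `first_failing`, `compile_crashes` and the `CHILD` script are hypothetical helpers, the check on a negative return code assumes a POSIX system, and the search assumes that every count above the threshold crashes.

```python
import subprocess
import sys

# Hypothetical child script: builds n random 4-byte hex patterns and compiles them.
CHILD = r"""
import random, sys
import hyperscan
n = int(sys.argv[1])
rng = random.Random(0)
exprs = [("\\x%02x" * 4 % tuple(rng.randrange(256) for _ in range(4))).encode("ascii")
         for _ in range(n)]
hyperscan.Database().compile(expressions=exprs)
"""


def compile_crashes(n):
    """Return True if the child process compiling n patterns is killed by a signal."""
    result = subprocess.run([sys.executable, "-c", CHILD, str(n)])
    # On POSIX, a negative return code means the child died from a signal (e.g. SIGSEGV).
    return result.returncode < 0


def first_failing(lo, hi, fails):
    """Smallest n in (lo, hi] with fails(n) True, given fails(lo) False and fails(hi) True."""
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if fails(mid):
            hi = mid  # the crash already happens at mid, so look lower
        else:
            lo = mid  # mid still compiles, so look higher
    return hi
```

Calling `first_failing(522000, 523000, compile_crashes)` would then need about ten compile attempts to pin down the exact count.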

Here is a minimal reproducible example:

import hyperscan
import numpy as np

db = hyperscan.Database()
n = 521000
# Generate `n` patterns, each matching 4 bytes written as hex escapes (e.g. \x23\xff\x3d\xab).
# Each pattern is encoded to bytes with UTF-8, which is identical to ASCII here because
# every character's code point is below 128.
expressions = ["\\x{:02x}\\x{:02x}\\x{:02x}\\x{:02x}".format(*np.random.randint(0, 256, 4, np.uint8)).encode("utf-8") for _ in range(n)]

db.compile(expressions=expressions)
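
For anyone trying to reproduce this without NumPy, the pattern list can also be built with the standard library and a fixed seed, so every run feeds the compiler the same input. This is a sketch only: `make_patterns` and its `seed` parameter are hypothetical names, not part of the library.

```python
import random


def make_patterns(n, seed=0):
    """Return n byte patterns, each made of four \\xNN hex escapes."""
    rng = random.Random(seed)  # fixed seed, so the list is identical on every run
    return [
        "".join("\\x{:02x}".format(rng.randrange(256)) for _ in range(4)).encode("ascii")
        for _ in range(n)
    ]
```

The result can be passed in place of the list above, e.g. `db.compile(expressions=make_patterns(n))`.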

I looked at the source code and tried to track this down myself, but I'm very unfamiliar with the CPython C API and did not get very far.