darvid / python-hyperscan

šŸ A CPython extension for the Hyperscan regular expression matching library.
https://python-hyperscan.readthedocs.io/en/latest/
MIT License

Memory leak in Database object when compiling, dumping and loading. #46

Closed BartVerc closed 1 year ago

BartVerc commented 2 years ago

Thanks for the great library. We discovered a memory leak when a Database object is compiled, dumped, and loaded multiple times. The leak gets worse as the database grows larger. Even creating a new Database object every time still results in a leak.

Compiling one pattern in the same Database object:

import hyperscan as hs

db = hs.Database(mode=hs.HS_MODE_BLOCK)
for i in range(100000):
    db.compile(expressions=[b'test'], ids=[1], flags=[hs.HS_FLAG_ALLOWEMPTY])

(plot: memory usage for case 1)

Creating a new Database object for every compile slows the leak down.

import hyperscan as hs

for i in range(100000):
    db = hs.Database(mode=hs.HS_MODE_BLOCK)
    db.compile(expressions=[b'test'], ids=[1], flags=[hs.HS_FLAG_ALLOWEMPTY])

(plot: memory usage for case 2)

But when the Database object is also dumped (and loaded), the memory usage grows faster.

import hyperscan as hs

db = hs.Database(mode=hs.HS_MODE_BLOCK)
for i in range(100000):
    db.compile(expressions=[b'test'], ids=[1], flags=[hs.HS_FLAG_ALLOWEMPTY])
    b = hs.dumpb(db)

(plot: memory usage for case 3)

I tried to dig into the C code to find the problem, but I do not have enough expertise to solve the issue. Can someone help or point me in the right direction? Thanks!
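
For reference, a rough way to watch the growth numerically instead of plotting it: the sketch below instruments case 3 with the standard-library resource module (Unix-only; the reporting interval is arbitrary).

import resource

import hyperscan as hs

db = hs.Database(mode=hs.HS_MODE_BLOCK)
for i in range(100000):
    db.compile(expressions=[b'test'], ids=[1], flags=[hs.HS_FLAG_ALLOWEMPTY])
    b = hs.dumpb(db)
    if i % 10000 == 0:
        # ru_maxrss is the peak resident set size so far (KiB on Linux, bytes on macOS)
        print(i, resource.getrusage(resource.RUSAGE_SELF).ru_maxrss)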

Ivaylo-Korakov commented 2 years ago

I am experiencing the same issue, but unfortunately I am also not a C wizard.

darvid commented 1 year ago

this should be fixed in the latest release, but please confirm when you get a chance before I close this ticket.

apologies for the delay

BartVerc commented 1 year ago

Hi @darvid, I was just testing this with the latest version available from PyPI, which seems to be v0.3.2. So I wanted to let you know that v0.3.3 is not published on PyPI yet :)

BartVerc commented 1 year ago

Hi @darvid, thanks for the memory leak fix. It resolves our first case, where we only compile the database in the loop X times.

However, when we are also dumping the database in the loop, we see that after some iterations the memory starts growing.

import hyperscan as hs

db = hs.Database(mode=hs.HS_MODE_BLOCK)
for i in range(100000):
    db.compile(expressions=[b'test'], ids=[1], flags=[hs.HS_FLAG_ALLOWEMPTY])
    b = hs.dumpb(db)

The memory also starts growing if you repeatedly load the bytes into a database, reusing the same variable each time.

import hyperscan as hs

db = hs.Database(mode=hs.HS_MODE_BLOCK)

db.compile(expressions=[b'test'], ids=[1], flags=[hs.HS_FLAG_ALLOWEMPTY])
b = hs.dumpb(db)

for i in range(100000):
    db = hs.loadb(b)

As we reuse the same variable, I would expect Python to clean up the memory. I can send you the plots if you're interested. Thanks again for looking into it!

darvid commented 1 year ago

ahhhh, looks like the leak is now coming from the scratch not being freed. Here's a quick workaround while I push and release a fix:

import hyperscan as hs

db = hs.Database(mode=hs.HS_MODE_BLOCK)

db.compile(expressions=[b'test'], ids=[1], flags=[hs.HS_FLAG_ALLOWEMPTY])
b = hs.dumpb(db)

for i in range(100000):
    del db.scratch  # dealloc scratch anytime you decref db
    db = hs.loadb(b)

Ivaylo-Korakov commented 1 year ago

You are a wizard @darvid :smile:

BartVerc commented 1 year ago

Thank you very much!

BartVerc commented 1 year ago

@darvid, we missed a segmentation fault with the code you suggested; it was masked by the memory profiler. Investigating further...

darvid commented 1 year ago

yeah, looks like the source of the issue is the conditional scratch creation (create_scratch) in loadb. The actual workaround is calling loadb like this:

hs.loadb(b, create_scratch=False)

I think I'll just remove that parameter, because calling compile will implicitly create a scratch if it's set to None anyways.
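
Applied to the load loop from the earlier report, that workaround looks roughly like this (a sketch; the database is compiled and dumped once, and each reload skips allocating a new scratch):

import hyperscan as hs

db = hs.Database(mode=hs.HS_MODE_BLOCK)
db.compile(expressions=[b'test'], ids=[1], flags=[hs.HS_FLAG_ALLOWEMPTY])
b = hs.dumpb(db)

for i in range(100000):
    # reload from the serialized bytes without creating a scratch each time
    db = hs.loadb(b, create_scratch=False)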

darvid commented 1 year ago

let's keep this issue open until I push a release

BartVerc commented 1 year ago

It's leaking less memory with hs.loadb(b, create_scratch=False), around half of it, but it's still leaking. I will keep digging into it.

darvid commented 1 year ago

hey, this should be fixed in the commit above, but I'm running into some annoying CI issues that are taking a while to fix and are preventing a release to PyPI (I'd rather fix the release pipeline than force-publish by hand)

until then you can try building from master; there's also a Dockerfile that should be up to date and help with compiling Hyperscan. Hopefully I'll have the build issues fixed ASAP

BartVerc commented 1 year ago

This solves the issues, thanks a lot!

darvid commented 1 year ago

0.4.0 is now on PyPI; closing for now, hopefully no more memory leaks šŸ˜…

darvid commented 1 year ago

also, really sorry for the inconvenience if you cloned the repo locally: I did a bunch of history rewriting and commit amending to reduce noise in the commit log while debugging the build process, so you may need to do a fresh clone next time.