curiosity-ai / catalyst

🚀 Catalyst is a C# Natural Language Processing library built for speed. Inspired by spaCy's design, it brings pre-trained models, out-of-the box support for training word and document embeddings, and flexible entity recognition models.
MIT License

Regex timeout exception on processing document text #46

Closed: v-echo closed this issue 3 years ago

v-echo commented 3 years ago

Describe the bug: Processing a somewhat larger document text throws a RegexMatchTimeoutException.

To Reproduce: Extract the text from the attached document, then run a processing pipeline against the extracted text.

Expected behavior: The text should be processed correctly and no exception should be thrown.

Screenshots: err_regex

Additional context: The source document is attached, and the text extracted from it with Apache Tika is linked: Cillian_Murphy.pdf https://paste.ee/p/trAmx

Edit: the bug is not a showstopper. It seems that it's internally handled somewhere, but I still wonder if it's normal behavior.

theolivenbaum commented 3 years ago

Thanks for reporting it! The regex timeout is a failsafe against getting stuck in edge cases. I've never hit it before, but it's great to have a way to reproduce it locally. I'll push a fix for it tomorrow (and look into how to handle it better, too).
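
For readers unfamiliar with the failsafe mentioned above, here is a minimal sketch of how a match timeout works in .NET. It is not Catalyst's code: the class name `RegexTimeoutDemo`, the method `TryIsMatch`, the pattern, and the 100 ms limit are all assumptions chosen for illustration. Only `Regex`, its `matchTimeout` constructor argument, and `RegexMatchTimeoutException` come from the .NET base class library.

```csharp
using System;
using System.Text.RegularExpressions;

public static class RegexTimeoutDemo
{
    // Hypothetical pattern with nested quantifiers. On input that almost
    // matches, a backtracking engine may explore an exponential number of paths.
    private static readonly Regex Pathological =
        new Regex(@"^(a+)+$", RegexOptions.None, TimeSpan.FromMilliseconds(100));

    // Returns true or false when the match attempt completes,
    // or null when the match timeout fired first.
    public static bool? TryIsMatch(string input)
    {
        try
        {
            return Pathological.IsMatch(input);
        }
        catch (RegexMatchTimeoutException)
        {
            // The failsafe: give up on this input instead of hanging.
            return null;
        }
    }
}
```

Calling `RegexTimeoutDemo.TryIsMatch(new string('a', 40) + "!")` is the kind of input expected to trigger heavy backtracking. Catching the exception close to the match call keeps it from reaching the caller, at the cost of treating the input as "unknown".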

theolivenbaum commented 3 years ago

@v-echo thanks for the data to reproduce it. It was easy to narrow down to this URL regex test. We already had a timeout there to guard against catastrophic backtracking in the regex, but now that I see what's causing it, I've added a new test, which should keep the exception from being thrown in most cases.
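
The idea described above, a cheap test that runs before the expensive regex, can be sketched as follows. This is an illustration under stated assumptions, not the change that was actually pushed: the class `UrlCheck`, the method `LooksLikeUrl`, the simplified URL pattern, the length limits, and the 50 ms timeout are all hypothetical.

```csharp
using System;
using System.Text.RegularExpressions;

public static class UrlCheck
{
    // Hypothetical, simplified URL pattern; NOT the pattern Catalyst uses.
    private static readonly Regex UrlPattern = new Regex(
        @"^(https?://|www\.)[^\s/$.?#][^\s]*$",
        RegexOptions.IgnoreCase | RegexOptions.CultureInvariant,
        TimeSpan.FromMilliseconds(50));

    public static bool LooksLikeUrl(string token)
    {
        // Cheap guards first: reject tokens that cannot be URLs
        // (too short, too long, or with no scheme separator or "www." prefix)
        // without ever invoking the regex engine.
        if (token.Length < 4 || token.Length > 2048) return false;
        if (token.IndexOf("://", StringComparison.Ordinal) < 0 &&
            !token.StartsWith("www.", StringComparison.OrdinalIgnoreCase))
        {
            return false;
        }

        try
        {
            return UrlPattern.IsMatch(token);
        }
        catch (RegexMatchTimeoutException)
        {
            // The timeout stays in place as a last resort; treat as "not a URL".
            return false;
        }
    }
}
```

With a guard like this, most tokens in a long document never reach the regex at all, so the timeout is only relevant for the few inputs that pass the cheap checks.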

v-echo commented 3 years ago

It's not a problem, glad to help.