chartbeat-labs / textacy

NLP, before and after spaCy
https://textacy.readthedocs.io

ValueError: [E088] #321

Closed thebennos closed 3 years ago

thebennos commented 3 years ago

textacy is installed in a Docker image and receives jobs from a message queue.

Traceback (most recent call last):
  File "app.py", line 194, in <module>
    rmq_receive()
  File "app.py", line 172, in rmq_receive
    channel.start_consuming()
  File "/usr/local/lib/python3.7/site-packages/pika/adapters/blocking_connection.py", line 1865, in start_consuming
    self._process_data_events(time_limit=None)
  File "/usr/local/lib/python3.7/site-packages/pika/adapters/blocking_connection.py", line 2026, in _process_data_events
    self.connection.process_data_events(time_limit=time_limit)
  File "/usr/local/lib/python3.7/site-packages/pika/adapters/blocking_connection.py", line 833, in process_data_events
    self._dispatch_channel_events()
  File "/usr/local/lib/python3.7/site-packages/pika/adapters/blocking_connection.py", line 567, in _dispatch_channel_events
    impl_channel._get_cookie()._dispatch_events()
  File "/usr/local/lib/python3.7/site-packages/pika/adapters/blocking_connection.py", line 1493, in _dispatch_events
    evt.properties, evt.body)
  File "app.py", line 62, in callback
    textacy_doc = textacy.make_spacy_doc(text, lang="de_core_news_lg")
  File "/usr/local/lib/python3.7/site-packages/textacy/spacier/core.py", line 161, in make_spacy_doc
    return _make_spacy_doc_from_text(data, lang)
  File "/usr/local/lib/python3.7/site-packages/textacy/spacier/core.py", line 192, in _make_spacy_doc_from_text
    return spacy_lang(text)
  File "/usr/local/lib/python3.7/site-packages/spacy/language.py", line 437, in __call__
    doc = self.make_doc(text)
  File "/usr/local/lib/python3.7/site-packages/spacy/language.py", line 465, in make_doc
    Errors.E088.format(length=len(text), max_length=self.max_length)
ValueError: [E088] Text of length 1072819 exceeds maximum of 1000000. The v2.x parser and NER models require roughly 1GB of temporary memory per 100,000 characters in the input. This means long texts may cause memory allocation errors. If you're not using the parser or NER, it's probably safe to increase the nlp.max_length limit. The limit is in number of characters, so you can check whether your inputs are too long by checking len(text).

The job has 8490 sentences and 141474 words.

Directly in spaCy I would do this:

import en_core_web_sm
nlp = en_core_web_sm.load()
nlp.max_length = len(text)

but with textacy this does not work. Is it possible to adjust the max_length with textacy?

Sorry, maybe I just do not see the correct solution. Too many trees!!

bdewilde commented 3 years ago

Hi @thebennos, when you make a doc and refer to the spaCy language pipeline by its string name — textacy.make_spacy_doc(text, lang="de_core_news_lg") — you automatically load that pipeline under the hood using its default configuration. However, you can load the pipeline as an object, modify its properties in-place, then refer back to it by name, and those properties will still be modified:

In [1]: import textacy
In [2]: textacy.make_spacy_doc("this is a short text.", lang="en_core_web_sm")
Out[2]: this is a short text.
In [3]: lang = textacy.load_spacy_lang("en_core_web_sm")
In [4]: lang.max_length = 10
In [5]: textacy.make_spacy_doc("this is a short text.", lang="en_core_web_sm")
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-5-1478fa5c920e> in <module>
----> 1 textacy.make_spacy_doc("this is a short text.", lang="en_core_web_sm")

...

~/.pyenv/versions/textacy-spacy3/lib/python3.9/site-packages/spacy/language.py in make_doc(self, text)
   1055         """
   1056         if len(text) > self.max_length:
-> 1057             raise ValueError(
   1058                 Errors.E088.format(length=len(text), max_length=self.max_length)
   1059             )

ValueError: [E088] Text of length 21 exceeds maximum of 10. The parser and NER models require roughly 1GB of temporary memory per 100,000 characters in the input. This means long texts may cause memory allocation errors. If you're not using the parser or NER, it's probably safe to increase the `nlp.max_length` limit. The limit is in number of characters, so you can check whether your inputs are too long by checking `len(text)`.
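The same pattern applies to the original problem in reverse: load the pipeline once, raise its max_length, then make the doc by name. A minimal sketch, continuing the session above (and assuming de_core_news_lg is installed and that text holds the long input from the job):

In [6]: lang = textacy.load_spacy_lang("de_core_news_lg")
In [7]: lang.max_length = len(text)  # or any value above the job's ~1.07M characters
In [8]: doc = textacy.make_spacy_doc(text, lang="de_core_news_lg")

This works because the pipeline returned by load_spacy_lang is cached and reused under the hood, which is exactly what the failing session above demonstrates. Note that spaCy's warning still applies: the parser and NER models can need roughly 1GB of temporary memory per 100,000 characters, so raising the limit trades the ValueError for potential memory pressure.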