YunoHost-Apps / paperless-ngx_ynh

Scan, index and archive all your physical documents

PDF uploads fail on version 2.11.6~ynh1 #123

Closed by CodeShakingSheep 1 month ago

CodeShakingSheep commented 2 months ago

Describe the bug

This is the same bug as described in https://github.com/paperless-ngx/paperless-ngx/issues/7519 (document consumption fails because the NLTK resource punkt_tab cannot be found), except that I'm on version 2.11.6~ynh1.

Context

Steps to reproduce

Upload any PDF file to paperless-ngx.

Expected behavior

The PDF file should upload successfully, without errors.

Logs

[2024-08-31 22:07:53,260] [ERROR] [paperless.tasks] ConsumeTaskPlugin failed: 2024-05 Ppc N.pdf: The following error occurred while storing document 2024-05 Ppc N.pdf after parsing:
**********************************************************************
  Resource punkt_tab not found.
  Please use the NLTK Downloader to obtain the resource:
  >>> import nltk
  >>> nltk.download('punkt_tab')
  
  For more information see: https://www.nltk.org/data.html
  Attempted to load tokenizers/punkt_tab/english/
  Searched in:
    - PosixPath('/var/www/paperless-ngx/nltk_data')
**********************************************************************
Traceback (most recent call last):
  File "/var/www/paperless-ngx/venv/lib/python3.9/site-packages/asgiref/sync.py", line 327, in main_wrap
    raise exc_info[1]
  File "/var/www/paperless-ngx/src/documents/consumer.py", line 670, in run
    document_consumption_finished.send(
  File "/var/www/paperless-ngx/venv/lib/python3.9/site-packages/django/dispatch/dispatcher.py", line 176, in send
    return [
  File "/var/www/paperless-ngx/venv/lib/python3.9/site-packages/django/dispatch/dispatcher.py", line 177, in <listcomp>
    (receiver, receiver(signal=self, sender=sender, **named))
  File "/var/www/paperless-ngx/src/documents/signals/handlers.py", line 95, in set_correspondent
    potential_correspondents = matching.match_correspondents(document, classifier)
  File "/var/www/paperless-ngx/src/documents/matching.py", line 37, in match_correspondents
    pred_id = classifier.predict_correspondent(document.content) if classifier else None
  File "/var/www/paperless-ngx/src/documents/classifier.py", line 413, in predict_correspondent
    X = self.data_vectorizer.transform([self.preprocess_content(content)])
  File "/var/www/paperless-ngx/src/documents/classifier.py", line 386, in preprocess_content
    words: list[str] = word_tokenize(
  File "/var/www/paperless-ngx/venv/lib/python3.9/site-packages/nltk/tokenize/__init__.py", line 142, in word_tokenize
    sentences = [text] if preserve_line else sent_tokenize(text, language)
  File "/var/www/paperless-ngx/venv/lib/python3.9/site-packages/nltk/tokenize/__init__.py", line 119, in sent_tokenize
    tokenizer = _get_punkt_tokenizer(language)
  File "/var/www/paperless-ngx/venv/lib/python3.9/site-packages/nltk/tokenize/__init__.py", line 105, in _get_punkt_tokenizer
    return PunktTokenizer(language)
  File "/var/www/paperless-ngx/venv/lib/python3.9/site-packages/nltk/tokenize/punkt.py", line 1744, in __init__
    self.load_lang(lang)
  File "/var/www/paperless-ngx/venv/lib/python3.9/site-packages/nltk/tokenize/punkt.py", line 1749, in load_lang
    lang_dir = find(f"tokenizers/punkt_tab/{lang}/")
  File "/var/www/paperless-ngx/venv/lib/python3.9/site-packages/nltk/data.py", line 579, in find
    raise LookupError(resource_not_found)
LookupError:
**********************************************************************
  Resource punkt_tab not found.
  Please use the NLTK Downloader to obtain the resource:
  >>> import nltk
  >>> nltk.download('punkt_tab')
  
  For more information see: https://www.nltk.org/data.html
  Attempted to load tokenizers/punkt_tab/english/
  Searched in:
    - PosixPath('/var/www/paperless-ngx/nltk_data')
**********************************************************************

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/var/www/paperless-ngx/src/documents/tasks.py", line 149, in consume_file
    msg = plugin.run()
  File "/var/www/paperless-ngx/src/documents/consumer.py", line 733, in run
    self._fail(
  File "/var/www/paperless-ngx/src/documents/consumer.py", line 304, in _fail
    raise ConsumerError(f"{self.filename}: {log_message or message}") from exception
documents.consumer.ConsumerError: 2024-05 Ppc N.pdf: The following error occurred while storing document 2024-05 Ppc N.pdf after parsing:
**********************************************************************
  Resource punkt_tab not found.
  Please use the NLTK Downloader to obtain the resource:
  >>> import nltk
  >>> nltk.download('punkt_tab')
  
  For more information see: https://www.nltk.org/data.html
  Attempted to load tokenizers/punkt_tab/english/
  Searched in:
    - PosixPath('/var/www/paperless-ngx/nltk_data')
**********************************************************************

eldertek commented 2 months ago

Same here!

CodeShakingSheep commented 1 month ago

Works again after upgrading to version 2.11.6~ynh2. Closing.
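
For anyone who hits the same LookupError and cannot upgrade right away, here is a rough sketch of how one might check for the missing resource and fetch it by hand. It is not taken from the package's install scripts. The data directory and resource path are copied from the log in the first comment, the helper names `resource_present` and `ensure_punkt_tab` are made up for illustration, and the sketch assumes that the downloaded archive unpacks to `tokenizers/punkt_tab/english/` under that directory and that the script runs inside the paperless-ngx virtualenv with write access to it.

```python
from pathlib import Path

# Paths copied from the log above; adjust them for other installs.
NLTK_DATA_DIR = Path("/var/www/paperless-ngx/nltk_data")
RESOURCE = "tokenizers/punkt_tab/english"


def resource_present(data_dir: Path, resource: str = RESOURCE) -> bool:
    """Return True if the unpacked resource directory exists under data_dir."""
    return (data_dir / resource).is_dir()


def ensure_punkt_tab(data_dir: Path = NLTK_DATA_DIR) -> bool:
    """Download punkt_tab into data_dir if it is missing.

    The download step needs the nltk package and network access.
    """
    if resource_present(data_dir):
        return True
    import nltk  # imported lazily so the check above works without nltk

    # download_dir tells the NLTK Downloader where to store the data.
    return bool(nltk.download("punkt_tab", download_dir=str(data_dir)))


if __name__ == "__main__":
    print("punkt_tab present:", resource_present(NLTK_DATA_DIR))
```

Calling `ensure_punkt_tab()` only triggers a download when the directory is absent, so it should be safe to run more than once. Whether the service also needs a restart afterwards is not covered by this thread.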