blacklanternsecurity / bbot

A recursive internet scanner for hackers.
https://www.blacklanternsecurity.com/bbot/
GNU General Public License v3.0

Unstructured error: NLTK Resource "punkt_tab" not found. #1651

Closed. TheTechromancer closed this issue 2 months ago

TheTechromancer commented 2 months ago

Recent bug in unstructured is preventing tests from passing:

2024-08-11T22:04:35.5170317Z ERROR    bbot.scanner:scanner.py:1195 Error in unstructured.handle_event(FILESYSTEM("{'path': '/tmp/.bbot_test/scans/testunstructured_test_5866686nbn/filedownload/20...", module=filedownload, tags={'file', 'filedownload', 'in-scope'})): /home/runner/work/bbot/bbot/bbot/modules/unstructured.py:100:handle_event(): 
2024-08-11T22:04:35.5170525Z **********************************************************************
2024-08-11T22:04:35.5170687Z   Resource punkt_tab not found.
2024-08-11T22:04:35.5170880Z   Please use the NLTK Downloader to obtain the resource:
2024-08-11T22:04:35.5170888Z 
2024-08-11T22:04:35.5171007Z   >>> import nltk
2024-08-11T22:04:35.5171150Z   >>> nltk.download('punkt_tab')
2024-08-11T22:04:35.5171247Z   
2024-08-11T22:04:35.5171440Z   For more information see: https://www.nltk.org/data.html
2024-08-11T22:04:35.5171454Z 
2024-08-11T22:04:35.5171681Z   Attempted to load tokenizers/punkt_tab/english/
2024-08-11T22:04:35.5171688Z 
2024-08-11T22:04:35.5171779Z   Searched in:
2024-08-11T22:04:35.5171916Z     - '/home/runner/nltk_data'
2024-08-11T22:04:35.5172236Z     - '/home/runner/.cache/pypoetry/virtualenvs/bbot-pd-UZ8Fz-py3.9/nltk_data'
2024-08-11T22:04:35.5172588Z     - '/home/runner/.cache/pypoetry/virtualenvs/bbot-pd-UZ8Fz-py3.9/share/nltk_data'
2024-08-11T22:04:35.5172916Z     - '/home/runner/.cache/pypoetry/virtualenvs/bbot-pd-UZ8Fz-py3.9/lib/nltk_data'
2024-08-11T22:04:35.5173041Z     - '/usr/share/nltk_data'
2024-08-11T22:04:35.5173178Z     - '/usr/local/share/nltk_data'
2024-08-11T22:04:35.5173298Z     - '/usr/lib/nltk_data'
2024-08-11T22:04:35.5173427Z     - '/usr/local/lib/nltk_data'
2024-08-11T22:04:35.5173571Z **********************************************************************
2024-08-11T22:04:35.5173577Z 
2024-08-11T22:04:35.5173879Z TRACE    bbot.scanner:scanner.py:1196 concurrent.futures.process._RemoteTraceback: 
2024-08-11T22:04:35.5173965Z """
2024-08-11T22:04:35.5174075Z Traceback (most recent call last):
2024-08-11T22:04:35.5174605Z   File "/opt/hostedtoolcache/Python/3.9.19/x64/lib/python3.9/concurrent/futures/process.py", line 246, in _process_worker
2024-08-11T22:04:35.5174792Z     r = call_item.fn(*call_item.args, **call_item.kwargs)
2024-08-11T22:04:35.5175188Z   File "/home/runner/work/bbot/bbot/bbot/modules/unstructured.py", line 152, in extract_text
2024-08-11T22:04:35.5175330Z     elements = partition(filename=file_path)
2024-08-11T22:04:35.5176078Z   File "/home/runner/.cache/pypoetry/virtualenvs/bbot-pd-UZ8Fz-py3.9/lib/python3.9/site-packages/unstructured/partition/auto.py", line 341, in partition
2024-08-11T22:04:35.5176192Z     elements = partition_pdf(
2024-08-11T22:04:35.5176968Z   File "/home/runner/.cache/pypoetry/virtualenvs/bbot-pd-UZ8Fz-py3.9/lib/python3.9/site-packages/unstructured/documents/elements.py", line 605, in wrapper
2024-08-11T22:04:35.5177087Z     elements = func(*args, **kwargs)
2024-08-11T22:04:35.5177930Z   File "/home/runner/.cache/pypoetry/virtualenvs/bbot-pd-UZ8Fz-py3.9/lib/python3.9/site-packages/unstructured/file_utils/filetype.py", line 706, in wrapper
2024-08-11T22:04:35.5178046Z     elements = func(*args, **kwargs)
2024-08-11T22:04:35.5178794Z   File "/home/runner/.cache/pypoetry/virtualenvs/bbot-pd-UZ8Fz-py3.9/lib/python3.9/site-packages/unstructured/file_utils/filetype.py", line 662, in wrapper
2024-08-11T22:04:35.5178910Z     elements = func(*args, **kwargs)
2024-08-11T22:04:35.5179632Z   File "/home/runner/.cache/pypoetry/virtualenvs/bbot-pd-UZ8Fz-py3.9/lib/python3.9/site-packages/unstructured/chunking/dispatch.py", line 74, in wrapper
2024-08-11T22:04:35.5179743Z     elements = func(*args, **kwargs)
2024-08-11T22:04:35.5180487Z   File "/home/runner/.cache/pypoetry/virtualenvs/bbot-pd-UZ8Fz-py3.9/lib/python3.9/site-packages/unstructured/partition/pdf.py", line 210, in partition_pdf
2024-08-11T22:04:35.5180664Z     return partition_pdf_or_image(
2024-08-11T22:04:35.5181472Z   File "/home/runner/.cache/pypoetry/virtualenvs/bbot-pd-UZ8Fz-py3.9/lib/python3.9/site-packages/unstructured/partition/pdf.py", line 357, in partition_pdf_or_image
2024-08-11T22:04:35.5181678Z     out_elements = _process_uncategorized_text_elements(elements)
2024-08-11T22:04:35.5182606Z   File "/home/runner/.cache/pypoetry/virtualenvs/bbot-pd-UZ8Fz-py3.9/lib/python3.9/site-packages/unstructured/partition/pdf.py", line 966, in _process_uncategorized_text_elements
2024-08-11T22:04:35.5182816Z     new_el = element_from_text(cast(Text, el).text)
2024-08-11T22:04:35.5183590Z   File "/home/runner/.cache/pypoetry/virtualenvs/bbot-pd-UZ8Fz-py3.9/lib/python3.9/site-packages/unstructured/partition/text.py", line 295, in element_from_text
2024-08-11T22:04:35.5183718Z     elif is_possible_narrative_text(text):
2024-08-11T22:04:35.5184568Z   File "/home/runner/.cache/pypoetry/virtualenvs/bbot-pd-UZ8Fz-py3.9/lib/python3.9/site-packages/unstructured/partition/text_type.py", line 80, in is_possible_narrative_text
2024-08-11T22:04:35.5184742Z     if exceeds_cap_ratio(text, threshold=cap_threshold):
2024-08-11T22:04:35.5185548Z   File "/home/runner/.cache/pypoetry/virtualenvs/bbot-pd-UZ8Fz-py3.9/lib/python3.9/site-packages/unstructured/partition/text_type.py", line 276, in exceeds_cap_ratio
2024-08-11T22:04:35.5185661Z     if sentence_count(text, 3) > 1:
2024-08-11T22:04:35.5186454Z   File "/home/runner/.cache/pypoetry/virtualenvs/bbot-pd-UZ8Fz-py3.9/lib/python3.9/site-packages/unstructured/partition/text_type.py", line 225, in sentence_count
2024-08-11T22:04:35.5186564Z     sentences = sent_tokenize(text)
2024-08-11T22:04:35.5187308Z   File "/home/runner/.cache/pypoetry/virtualenvs/bbot-pd-UZ8Fz-py3.9/lib/python3.9/site-packages/unstructured/nlp/tokenize.py", line 137, in sent_tokenize
2024-08-11T22:04:35.5187414Z     return _sent_tokenize(text)
2024-08-11T22:04:35.5188138Z   File "/home/runner/.cache/pypoetry/virtualenvs/bbot-pd-UZ8Fz-py3.9/lib/python3.9/site-packages/nltk/tokenize/__init__.py", line 106, in sent_tokenize
2024-08-11T22:04:35.5188268Z     tokenizer = PunktTokenizer(language)
2024-08-11T22:04:35.5188964Z   File "/home/runner/.cache/pypoetry/virtualenvs/bbot-pd-UZ8Fz-py3.9/lib/python3.9/site-packages/nltk/tokenize/punkt.py", line 1744, in __init__
2024-08-11T22:04:35.5189063Z     self.load_lang(lang)
2024-08-11T22:04:35.5189762Z   File "/home/runner/.cache/pypoetry/virtualenvs/bbot-pd-UZ8Fz-py3.9/lib/python3.9/site-packages/nltk/tokenize/punkt.py", line 1749, in load_lang
2024-08-11T22:04:35.5189920Z     lang_dir = find(f"tokenizers/punkt_tab/{lang}/")
2024-08-11T22:04:35.5190537Z   File "/home/runner/.cache/pypoetry/virtualenvs/bbot-pd-UZ8Fz-py3.9/lib/python3.9/site-packages/nltk/data.py", line 582, in find
2024-08-11T22:04:35.5190678Z     raise LookupError(resource_not_found)
2024-08-11T22:04:35.5190769Z LookupError: 
2024-08-11T22:04:35.5190911Z **********************************************************************
2024-08-11T22:04:35.5191070Z   Resource punkt_tab not found.
2024-08-11T22:04:35.5191326Z   Please use the NLTK Downloader to obtain the resource:
2024-08-11T22:04:35.5191332Z 
2024-08-11T22:04:35.5191452Z   >>> import nltk
2024-08-11T22:04:35.5191590Z   >>> nltk.download('punkt_tab')
2024-08-11T22:04:35.5191692Z   
2024-08-11T22:04:35.5191881Z   For more information see: https://www.nltk.org/data.html
2024-08-11T22:04:35.5191888Z 
2024-08-11T22:04:35.5192117Z   Attempted to load tokenizers/punkt_tab/english/
2024-08-11T22:04:35.5192122Z 
2024-08-11T22:04:35.5192210Z   Searched in:
2024-08-11T22:04:35.5192340Z     - '/home/runner/nltk_data'
2024-08-11T22:04:35.5192662Z     - '/home/runner/.cache/pypoetry/virtualenvs/bbot-pd-UZ8Fz-py3.9/nltk_data'
2024-08-11T22:04:35.5193010Z     - '/home/runner/.cache/pypoetry/virtualenvs/bbot-pd-UZ8Fz-py3.9/share/nltk_data'
2024-08-11T22:04:35.5193342Z     - '/home/runner/.cache/pypoetry/virtualenvs/bbot-pd-UZ8Fz-py3.9/lib/nltk_data'
2024-08-11T22:04:35.5193467Z     - '/usr/share/nltk_data'
2024-08-11T22:04:35.5193660Z     - '/usr/local/share/nltk_data'
2024-08-11T22:04:35.5193791Z     - '/usr/lib/nltk_data'
2024-08-11T22:04:35.5193924Z     - '/usr/local/lib/nltk_data'
2024-08-11T22:04:35.5194067Z **********************************************************************
2024-08-11T22:04:35.5194072Z 
2024-08-11T22:04:35.5194150Z """
2024-08-11T22:04:35.5194265Z 
2024-08-11T22:04:35.5194508Z The above exception was the direct cause of the following exception:
2024-08-11T22:04:35.5194513Z 
2024-08-11T22:04:35.5194630Z Traceback (most recent call last):
2024-08-11T22:04:35.5194966Z   File "/home/runner/work/bbot/bbot/bbot/scanner/scanner.py", line 1172, in _acatch
2024-08-11T22:04:35.5195055Z     yield
2024-08-11T22:04:35.5195358Z   File "/home/runner/work/bbot/bbot/bbot/modules/base.py", line 637, in _worker
2024-08-11T22:04:35.5195472Z     await self.handle_event(event)
2024-08-11T22:04:35.5195851Z   File "/home/runner/work/bbot/bbot/bbot/modules/unstructured.py", line 100, in handle_event
2024-08-11T22:04:35.5196134Z     content = await self.scan.helpers.run_in_executor_mp(extract_text, file_path)
2024-08-11T22:04:35.5196227Z LookupError: 
2024-08-11T22:04:35.5196363Z **********************************************************************
2024-08-11T22:04:35.5196527Z   Resource punkt_tab not found.
2024-08-11T22:04:35.5196708Z   Please use the NLTK Downloader to obtain the resource:
2024-08-11T22:04:35.5196715Z 
2024-08-11T22:04:35.5196837Z   >>> import nltk
2024-08-11T22:04:35.5196970Z   >>> nltk.download('punkt_tab')
2024-08-11T22:04:35.5197064Z   
2024-08-11T22:04:35.5197260Z   For more information see: https://www.nltk.org/data.html
2024-08-11T22:04:35.5197265Z 
2024-08-11T22:04:35.5197488Z   Attempted to load tokenizers/punkt_tab/english/
2024-08-11T22:04:35.5197493Z 
2024-08-11T22:04:35.5197584Z   Searched in:
2024-08-11T22:04:35.5197710Z     - '/home/runner/nltk_data'
2024-08-11T22:04:35.5198025Z     - '/home/runner/.cache/pypoetry/virtualenvs/bbot-pd-UZ8Fz-py3.9/nltk_data'
2024-08-11T22:04:35.5198372Z     - '/home/runner/.cache/pypoetry/virtualenvs/bbot-pd-UZ8Fz-py3.9/share/nltk_data'
2024-08-11T22:04:35.5198700Z     - '/home/runner/.cache/pypoetry/virtualenvs/bbot-pd-UZ8Fz-py3.9/lib/nltk_data'
2024-08-11T22:04:35.5198830Z     - '/usr/share/nltk_data'
2024-08-11T22:04:35.5198960Z     - '/usr/local/share/nltk_data'
2024-08-11T22:04:35.5199086Z     - '/usr/lib/nltk_data'
2024-08-11T22:04:35.5199218Z     - '/usr/local/lib/nltk_data'
2024-08-11T22:04:35.5199352Z **********************************************************************
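
The log itself suggests `nltk.download('punkt_tab')`. As a rough sketch of that workaround (not what bbot or unstructured actually ship), the resource could be checked for and fetched once before text extraction runs. The helper name `ensure_nltk_resource` and its `find`/`download` parameters are hypothetical and exist only so the logic can be exercised without network access. The real calls it falls back to are `nltk.data.find` and `nltk.download`.

```python
from typing import Callable, Optional


def ensure_nltk_resource(
    resource_path: str = "tokenizers/punkt_tab/english/",
    package: str = "punkt_tab",
    find: Optional[Callable[[str], object]] = None,
    download: Optional[Callable[[str], bool]] = None,
) -> bool:
    """Return True if the NLTK resource is available, downloading it if needed.

    `find` and `download` are hypothetical injection points for testing;
    by default they resolve to nltk.data.find and nltk.download.
    """
    if find is None or download is None:
        import nltk  # imported lazily so the helper works with fakes in tests

        find = find or nltk.data.find
        download = download or (lambda name: nltk.download(name, quiet=True))

    try:
        find(resource_path)  # raises LookupError when the resource is missing
        return True
    except LookupError:
        pass

    # Resource is missing: try to fetch it, then verify it is really there.
    if not download(package):
        return False
    try:
        find(resource_path)
        return True
    except LookupError:
        return False
```

Calling `ensure_nltk_resource()` with no arguments during module setup or in a CI step would look for `tokenizers/punkt_tab/english/` in the search paths listed above and download `punkt_tab` if it is absent. Because the failing call runs in a worker process, the download would need to land in one of those on-disk locations rather than in per-process state.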

TheTechromancer commented 2 months ago

Upstream issue: https://github.com/Unstructured-IO/unstructured/issues/3511

TheTechromancer commented 2 months ago

Resolved in https://github.com/blacklanternsecurity/bbot/pull/1669.

TheTechromancer commented 2 months ago

This error is still happening even after several updates from the unstructured team. We may need to temporarily disable this module until they get their shit together.
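
Short of disabling the module outright, one option would be to treat a missing NLTK resource as "no text extracted" for that file. The sketch below is only an illustration under that assumption: `extract_text_safely` is a hypothetical wrapper, not part of bbot, and the `extract` argument stands in for whatever function does the real extraction (`extract_text` in the traceback above).

```python
import logging
from typing import Callable, Optional

log = logging.getLogger(__name__)


def extract_text_safely(extract: Callable[[str], str], file_path: str) -> Optional[str]:
    """Run `extract` on a file, returning None if an NLTK-style LookupError occurs.

    Hypothetical wrapper: `extract` is any callable that takes a path and
    returns the extracted text.
    """
    try:
        return extract(file_path)
    except LookupError as e:
        # NLTK reports missing data (e.g. punkt_tab) as LookupError.
        # Log it and skip this file instead of failing the whole run.
        log.warning("Skipping %s, missing resource: %s", file_path, str(e).strip())
        return None
```

A caller would then skip any file for which the wrapper returns `None`. Other exception types still propagate, so unrelated failures are not hidden.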