Mintplex-Labs / anything-llm

The all-in-one Desktop & Docker AI application with built-in RAG, AI agents, and more.
https://anythingllm.com
MIT License

Collector failing with timeout error, cannot restart. #334

Closed Britman72 closed 10 months ago

Britman72 commented 11 months ago

I was trying to collect an entire website when it bailed near the end. Here's the log below. When I try to restart the menu, it also throws an error.

Working on https://www.nbtclothing.com...
[INFO] Starting Chromium download.
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████| 86.8M/86.8M [00:02<00:00, 35.7Mb/s]
[INFO] Beginning extraction
[INFO] Chromium extracted to: /Users/britman/Library/Application Support/pyppeteer/local-chromium/588429
[nltk_data] Downloading package punkt to /Users/britman/nltk_data...
[nltk_data] Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data] /Users/britman/nltk_data...
[nltk_data] Unzipping taggers/averaged_perceptron_tagger.zip.
Working on https://www.nbtclothing.com/collections/all...
Working on https://www.nbtclothing.com/account/login...
Working on https://www.nbtclothing.com/collections/best-sellers...
Working on https://www.nbtclothing.com/collections/best-sellers...
Working on https://www.nbtclothing.com/policies/shipping-policy...
Working on https://www.nbtclothing.com/collections/all...
Working on https://www.nbtclothing.com/pages/returns-exchanges...
Working on https://www.nbtclothing.com/products/nomad...
Traceback (most recent call last):
  File "/Users/britman/textrify/AnythingLLM/collector/main.py", line 84, in <module>
    main()
  File "/Users/britman/textrify/AnythingLLM/collector/main.py", line 58, in main
    crawler()
  File "/Users/britman/textrify/AnythingLLM/collector/scripts/link.py", line 97, in crawler
    parse_links(links)
  File "/Users/britman/textrify/AnythingLLM/collector/scripts/link.py", line 134, in parse_links
    req.html.render(timeout=10)
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/britman/textrify/AnythingLLM/collector/v-env/lib/python3.11/site-packages/requests_html.py", line 605, in render
    raise MaxRetries("Unable to render the page. Try increasing timeout")
requests_html.MaxRetries: Unable to render the page. Try increasing timeout

Restarting it:

(v-env) britman@Als-Office collector % python main.py
Traceback (most recent call last):
  File "/Users/britman/textrify/AnythingLLM/collector/main.py", line 4, in <module>
    from scripts.link import link, links, crawler
  File "/Users/britman/textrify/AnythingLLM/collector/scripts/link.py", line 3, in <module>
    from requests_html import HTMLSession
  File "/Users/britman/textrify/AnythingLLM/collector/v-env/lib/python3.11/site-packages/requests_html.py", line 489
    return self
IndentationError: unexpected indent
(v-env) britman@Als-Office collector % python main.py
Traceback (most recent call last):
  File "/Users/britman/textrify/AnythingLLM/collector/main.py", line 4, in <module>
    from scripts.link import link, links, crawler
  File "/Users/britman/textrify/AnythingLLM/collector/scripts/link.py", line 3, in <module>
    from requests_html import HTMLSession
  File "/Users/britman/textrify/AnythingLLM/collector/v-env/lib/python3.11/site-packages/requests_html.py", line 489
    return self
IndentationError: unexpected indent

timothycarambat commented 10 months ago

The restart error is bizarre, since it comes from a third-party dependency (requests_html.py) and not from code we wrote. Additionally, the original error came from the HTTP timeout, which again is not related to our code but to the website being scraped, and on a timeout it should abort.
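For readers who would rather have a crawl skip a page that fails to render than abort the whole run, here is a minimal sketch. It is not how the collector behaves and not code from this repository: `render_each`, its `render` callable, and the `skip_on` parameter are all hypothetical names introduced for illustration.

```python
from typing import Callable, Iterable, List, Tuple, Type


def render_each(
    urls: Iterable[str],
    render: Callable[[str], str],
    skip_on: Tuple[Type[BaseException], ...] = (Exception,),
) -> Tuple[List[Tuple[str, str]], List[Tuple[str, str]]]:
    """Call render(url) for every URL, recording failures instead of raising.

    Hypothetical helper: `render` is whatever function fetches and renders
    one page; `skip_on` lists the exception types treated as "skip this page".
    """
    rendered: List[Tuple[str, str]] = []
    skipped: List[Tuple[str, str]] = []
    for url in urls:
        try:
            rendered.append((url, render(url)))
        except skip_on as exc:
            # Keep the reason so skipped pages can be retried or reported later.
            skipped.append((url, str(exc)))
    return rendered, skipped
```

With requests_html, `render` could plausibly wrap `session.get(url)` followed by `response.html.render(timeout=...)` with a larger timeout than the 10 seconds shown in the traceback, and `skip_on` could be `(requests_html.MaxRetries,)`, the exception raised in the log above. That wiring is an assumption and is left out so the sketch runs without the library installed.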

Closing since this is not part of the code we wrote or maintain. I do notice you are using python3.11; can you create a new v-env with python3.9? This should not impact the requests library.
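Since the restart fails with an IndentationError inside an installed file, one way to check whether a given file parses at all, before or after recreating the v-env, is sketched below. This is only an illustrative check using the standard library; `syntax_problem` is a hypothetical helper name, and the path to pass in would be the requests_html.py path from the traceback.

```python
import ast
from pathlib import Path
from typing import Optional


def syntax_problem(path: str) -> Optional[str]:
    """Return a short description of the first syntax error in a Python
    source file, or None if the file parses cleanly.

    Hypothetical helper: it only parses the file and never imports or runs it.
    """
    source = Path(path).read_text(encoding="utf-8")
    try:
        ast.parse(source, filename=path)
    except SyntaxError as exc:
        # IndentationError is a subclass of SyntaxError, so it is caught here.
        return f"{type(exc).__name__}: {exc.msg} (line {exc.lineno})"
    return None
```

If this reports a problem for the file in site-packages, the installed copy itself does not parse under that interpreter, which points at the environment rather than at the collector scripts.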