URL scraper ends with error

bibi-b commented 1 year ago

This worked before, but after a completely fresh installation the dockerized repo shows the following:

? What kind of data would you like to add to convert into long-term memory? Article or Blog Link(s)
? Do you want to scrape a single article/blog/url or many at once? Single URL
[NOTICE]: The first time running this process it will download supporting libraries.

Paste in the URL of an online article or blog: https://www.voigtdental.de
[INFO] Starting Chromium download.
100%|██████████| 109M/109M [00:03<00:00, 34.1Mb/s]
[INFO] Beginning extraction
[INFO] Chromium extracted to: /app/.local/share/pyppeteer/local-chromium/588429
Traceback (most recent call last):
  File "/app/collector/main.py", line 84, in <module>
    main()
  File "/app/collector/main.py", line 52, in main
    link()
  File "/app/collector/scripts/link.py", line 24, in link
    req.html.render()
  File "/app/collector/v-env/lib/python3.10/site-packages/requests_html.py", line 598, in render
    content, result, page = self.session.loop.run_until_complete(self._async_render(url=self.url, script=script, sleep=sleep, wait=wait, content=self.html, reload=reload, scrolldown=scrolldown, timeout=timeout, keep_page=keep_page))
  File "/usr/lib/python3.10/asyncio/base_events.py", line 649, in run_until_complete
    return future.result()
  File "/app/collector/v-env/lib/python3.10/site-packages/requests_html.py", line 512, in _async_render
    await page.goto(url, options={'timeout': int(timeout * 1000)})
  File "/app/collector/v-env/lib/python3.10/site-packages/pyppeteer/page.py", line 837, in goto
    raise error
pyppeteer.errors.TimeoutError: Navigation Timeout Exceeded: 8000 ms exceeded.

timothycarambat commented 1 year ago

Seems like it is the load time of the website causing a timeout. Obviously should not abort the entire process, but that seems to be the issue.
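As a rough illustration of "should not abort the entire process", the render call could be wrapped so that a timeout is reported and the URL skipped instead of crashing the collector. This is only a sketch: `render_or_skip` is a hypothetical helper, not part of the repo, and the 20-second default is an assumed value. It relies on `render()` in requests_html accepting a `timeout` keyword, as the traceback above shows.

```python
# Sketch only: `render_or_skip` is a hypothetical helper, not code from the repo.
try:
    # The exception type named in the traceback above.
    from pyppeteer.errors import TimeoutError as RenderTimeout
except ImportError:
    # Fallback so the sketch still runs where pyppeteer is not installed.
    RenderTimeout = TimeoutError


def render_or_skip(render, timeout=20.0):
    """Call render(timeout=...) and report a timeout instead of raising.

    `render` is any callable accepting a `timeout` keyword in seconds,
    for example `req.html.render` from requests_html.
    Returns True if rendering finished, False if it timed out.
    """
    try:
        render(timeout=timeout)
    except RenderTimeout as exc:
        print(f"[WARN] Page render timed out after {timeout}s, skipping: {exc}")
        return False
    return True
```

In link.py this might be used as `if not render_or_skip(req.html.render): return`, so a slow site ends the current scrape cleanly rather than with a traceback.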

bibi-b commented 1 year ago

On another URL I received:

Paste in the URL of an online article or blog: http://get-nord.de
Traceback (most recent call last):
  File "/app/collector/main.py", line 84, in <module>
    main()
  File "/app/collector/main.py", line 52, in main
    link()
  File "/app/collector/scripts/link.py", line 44, in link
    os.makedirs(output_path)
  File "/usr/lib/python3.10/os.py", line 225, in makedirs
    mkdir(name, mode)
PermissionError: [Errno 13] Permission denied: './outputs/website-logs'

So I guess it is not only the load time causing that.

timothycarambat commented 1 year ago

Is this running in docker or on the host machine? Also are you running this on windows? Permission error would indicate that the python script does not have access to the output folder that it writes content to.
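To make that kind of failure easier to diagnose, the output folder could be checked before the scraper writes to it. The following is a minimal sketch, assuming a hypothetical helper named `ensure_output_dir`; the path in the usage note is the one from the traceback, and nothing here is taken from the actual collector code.

```python
import os


def ensure_output_dir(path):
    """Create `path` if needed and report whether this process can write to it.

    Hypothetical helper for illustration. Returns True when the directory
    exists and is writable, False when creating it is not permitted.
    """
    try:
        # exist_ok=True avoids an error when the folder is already there.
        os.makedirs(path, exist_ok=True)
    except PermissionError as exc:
        print(f"[ERROR] Cannot create {path!r}: {exc}")
        return False
    # W_OK: may create files inside; X_OK: may enter the directory.
    return os.access(path, os.W_OK | os.X_OK)
```

Calling `ensure_output_dir('./outputs/website-logs')` before scraping would return False in the situation above instead of raising from `os.makedirs`.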

bibi-b commented 1 year ago

I am running this with docker on a linux server system.

I know that it worked one or two releases ago... Has something changed with the permissions within docker?

timothycarambat commented 1 year ago

No, the docker permissions and setup have remained relatively unchanged since launch, and certainly nothing permissions-related has changed. It may help to just chmod -R 777 the output directory in the docker container.

The app in docker should be executing as anythingllm and the root user is the original file owner.

You can always run these commands to re-own everything:

docker container exec -u 0 -t <CONTAINER_ID> mkdir -p /app/server/storage /app/server/storage/documents /app/server/storage/vector-cache /app/server/storage/lancedb
docker container exec -u 0 -t <CONTAINER_ID> touch /app/server/storage/anythingllm.db
docker container exec -u 0 -t <CONTAINER_ID> chown -R anythingllm:anythingllm /app/collector /app/server

These will force-create the folders in the instance, make sure the DB is writable, and re-own the entire repo just to be sure it can all execute.

bibi-b commented 1 year ago

Thanks.

Running

docker container exec -u 0 -t 949d6a4f67ee chown -R anythingllm:anythingllm /app/collector /app/server

left the system hanging, and after 5 minutes I decided to hit CTRL-C.

I logged into the container with docker exec -it <docker-id> bash and ran ls -la:

drwxr-xr-x 1 anythingllm anythingllm 4096 Aug 8 09:37 .
drwxr-x--- 1 anythingllm anythingllm 4096 Aug 8 09:44 ..
-rw-rw-r-- 1 root        root         117 Aug 8 09:34 .gitignore
-rw-rw-r-- 1 root        root        3170 Aug 8 09:34 README.md
drwxr-xr-x 2 anythingllm anythingllm 4096 Aug 8 09:37 __pycache__
-rw-rw-r-- 1 root        root         820 Aug 8 09:34 api.py
drwxrwxr-x 2 anythingllm anythingllm 4096 Aug 8 09:34 hotdir
-rw-rw-r-- 1 root        root        2513 Aug 8 09:34 main.py
drwxr-xr-x 2 root        root        4096 Aug 8 09:37 outputs
-rw-rw-r-- 1 root        root        1927 Aug 8 09:34 requirements.txt
drwxrwxr-x 3 root        root        4096 Aug 8 09:34 scripts
drwxr-xr-x 1 anythingllm anythingllm 4096 Aug 8 09:36 v-env
-rw-rw-r-- 1 root        root         588 Aug 8 09:34 watch.py
-rw-rw-r-- 1 root        root          70 Aug 8 09:34 wsgi.py

Looks right, doesn't it?

timothycarambat commented 1 year ago

It does not. The entire collector and server directories should be owned by anythingllm:anythingllm. Since parts of them are owned by root, I think there are issues during execution: the app executes as the anythingllm user and group, so trying to read/write/execute root-owned files isn't going to work.
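A quick way to spot such mismatches without reading ls output by eye is to list every entry under a directory whose owner differs from the expected user. The sketch below assumes a hypothetical helper `find_foreign_owned`; it is an illustration, not something shipped with the project.

```python
import os


def find_foreign_owned(root, uid):
    """Return sorted paths under `root` (including `root`) not owned by `uid`.

    Hypothetical helper for illustration. Uses lstat so symlinks are checked
    themselves rather than followed.
    """
    candidates = [root]
    for dirpath, dirnames, filenames in os.walk(root):
        # Collect every sub-directory and file below the current directory.
        for name in dirnames + filenames:
            candidates.append(os.path.join(dirpath, name))
    return sorted(p for p in candidates if os.lstat(p).st_uid != uid)
```

Inside the container, the expected uid could be looked up with `pwd.getpwnam("anythingllm").pw_uid` from the standard library and passed in together with `/app/collector`; for the listing above, the result would include root-owned entries such as `outputs` and `scripts`.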

efocht commented 1 year ago

The pull request #182 should solve the permission problems.

efocht commented 1 year ago

The fix in #182 does not solve the problem of large PDFs. Loading a large book (36MB, 1400 pages) leads to

anything-llm    | PayloadTooLargeError: request entity too large
anything-llm    |     at readStream (/app/server/node_modules/raw-body/index.js:163:17)
anything-llm    |     at getRawBody (/app/server/node_modules/raw-body/index.js:116:12)
anything-llm    |     at read (/app/server/node_modules/body-parser/lib/read.js:79:3)
anything-llm    |     at textParser (/app/server/node_modules/body-parser/lib/types/text.js:86:5)
anything-llm    |     at Layer.handle [as handle_request] (/app/server/node_modules/express/lib/router/layer.js:95:5)
anything-llm    |     at trim_prefix (/app/server/node_modules/express/lib/router/index.js:328:13)
anything-llm    |     at /app/server/node_modules/express/lib/router/index.js:286:9
anything-llm    |     at Function.process_params (/app/server/node_modules/express/lib/router/index.js:346:12)
anything-llm    |     at next (/app/server/node_modules/express/lib/router/index.js:280:10)
anything-llm    |     at cors (/app/server/node_modules/cors/lib/index.js:188:7)

timothycarambat commented 1 year ago

Fixed by c283ae33a3038b98cf05ea483c5663a233f1987b and subsequent fixes to ensure large payload bodies are allowed and timeouts are extended for the document processor.

Mintplex-Labs / anything-llm

URL scraper ends with error #173