bibi-b closed this issue 1 year ago.
It seems the website's load time is causing a timeout. That obviously should not abort the entire process, but it appears to be the issue.
On another URL I received:

```
Paste in the URL of an online article or blog: http://get-nord.de
Traceback (most recent call last):
  File "/app/collector/main.py", line 84, in <module>
    main()
  File "/app/collector/main.py", line 52, in main
    link()
  File "/app/collector/scripts/link.py", line 44, in link
    os.makedirs(output_path)
  File "/usr/lib/python3.10/os.py", line 225, in makedirs
    mkdir(name, mode)
PermissionError: [Errno 13] Permission denied: './outputs/website-logs'
```

So I guess it is not only the load time causing that.
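As an aside, the crash at `os.makedirs(output_path)` could fail more gracefully; a minimal sketch of guarding that call (`ensure_output_dir` is a hypothetical helper, not existing project code):

```python
import os


def ensure_output_dir(output_path: str) -> None:
    """Create the output directory if missing.

    exist_ok=True makes repeated calls safe, and a PermissionError is turned
    into a clear message instead of a bare traceback aborting the whole run.
    """
    try:
        os.makedirs(output_path, exist_ok=True)
    except PermissionError as exc:
        raise SystemExit(
            f"Cannot create '{output_path}': {exc}. "
            "Check that the collector's user owns the outputs directory."
        )
```

This would surface the ownership problem directly rather than as an unhandled exception.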
Is this running in docker or on the host machine? Also, are you running this on Windows? A permission error would indicate that the Python script does not have access to the output folder it writes content to.
I am running this with docker on a linux server system.
I know that it worked one or two releases before... Has something changed with the permissions within docker?
No, the docker permissions and setup have remained relatively unchanged since launch, certainly nothing permissions-related. It may help to just `chmod -R 777` the output directory in the docker container. The app in docker should be executing as `anythingllm`, and the `root` user is the original file owner.
You can always run these commands to re-own everything.
```shell
docker container exec -u 0 -t <CONTAINER_ID> mkdir -p /app/server/storage /app/server/storage/documents /app/server/storage/vector-cache /app/server/storage/lancedb
docker container exec -u 0 -t <CONTAINER_ID> touch /app/server/storage/anythingllm.db
docker container exec -u 0 -t <CONTAINER_ID> chown -R anythingllm:anythingllm /app/collector /app/server
```
This will force-create the folders in the instance, make sure the DB is writable, and re-own the entire repo just to be sure it can all execute.
Thanks.
Running

```shell
docker container exec -u 0 -t 949d6a4f67ee chown -R anythingllm:anythingllm /app/collector /app/server
```

left the system hanging, and after 5 minutes I decided to hit CTRL-C.
I logged into the container with `docker exec -it <docker-id> bash` and ran `ls -la`:
```
drwxr-xr-x 1 anythingllm anythingllm 4096 Aug  8 09:37 .
drwxr-x--- 1 anythingllm anythingllm 4096 Aug  8 09:44 ..
-rw-rw-r-- 1 root        root         117 Aug  8 09:34 .gitignore
-rw-rw-r-- 1 root        root        3170 Aug  8 09:34 README.md
drwxr-xr-x 2 anythingllm anythingllm 4096 Aug  8 09:37 __pycache__
-rw-rw-r-- 1 root        root         820 Aug  8 09:34 api.py
drwxrwxr-x 2 anythingllm anythingllm 4096 Aug  8 09:34 hotdir
-rw-rw-r-- 1 root        root        2513 Aug  8 09:34 main.py
drwxr-xr-x 2 root        root        4096 Aug  8 09:37 outputs
-rw-rw-r-- 1 root        root        1927 Aug  8 09:34 requirements.txt
drwxrwxr-x 3 root        root        4096 Aug  8 09:34 scripts
drwxr-xr-x 1 anythingllm anythingllm 4096 Aug  8 09:36 v-env
-rw-rw-r-- 1 root        root         588 Aug  8 09:34 watch.py
-rw-rw-r-- 1 root        root          70 Aug  8 09:34 wsgi.py
```
Looks right, doesn't it?
That does not look right: the entire collector and server directories should be owned by `anythingllm:anythingllm`. Since much of it is owned by `root`, I think there are issues during execution. The app executes as the `anythingllm` user/group, so trying to read/write/execute root-owned files isn't going to work.
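To confirm that diagnosis from inside the container, ownership and writability can be checked with a few lines of Python (a sketch; `report_ownership` is a hypothetical helper, and the path is the collector's outputs directory from the traceback above):

```python
import os
import pwd


def report_ownership(path: str) -> str:
    """Describe who owns a path and whether the current user may write to it."""
    st = os.stat(path)
    owner = pwd.getpwuid(st.st_uid).pw_name
    writable = os.access(path, os.W_OK)
    return f"{path}: owner={owner} writable={writable}"


# e.g. inside the container:
#   print(report_ownership("/app/collector/outputs"))
```

If this reports `owner=root writable=False` for the `anythingllm` user, the `chown -R` fix above is the right remedy.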
The pull request #182 should solve the permission problems.
The fix in #182 does not solve the problem of large PDFs. Loading a large book (36 MB, 1400 pages) leads to:

```
anything-llm | PayloadTooLargeError: request entity too large
anything-llm |     at readStream (/app/server/node_modules/raw-body/index.js:163:17)
anything-llm |     at getRawBody (/app/server/node_modules/raw-body/index.js:116:12)
anything-llm |     at read (/app/server/node_modules/body-parser/lib/read.js:79:3)
anything-llm |     at textParser (/app/server/node_modules/body-parser/lib/types/text.js:86:5)
anything-llm |     at Layer.handle [as handle_request] (/app/server/node_modules/express/lib/router/layer.js:95:5)
anything-llm |     at trim_prefix (/app/server/node_modules/express/lib/router/index.js:328:13)
anything-llm |     at /app/server/node_modules/express/lib/router/index.js:286:9
anything-llm |     at Function.process_params (/app/server/node_modules/express/lib/router/index.js:346:12)
anything-llm |     at next (/app/server/node_modules/express/lib/router/index.js:280:10)
anything-llm |     at cors (/app/server/node_modules/cors/lib/index.js:188:7)
```
Fixed by c283ae33a3038b98cf05ea483c5663a233f1987b and subsequent fixes, which ensure large payload bodies are allowed and timeouts are extended for the document processor.
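For readers hitting the same error on older builds: `PayloadTooLargeError` comes from body-parser's default request-size limit in the Express server. A hedged sketch of the kind of change involved, assuming a standard Express setup (the `"3gb"` value is illustrative, not necessarily what the actual commit uses):

```javascript
const express = require("express");
const app = express();

// Raise body-parser's default limits so large documents (e.g. a 36 MB PDF
// expanded to text) do not trigger PayloadTooLargeError. The stack trace
// above shows the text parser rejecting the body, so both parsers get a limit.
app.use(express.text({ limit: "3gb" }));
app.use(express.json({ limit: "3gb" }));
```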
It worked before, but after a completely new installation the dockerized repo shows the following:
```
? What kind of data would you like to add to convert into long-term memory? Article or Blog Link(s)
? Do you want to scrape a single article/blog/url or many at once? Single URL
[NOTICE]: The first time running this process it will download supporting libraries.

Paste in the URL of an online article or blog: https://www.voigtdental.de
[INFO] Starting Chromium download.
100%|████████████████████████| 109M/109M [00:03<00:00, 34.1Mb/s]
[INFO] Beginning extraction
[INFO] Chromium extracted to: /app/.local/share/pyppeteer/local-chromium/588429
Traceback (most recent call last):
  File "/app/collector/main.py", line 84, in <module>
    main()
  File "/app/collector/main.py", line 52, in main
    link()
  File "/app/collector/scripts/link.py", line 24, in link
    req.html.render()
  File "/app/collector/v-env/lib/python3.10/site-packages/requests_html.py", line 598, in render
    content, result, page = self.session.loop.run_until_complete(self._async_render(url=self.url, script=script, sleep=sleep, wait=wait, content=self.html, reload=reload, scrolldown=scrolldown, timeout=timeout, keep_page=keep_page))
  File "/usr/lib/python3.10/asyncio/base_events.py", line 649, in run_until_complete
    return future.result()
  File "/app/collector/v-env/lib/python3.10/site-packages/requests_html.py", line 512, in _async_render
    await page.goto(url, options={'timeout': int(timeout * 1000)})
  File "/app/collector/v-env/lib/python3.10/site-packages/pyppeteer/page.py", line 837, in goto
    raise error
pyppeteer.errors.TimeoutError: Navigation Timeout Exceeded: 8000 ms exceeded.
```
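The 8000 ms in that error is the default `timeout=8.0` of requests_html's `render()`; one workaround for slow sites is to pass a larger timeout and retry on failure. A generic retry helper in pure Python (`render_with_retries` is a hypothetical name, not part of the project):

```python
def render_with_retries(render_fn, attempts=3, timeout=8.0, backoff=2.0):
    """Call render_fn(timeout=...) up to `attempts` times, growing the timeout
    geometrically so slow pages eventually get enough time to load."""
    last_exc = None
    for i in range(attempts):
        try:
            return render_fn(timeout=timeout * (backoff ** i))
        except Exception as exc:  # e.g. pyppeteer.errors.TimeoutError
            last_exc = exc
    raise last_exc
```

With requests_html it could be used as `render_with_retries(lambda timeout: req.html.render(timeout=timeout))`, giving the page 8, then 16, then 32 seconds before giving up.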