Open garyfeng opened 1 year ago
Following the instructions to do so, on a container created on codespace.
Used a codespace instance running VS Code in the browser.
In docs/source we find a bunch of .rst files. Chose pyro because it is not well known.
Copied them to inputs under /scripts, as directed.
The /scripts/inputs folder is in .gitignore, so it's not uploaded to git.
Copied the /.env file to /scripts, as required.
Ran pip install -r requirements.txt in a terminal under /scripts.
requirements.txt asks for a different version of pytorch and many other things, so it may mess up your main env. Workaround: run this elsewhere, say locally, and upload the output files as directed.
It runs out of space:
ERROR: Could not install packages due to an OSError: [Errno 28] No space left on device: '/usr/local/python/3.10.4/lib/python3.10/site-packages/torch/jit'
[notice] A new release of pip is available: 23.0.1 -> 23.1
[notice] To update, run: python -m pip install --upgrade pip
@garyfeng ➜ /workspaces/DocsGPT/scripts (main) $ ls
code_docs_gen.py ingest.py inputs old parser requirements.txt
@garyfeng ➜ /workspaces/DocsGPT/scripts (main) $ df -H
Filesystem      Size  Used Avail Use% Mounted on
overlay          34G   33G     0 100% /
tmpfs            68M     0   68M   0% /dev
tmpfs           2.1G     0  2.1G   0% /sys/fs/cgroup
shm              68M  8.2k   68M   1% /dev/shm
/dev/sdb1        32G   23G  8.4G  74% /usr/sbin/docker-init
/dev/sda1        17G  349k   16G   1% /tmp
/dev/loop0       34G   33G     0 100% /workspaces
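A quick check before running pip install could have caught this earlier. The sketch below is only an illustration: the helper names and the 5 GB threshold are my assumptions, not anything from the DocsGPT repo, and the real footprint of the requirements may differ.

```python
import shutil


def free_gb(path="/"):
    """Free space on the filesystem holding `path`, in decimal GB (same units as df -H)."""
    return shutil.disk_usage(path).free / 1e9


def enough_space(path="/", needed_gb=5.0):
    """True if at least `needed_gb` is free. The 5 GB default is a guess, not a measured value."""
    return free_gb(path) >= needed_gb
```

For example, calling enough_space("/workspaces") on the full codespace above would return False, since df reports 0 available there.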
Try IBIS Reference Library. Need to scrape the text
Deleted the old codespace that was full. Created a new codespace called ingest_only from this repo, to be used only for ingestion. See the above steps for installation. pip install -r requirements.txt took a while, mostly to uninstall pytorch 2.0 etc., but it was able to complete without the disk full error. After that, the following ran (ingestion took about 30 secs), and the output was generated.
@garyfeng ➜ /workspaces/DocsGPT/scripts (main) $ python ingest.py ingest --recursive --formats .html
input_files
[]
Grouping small documents
Separating large documents
Number of Tokens = 13,903
Approx Cost = $0.01
Price Okay? (Y/N)
y
Embedding 🦖: 100%|███████████████████████████████████████████████████████████████████████████████████| Time Left: 00:00
@garyfeng ➜ /workspaces/DocsGPT/scripts (main) $ ls outputs/inputs/
index.faiss index.pkl
To test the newly ingested content, I used a separate codespace, run_docsGPT, and moved index.faiss and index.pkl from the ingest_only space to this space. I did the following: first download these files from the ingest_only codespace to localhost, then upload the files to the run_docsGPT space, replacing the original files in the application folder. In the run_docsGPT space, create the .env file, and then do the docker-based test using docker-compose up. This will take about 15 minutes to build.
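Once both folders are on the same machine, the replace step can be sketched roughly as below. This is just an illustration: copy_index is a hypothetical helper, not part of DocsGPT, and it only covers the local copy, not the download/upload between codespaces.

```python
import shutil
from pathlib import Path

# The two files that ingest.py wrote to outputs/inputs/ in the run above.
INDEX_FILES = ("index.faiss", "index.pkl")


def copy_index(src_dir, dst_dir):
    """Copy both index files from src_dir to dst_dir, overwriting existing ones.

    Raises FileNotFoundError before copying anything if either file is missing,
    so the destination never ends up with a mismatched pair.
    """
    src, dst = Path(src_dir), Path(dst_dir)
    missing = [name for name in INDEX_FILES if not (src / name).is_file()]
    if missing:
        raise FileNotFoundError(f"missing in {src}: {', '.join(missing)}")
    dst.mkdir(parents=True, exist_ok=True)
    return [Path(shutil.copy2(src / name, dst / name)) for name in INDEX_FILES]
```

Usage would be something like copy_index("scripts/outputs/inputs", "application"), with both paths adjusted to wherever the files actually live.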
When I launch the app, it says something like it can't find the documents, which makes sense since I only gave it the index files, not the original text files. I tried to see if I could copy the files to application/inputs. I had to use sudo to copy those files over for some reason. The app then runs, but loads what seems to be a buggy UI, with pandas being the preloaded documents.
See https://github.com/arc53/DocsGPT/issues/215. It is hard to avoid this error any time I install using codespace. It used to run last week but no longer does. After a lot of trial and error, it looks like the port forwarding is confused. Looking at the browser console, you get the "no default index" error at startup when the frontend is trying to call the API at localhost:5001, when in codespace this should have been mapped to a codespace URL.
Following the suggestion in the above thread, I changed the docker-compose.yaml setting as below, and the no default index problem does not show. I could select the default index now. But then the chat does not work.
services:
  frontend:
    build: ./frontend
    environment:
      - VITE_API_HOST=https://garyfeng-upgraded-invention-xxxx-5001.preview.app.github.dev
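The forwarded URL follows a visible pattern (codespace name, port, domain), so it could be built rather than hard-coded. A minimal sketch, assuming the CODESPACE_NAME and GITHUB_CODESPACES_PORT_FORWARDING_DOMAIN environment variables are set inside the codespace; forwarded_url is a hypothetical helper, and the fallback domain is simply the one seen in the URL above.

```python
import os


def forwarded_url(port, env=os.environ):
    """Return the base URL for `port`: the codespace URL if in a codespace, else localhost.

    Assumes Codespaces exposes CODESPACE_NAME and
    GITHUB_CODESPACES_PORT_FORWARDING_DOMAIN; verify both in your own terminal.
    """
    name = env.get("CODESPACE_NAME")
    if not name:
        # Not in a codespace (or the variable is absent): plain local address.
        return f"http://localhost:{port}"
    domain = env.get("GITHUB_CODESPACES_PORT_FORWARDING_DOMAIN", "preview.app.github.dev")
    return f"https://{name}-{port}.{domain}"
```

The result of forwarded_url(5001) is what would go into VITE_API_HOST; outside a codespace it falls back to the localhost:5001 address the frontend was already calling.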
Looking into the console log, it is now a CORS issue.
Access to XMLHttpRequest at 'https://garyfeng-upgraded-invention-xxxx-5001.preview.app.github.dev/api/upload' from origin 'https://garyfeng-upgraded-invention-xxxx-5173.preview.app.github.dev' has been blocked by CORS policy: Response to preflight request doesn't pass access control check: Redirect is not allowed for a preflight request.
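To make the error above concrete: the browser sends an OPTIONS preflight first, and it only proceeds if that response has a 2xx status and allows the calling origin. A redirect (3xx) fails outright. The sketch below is a deliberately simplified model of that rule (it ignores the allowed methods, headers, and credentials checks), and preflight_passes is a hypothetical name. Why the forwarded port answers with a redirect is not established in this thread; one unconfirmed guess is that the port is not public and redirects to a sign-in page.

```python
def preflight_passes(status, headers, origin):
    """Simplified CORS preflight check: 2xx status and a matching allow-origin header."""
    if not 200 <= status <= 299:
        # Redirects and errors fail the preflight; the browser does not follow them.
        return False
    # Header names are case-insensitive, so normalize before looking up.
    lowered = {key.lower(): value for key, value in headers.items()}
    allowed = lowered.get("access-control-allow-origin")
    return allowed in ("*", origin)
```

With the frontend origin from the log, a 302 response fails regardless of its headers, while a 204 carrying Access-Control-Allow-Origin set to that origin passes.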
Giving up on codespace, I tried the docker installation on a PC and a Mac directly. It worked in both cases without a glitch (well, in the Mac case I had to change the docker-compose.yaml to not map ./application to /app due to permission issues; instead we set up a separate volume for /app for persistence). So we now have a repeatable process to get a clean installation.
Now we can go back to figuring out how to do multiple doc ingestion.
Codespaces are quite useful for some people; I will try to solve this once I get more time.
Thank you for reporting thoroughly on the issue.
See https://github.com/arc53/DocsGPT/wiki/How-to-train-on-other-documentation for how to index many docs in batch.