garyfeng / DocsGPT

GPT-powered chat for documentation search & assistance.
https://docsgpt.arc53.com/
MIT License

Q: Indexing other docs #2

Open garyfeng opened 1 year ago

garyfeng commented 1 year ago

See https://github.com/arc53/DocsGPT/wiki/How-to-train-on-other-documentation for how to index many docs in batch.

garyfeng commented 1 year ago

I followed those instructions in a container created on Codespaces.

Workaround: run the ingestion elsewhere, for example locally, and upload the output files as directed.

The install runs out of disk space:

ERROR: Could not install packages due to an OSError: [Errno 28] No space left on device: '/usr/local/python/3.10.4/lib/python3.10/site-packages/torch/jit'

[notice] A new release of pip is available: 23.0.1 -> 23.1
[notice] To update, run: python -m pip install --upgrade pip
@garyfeng ➜ /workspaces/DocsGPT/scripts (main) $ ls
code_docs_gen.py  ingest.py  inputs  old  parser  requirements.txt
@garyfeng ➜ /workspaces/DocsGPT/scripts (main) $ df -H
Filesystem      Size  Used Avail Use% Mounted on
overlay          34G   33G     0 100% /
tmpfs            68M     0   68M   0% /dev
tmpfs           2.1G     0  2.1G   0% /sys/fs/cgroup
shm              68M  8.2k   68M   1% /dev/shm
/dev/sdb1        32G   23G  8.4G  74% /usr/sbin/docker-init
/dev/sda1        17G  349k   16G   1% /tmp
/dev/loop0       34G   33G     0 100% /workspaces
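
A quick pre-flight check could catch this before a long `pip install` starts. The following is only a sketch using the standard library; the helper name `has_free_space` and the 5 GB threshold are illustrative assumptions, not measured requirements for DocsGPT.

```python
import shutil


def has_free_space(path: str, required_bytes: int) -> bool:
    """Return True if the filesystem holding `path` has at least `required_bytes` free."""
    usage = shutil.disk_usage(path)
    return usage.free >= required_bytes


if __name__ == "__main__":
    # 5 GB is an illustrative threshold (assumption), not a measured requirement.
    required = 5 * 10**9
    if has_free_space(".", required):
        print("Enough free space to attempt the install.")
    else:
        print("Low on disk space; consider running the ingestion elsewhere.")
```

In the `df -H` output above, `/workspaces` shows 0 available, so a check like this would have reported low space before the install began.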

garyfeng commented 1 year ago

Try the IBIS Reference Library. Need to scrape the text.

garyfeng commented 1 year ago

Deleted the old codespace that was full. Created a new codespace called ingest_only from this repo, to be used only for ingestion. See the steps above for installation. pip install -r requirements.txt took a while, mostly to uninstall PyTorch 2.0 etc., but it completed without the disk-full error. After that, the following ran (ingestion took about 30 seconds) and the output was generated.

@garyfeng ➜ /workspaces/DocsGPT/scripts (main) $ python ingest.py ingest --recursive --formats .html
input_files
[]
Grouping small documents
Separating large documents
Number of Tokens = 13,903
Approx Cost = $0.01
Price Okay? (Y/N) 
y
Embedding 🦖: 100%|███████████████████████████████████████████████████████████████████████████████████| Time Left: 00:00
@garyfeng ➜ /workspaces/DocsGPT/scripts (main) $ ls outputs/inputs/
index.faiss  index.pkl
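
For reference, the cost line in the output above can be reproduced with simple arithmetic. This is a rough sketch: the rate of $0.0004 per 1,000 tokens is my assumption about the embedding price at the time, and `estimate_embedding_cost` is a hypothetical helper, not a function from ingest.py.

```python
def estimate_embedding_cost(num_tokens: int, usd_per_1k_tokens: float = 0.0004) -> float:
    """Estimate the embedding cost in USD for a given token count.

    The default rate is an assumed price per 1,000 tokens; adjust as needed.
    """
    return num_tokens / 1000 * usd_per_1k_tokens


cost = estimate_embedding_cost(13_903)
# Rounded to two decimals, matching how the prompt displays it.
print(f"Approx Cost = ${cost:,.2f}")
```

Under that assumed rate, 13,903 tokens come to roughly half a cent, which rounds to the $0.01 shown in the prompt.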

garyfeng commented 1 year ago

To test the newly ingested content:

  1. Create a new codespace, run_docsGPT.
  2. Copy the generated index.faiss and index.pkl from the ingest_only codespace to this one. I first downloaded the files from the ingest_only codespace to my local machine, then uploaded them to the run_docsGPT codespace, replacing the original files in the application folder.
  3. In the run_docsGPT codespace, create the .env file, then run the Docker-based test with docker-compose up. The build takes about 15 minutes. When I launched the app, it said something like it couldn't find the documents, which makes sense since I only gave it the index files, not the original text files.

I then tried copying the files to application/inputs; for some reason I had to use sudo to copy them. After that the app runs, but it loads what seems to be a buggy UI, with pandas as the preloaded documentation.
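
When both directories are on the same machine, the copy step could be scripted along these lines. This is a minimal sketch: `copy_index` is a hypothetical helper, and it only checks that the two index files exist before copying them; it does not validate their contents.

```python
import shutil
from pathlib import Path

# The two files produced by the ingestion step in this thread.
INDEX_FILES = ("index.faiss", "index.pkl")


def copy_index(src_dir: str, dst_dir: str) -> list[Path]:
    """Copy the index files from src_dir to dst_dir, replacing existing copies."""
    src, dst = Path(src_dir), Path(dst_dir)
    missing = [name for name in INDEX_FILES if not (src / name).is_file()]
    if missing:
        raise FileNotFoundError(f"Missing index files in {src}: {missing}")
    dst.mkdir(parents=True, exist_ok=True)
    return [Path(shutil.copy2(src / name, dst / name)) for name in INDEX_FILES]


# Example call, using the paths mentioned in this thread (not run here):
# copy_index("outputs/inputs", "application")
```

If the destination needs elevated permissions, as it did for me with sudo, the script would have to be run with the same privileges.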

garyfeng commented 1 year ago

See https://github.com/arc53/DocsGPT/issues/215. It is hard to avoid this error whenever I install in a codespace. It worked last week but no longer does. After a lot of trial and error, it looks like the port forwarding is confused. In the browser console, the "no default index" error appears at startup, when the frontend tries to call the API at localhost:5001; in a codespace this should have been mapped to a codespace URL.

Following the suggestion in that thread, I changed the docker-compose.yaml settings as below, and the "no default index" problem no longer appears. I can select the default index now, but the chat still does not work.

services:
  frontend:
    build: ./frontend
    environment:
      - VITE_API_HOST=https://garyfeng-upgraded-invention-xxxx-5001.preview.app.github.dev
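
The host value above follows a pattern visible in the URLs in this thread: the codespace name, a hyphen, the port, then the preview domain. A small sketch of that pattern, with the caveat that `forwarded_url` is a hypothetical helper and the domain is inferred only from these URLs and may differ in other setups:

```python
def forwarded_url(codespace_name: str, port: int,
                  domain: str = "preview.app.github.dev") -> str:
    """Build the forwarded URL for a port, following the pattern seen in this thread."""
    return f"https://{codespace_name}-{port}.{domain}"


# The API (port 5001) and the frontend (port 5173) get different origins.
print(forwarded_url("garyfeng-upgraded-invention-xxxx", 5001))
print(forwarded_url("garyfeng-upgraded-invention-xxxx", 5173))
```

Because the two ports map to two different hostnames, the browser treats calls from the frontend to the API as cross-origin requests.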

According to the console log, the problem is now a CORS error:

Access to XMLHttpRequest at 'https://garyfeng-upgraded-invention-xxxx-5001.preview.app.github.dev/api/upload' from origin 'https://garyfeng-upgraded-invention-xxxx-5173.preview.app.github.dev' has been blocked by CORS policy: Response to preflight request doesn't pass access control check: Redirect is not allowed for a preflight request.
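
To make the error message concrete, here is a deliberately simplified model of the browser's preflight check. It is a sketch, not the full rules: `preflight_passes` is a hypothetical helper that only looks at the status code and the Access-Control-Allow-Origin header, and it ignores credentials, allowed methods, and allowed headers.

```python
def preflight_passes(status: int, headers: dict[str, str], origin: str) -> bool:
    """Simplified check of whether a preflight (OPTIONS) response would be accepted."""
    # A preflight response must have a 2xx status; a redirect (3xx) is rejected
    # rather than followed, which is the failure reported in the error above.
    if not 200 <= status <= 299:
        return False
    # Header names are case-insensitive, so normalise them before the lookup.
    lowered = {key.lower(): value for key, value in headers.items()}
    allowed = lowered.get("access-control-allow-origin")
    return allowed in ("*", origin)


frontend = "https://garyfeng-upgraded-invention-xxxx-5173.preview.app.github.dev"
print(preflight_passes(302, {}, frontend))  # a redirect response fails
print(preflight_passes(204, {"Access-Control-Allow-Origin": frontend}, frontend))
```

In this model the redirect alone is enough to fail the check, regardless of which CORS headers the API itself would send.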

garyfeng commented 1 year ago

Giving up on Codespaces, I tried the Docker installation directly on a PC and a Mac. It worked in both cases without a glitch (well, on the Mac I had to change docker-compose.yaml to not map ./application to /app due to permission issues; instead we set up a separate volume for /app for persistence). So we now have a repeatable process for a clean installation.

Now we can go back to figuring out how to ingest multiple docs.

dartpain commented 1 year ago

Codespaces are quite useful for some people; I will try to solve this once I have more time.

dartpain commented 1 year ago

Thank you for reporting so thoroughly on the issue.