PromtEngineer / localGPT

Chat with your documents on your local device using GPT models. No data leaves your device and 100% private.
Apache License 2.0

run_localGPT_API.py resets my DB #311

Open Alio241 opened 1 year ago

Alio241 commented 1 year ago

Hi everyone, I have a small problem: when I launch run_localGPT_API, it resets my DB and seems to redo the ingestion.

I modified ingest.py so that it picks up all the files from a folder (in a different path than SOURCE_DOCUMENTS) and all of its subfolders, and it works when I launch run_localGPT.py.
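
Roughly, the change boils down to walking the folder recursively instead of listing a single directory; a simplified sketch of what I mean (MY_DOCS_ROOT is just a placeholder path, not something from the repo):

import os

MY_DOCS_ROOT = "/path/to/my/documents"  # placeholder for the external folder

def collect_file_paths(root_dir):
    # Walk root_dir and every subfolder, collecting full paths to all files.
    paths = []
    for dirpath, _dirnames, filenames in os.walk(root_dir):
        for name in filenames:
            paths.append(os.path.join(dirpath, name))
    return paths

# The resulting paths are then fed to the per-file loader that ingest.py already uses.
all_files = collect_file_paths(MY_DOCS_ROOT)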

Do you guys have any idea why it does that? It makes the UI unusable for now.

gerardorosiles commented 1 year ago

I ran into the same issue. It seems run_localGPT_API tries to run the ingest code whenever it is started, which does not make sense to me. I am working on this at the moment, but wanted to mention it as a pointer. I'll report back with my results.

creuzerm commented 1 year ago

Hrmm... I am re-ingesting right now. I was using the CLI for a few days without issue. Then a power outage triggered a computer reboot, and when I tried to use the UI the database seemed to have disappeared. I assumed it was connected to my reboot, as my conda venv also seemed to have emptied itself and needed to be repopulated from requirements.txt.

gerardorosiles commented 1 year ago

I can run ingest.py by itself and generate the index "off-line" with a set of about 200 PDF and DOCX documents. The index generated under PERSIST_DIRECTORY can be used with no problem with the CLI app run_localGPT.py.

In the case of run_localGPT_API.py there is code that runs the ingest.py script to create the ChromaDB index on the fly. This may be OK for the simple example that uses the constitution.pdf document, but it is not good for a large text corpus. So I removed this code in order to use my previously created index (under PERSIST_DIRECTORY) and then start the API.

When the ChromaDB object is instantiated, the contents of PERSIST_DIRECTORY are wiped out! Connecting with the UI and sending a question, I get an exception which basically boils down to this:

File "/opt/conda/envs/local-gpt/lib/python3.10/site-packages/chromadb/db/index/hnswlib.py", line 230, in get_nearest_neighbors
    raise NoIndexException(
chromadb.errors.NoIndexException: Index not found, please create an instance before querying
INFO:werkzeug:127.0.0.1 - - [04/Aug/2023 18:27:18] "POST /api/prompt_route HTTP/1.1" 500 -

I don't understand why Chroma picks up that the data exists when running the CLI but not the API. I tried different approaches to resolve the issue, but it is not working. A similar issue was reported in PrivateGPT, and someone forked the repo to use Weaviate instead. I can't find the link anymore.
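
For reference, the CLI side seems to load the persisted index roughly like this (a sketch based on my reading of run_localGPT.py; the constants are assumed to come from constants.py, adjust to your checkout):

from langchain.embeddings import HuggingFaceInstructEmbeddings
from langchain.vectorstores import Chroma

from constants import CHROMA_SETTINGS, EMBEDDING_MODEL_NAME, PERSIST_DIRECTORY

embeddings = HuggingFaceInstructEmbeddings(
    model_name=EMBEDDING_MODEL_NAME,
    model_kwargs={"device": "cuda"},  # or "cpu"
)

# No documents are passed in here: Chroma opens whatever is already persisted
# under PERSIST_DIRECTORY instead of re-ingesting anything.
db = Chroma(
    persist_directory=PERSIST_DIRECTORY,
    embedding_function=embeddings,
    client_settings=CHROMA_SETTINGS,
)
retriever = db.as_retriever()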

So the way to go seems to be ditching Chroma for some other vector store. I am trying OpenSearch at the moment.

gerardorosiles commented 1 year ago

Here is the link to a similar discussion: https://github.com/imartinez/privateGPT/issues/132

Since all the ChromaDB operations are hidden behind LangChain, I get the feeling this is a LangChain config issue or bug.

Alio241 commented 1 year ago

I believe I found the problem: in the run_localGPT_API file, at the very beginning, there is this line of code:

shutil.rmtree(PERSIST_DIRECTORY)

and it is this line of code that deletes the current DB.

malakhovks commented 1 year ago

> I believe I found the problem: in the run_localGPT_API file, at the very beginning, there is this line of code:
>
> shutil.rmtree(PERSIST_DIRECTORY)
>
> and it is this line of code that deletes the current DB.

Unfortunately, it doesn't help.

malakhovks commented 1 year ago

Dear @PromtEngineer, I am a researcher and backend developer from Ukraine (https://linktr.ee/malakhovks). Each time run_localGPT_API.py is run, shutil.rmtree(PERSIST_DIRECTORY) erases the directory with the already existing ChromaDB indexes, and the ingest.py script is launched to generate a new ChromaDB index. This really slows down working with a large corpus of texts. I removed the following code block from the run_localGPT_API.py file:

if os.path.exists(PERSIST_DIRECTORY):
    try:
        shutil.rmtree(PERSIST_DIRECTORY)
    except OSError as e:
        print(f"Error: {e.filename} - {e.strerror}.")
else:
    print("The directory does not exist")

run_langest_commands = ["python", "ingest.py"]
if DEVICE_TYPE == "cpu":
    run_langest_commands.append("--device_type")
    run_langest_commands.append(DEVICE_TYPE)

result = subprocess.run(run_langest_commands, capture_output=True)
if result.returncode != 0:
    raise FileNotFoundError(
        "No files were found inside SOURCE_DOCUMENTS, please put a starter file inside before starting the API!"
    )

This is the block responsible for erasing the old index and generating a new one. But in the end the already existing indexes are still not loaded, and I get the error chromadb.errors.NoIndexException: Index not found, please create an instance before querying when I POST to the api/prompt_route endpoint.

This does not happen when working with run_localGPT.py; the indexes are loaded.

As @gerardorosiles said: "it is not good for a large text corpus".

We must be missing something in the code.

I think we need to move the load function def load_model(device_type, model_id, model_basename=None) from run_localGPT.py into run_localGPT_API.py. I will try to do that.
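
One way to do that without duplicating code could be to import it; just a sketch, assuming run_localGPT.py keeps its CLI entry point behind if __name__ == "__main__" so importing it has no side effects:

from run_localGPT import load_model

# Placeholder values; in the real API file these would come from its existing config.
DEVICE_TYPE = "cuda"
MODEL_ID = "TheBloke/Llama-2-7B-Chat-GGML"          # example model id
MODEL_BASENAME = "llama-2-7b-chat.ggmlv3.q4_0.bin"  # example basename

llm = load_model(device_type=DEVICE_TYPE, model_id=MODEL_ID, model_basename=MODEL_BASENAME)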

PromtEngineer commented 1 year ago

@malakhovks @Alio241 I am looking into it. One use case for this behavior is that if you are running this as a service, you want to use only the vector store created from the file that is currently being uploaded, rather than whatever might already be in the DB. But we can modify this if that's not the expected behavior.
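
For example, the rebuild-on-start could be made opt-in with an environment flag; this is just a sketch, and the flag name is made up:

import os
import shutil

from constants import PERSIST_DIRECTORY

# Hypothetical flag: keep the current "wipe and re-ingest on every start"
# behavior for the service use case, but let users opt out and reuse an
# existing index instead.
WIPE_ON_START = os.environ.get("LOCALGPT_WIPE_ON_START", "true").lower() == "true"

if WIPE_ON_START and os.path.exists(PERSIST_DIRECTORY):
    shutil.rmtree(PERSIST_DIRECTORY)
    # ... followed by the existing subprocess call to ingest.py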

malakhovks commented 1 year ago

> @malakhovks @Alio241 I am looking into it. One use case for this behavior is that if you are running this as a service, you want to use only the vector store created from the file that is currently being uploaded, rather than whatever might already be in the DB. But we can modify this if that's not the expected behavior.

We need the possibility to load an already existing ChromaDB index.

malakhovks commented 1 year ago

Dear @PromtEngineer, @gerardorosiles, @Alio241, @creuzerm,

So, I've done some analysis and testing.

First, if we work with a large dataset (a corpus of texts in PDF etc.), it is better to build the Chroma DB index separately using the ingest.py script. So the procedure for creating an index at startup is not needed in the run_localGPT_API.py script; remove it. In the ingest.py script I added a check for an already generated index and its deletion (if we build a new index, we don't need the old one). The model loading function def load_model(device_type, model_id, model_basename=None) has been moved into run_localGPT_API.py.
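
Simplified, the check added to ingest.py boils down to something like this (a rough sketch; the exact version is in the attachment below):

import os
import shutil

from constants import PERSIST_DIRECTORY

# If an old index already exists under PERSIST_DIRECTORY, delete it before
# building the new one.
if os.path.isdir(PERSIST_DIRECTORY):
    print(f"Removing old index at {PERSIST_DIRECTORY}")
    shutil.rmtree(PERSIST_DIRECTORY)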

The pipeline for a large dataset:

  1. Upload the data manually (e.g. via sftp) into the SOURCE_DOCUMENTS folder
  2. Run ingest.py to build the new Chroma DB index
  3. Run run_localGPT_API.py
  4. POST your query to /api/prompt_route (a quick test is sketched after this list)
  5. Profit: after a restart of run_localGPT_API.py, it will pick up the already built index
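
As a quick test of step 4 from Python (the form field name user_prompt and the default port 5110 are assumptions based on my copy of run_localGPT_API.py; check the route handler if yours differs):

import requests

resp = requests.post(
    "http://localhost:5110/api/prompt_route",
    data={"user_prompt": "What does the corpus say about indexing?"},
)
print(resp.status_code)
print(resp.json())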

You can find the edited versions of run_localGPT_API.py and ingest.py in the updates.zip attachment.

P.S.: I will soon make a docker compose setup for this use case and add some small but important fixes (uwsgi, nginx as a reverse proxy, etc.).

updates.zip

ww2283 commented 11 months ago

> @malakhovks @Alio241 I am looking into it. One use case for this behavior is that if you are running this as a service, you want to use only the vector store created from the file that is currently being uploaded, rather than whatever might already be in the DB. But we can modify this if that's not the expected behavior.

Whatever the behavior is, don't delete the DB folder in the first place... For now, it would probably be a good idea to put a warning about this outcome for users trying to run run_localGPT_API.py. I wish I had read the code before I ran it...