chroma-core / chroma

the AI-native open-source embedding database
https://www.trychroma.com/
Apache License 2.0
13.36k stars 1.14k forks source link

[Bug]: kernel die when training vanna ai (with chromadb) with more than 90 vectors. #2405

Open mkhansa opened 6 days ago

mkhansa commented 6 days ago

What happened?

when adding training data (350 vector) to vanna ai model, it is consuming 100%+ of the cpu (12th Gen Intel(R) i7 - 12650H) and 32 GB RAM and the kernel will "die". i need to decrease the number of vectors to 70-80 only.

python code:

class MyVanna(ChromaDB_VectorStore, GoogleGeminiChat): def init(self, config=None): ChromaDB_VectorStore.init(self, config={ "path": "../path/VannaAI_path" }) GoogleGeminiChat.init(self, config={'api_key': 'XXXXXX, "temperature":0, 'model': "gemini-1.5-pro"})

vn = MyVanna() .....

with open('../training_data/doc_training_data.json', 'r') as f: documentation_list = json.load(f)["documentation"]

for rule in documentation_list: print(rule) vn.train(documentation=rule) (here, the notebook crashed)

Versions

chromadb==0.5.3 , Python 3.12.2, windows

Relevant log output

No response

tazarov commented 5 days ago

@mkhansa, I'm not familiar with Vanna AI and the problem they are solving. At a glance, it seems it is a RAG application aimed at answering SQL-related questions. Their train workflow seems to be using an LLM to create embeddings from docs, schemas, DDLs etc. Their use of Chroma is also quite straightforward. Without a deeper understanding of what their training workflow does beyond adding embeddings for the documentation in Chroma, I cannot say what could be causing this issue.

To test further, can I ask you to run Chroma in a separate instance e.g. docker or CLI, and then create an HttpClient and pass that as configuration in the Vanna vector store: https://github.com/vanna-ai/vanna/blob/8cc20fbd22d73dd0321cc7464860c0f15080f3ad/src/vanna/chromadb/chromadb_vector.py#L23

Then run your workbook as above and check your processes to see which one consumes the 100% CPU.