chroma-core / chroma

the AI-native open-source embedding database
https://www.trychroma.com/
Apache License 2.0

[Bug]: from_documents method exit the function #2585

Open HarshithDR opened 3 months ago

HarshithDR commented 3 months ago

What happened?

I was working with LangChain and chromadb and ran into a problem: the program stops while executing the line `vectorstore = Chroma.from_documents(all_splits, embedding_function)`. I tried downgrading chromadb. Version 0.5.3 works fine, but later versions do not.

Versions

chromadb 0.5.5 and chromadb 0.5.4

Relevant log output

No response

tazarov commented 3 months ago

@HarshithDR, thanks for filing this with us. Can you tell me what OS and CPU arch you have?

Recently, we've seen issues with HNSW lib (a pre-compiled binary that is shipped with Chroma) on Windows with AMD processors.
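For anyone else reporting a similar problem, the details asked about above can be collected with the standard library. This is a minimal sketch, not an official Chroma diagnostic; the function name `environment_report` is made up for illustration, and the Python version is included only because it is often relevant to binary-wheel issues.

```python
import platform
import sys


def environment_report() -> dict:
    """Collect OS, CPU architecture, and Python details for a bug report."""
    return {
        "os": platform.system(),            # e.g. "Windows", "Linux", "Darwin"
        "os_release": platform.release(),
        "machine": platform.machine(),      # CPU architecture as reported by the OS
        "processor": platform.processor(),  # may be an empty string on some systems
        "python": platform.python_version(),
        "python_bits": 64 if sys.maxsize > 2**32 else 32,
    }


if __name__ == "__main__":
    for key, value in environment_report().items():
        print(f"{key}: {value}")
```

Pasting the printed lines into the issue gives maintainers the OS and architecture in one go.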

HarshithDR commented 3 months ago

I am using an x86 Intel i5 13420H processor on Windows.

tazarov commented 3 months ago

Thanks for confirming. What is your Python version? Is it possible to downgrade to Python 3.10 and rerun your tests? (see here https://github.com/chroma-core/chroma/issues/2513#issuecomment-2254666266)

HarshithDR commented 3 months ago

I tried with Python 3.10, but I am facing the same error.

Here is the code I used for testing:

```python
from langchain_chroma import Chroma
from langchain_community.embeddings.sentence_transformer import SentenceTransformerEmbeddings
from langchain_text_splitters import CharacterTextSplitter
from langchain.schema import Document  # Correct import for Document
import requests
from bs4 import BeautifulSoup


# Function to get the article content from the Medium link
def get_medium_article_content(url):
    response = requests.get(url)
    soup = BeautifulSoup(response.content, 'html.parser')
    article = soup.find('article')
    paragraphs = article.find_all('p')
    text = '\n'.join(paragraph.get_text() for paragraph in paragraphs)
    return text


# URL of the Medium article
url = "https://medium.com/analytics-vidhya/tensorflow-gpu-installation-with-cuda-cudnn-40fbd4477e7"
article_text = get_medium_article_content(url)

# Create a Document object with the article text
documents = [Document(page_content=article_text)]

# Split it into chunks
text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
docs = text_splitter.split_documents(documents)

# Create the open-source embedding function
embedding_function = SentenceTransformerEmbeddings(model_name="all-MiniLM-L6-v2")
print(1343)  # debug marker: reached before the Chroma call

# Load it into Chroma
db = Chroma.from_documents(docs, embedding_function)
print(23)  # debug marker: printed only if from_documents returns

# Query it
query = "What are the steps to install TensorFlow GPU?"
docs = db.similarity_search(query)

# Print results
for doc in docs:
    print(doc.page_content)
```

The line `db = Chroma.from_documents(docs, embedding_function)` is where the program stops.
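Since the program exits without any log output, one way to get more information is to run the script in a child interpreter with Python's built-in `faulthandler` enabled and look at the exit code. The sketch below is only a suggestion under assumptions: `run_isolated` and the file name `repro.py` are hypothetical, and whether a traceback is actually written depends on how the process dies.

```python
import subprocess
import sys


def run_isolated(script_path: str, timeout: float = 600.0) -> subprocess.CompletedProcess:
    """Run a script in a child interpreter with faulthandler enabled.

    Running in a separate process means a hard crash in native code does not
    take down the caller, and the exit code can be inspected afterwards.
    """
    return subprocess.run(
        [sys.executable, "-X", "faulthandler", script_path],
        capture_output=True,  # keep stdout/stderr for the bug report
        text=True,
        timeout=timeout,
    )


# Example usage (assumes the test code above was saved as repro.py):
#   result = run_isolated("repro.py")
#   print("exit code:", result.returncode)
#   print(result.stderr)
```

A nonzero exit code with no Python exception in `stderr` suggests the interpreter was terminated from native code rather than by a Python error, which would be worth including under "Relevant log output".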

HarshithDR commented 3 months ago

If I switch to chromadb==0.5.3, the code works without breaking.
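To double-check which versions are actually active in the environment after a downgrade, the installed package metadata can be queried from Python. This is a small sketch using the standard library; `installed_versions` is an illustrative helper name, and `chroma-hnswlib` is included on the assumption that it is the package providing the HNSW binary mentioned earlier in the thread.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_versions(packages=("chromadb", "chroma-hnswlib")) -> dict:
    """Return a mapping of package name to installed version, or None if absent."""
    result = {}
    for name in packages:
        try:
            result[name] = version(name)
        except PackageNotFoundError:
            result[name] = None  # package is not installed in this environment
    return result


if __name__ == "__main__":
    for name, ver in installed_versions().items():
        print(f"{name}: {ver if ver is not None else 'not installed'}")
```

Running this in the same virtual environment as the failing script avoids confusion between multiple Python installations.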

tazarov commented 3 months ago

@HarshithDR, indeed 0.5.3 ships with chroma-hnswlib==0.7.3, which works on all OS/CPU architectures (with some minor exceptions we've observed). Thanks for confirming that the bug is reproducible with the following:

HarshithDR commented 3 months ago

Thanks for the update. I will stay on chromadb 0.5.3 for now.