chroma-core / chroma

the AI-native open-source embedding database
https://www.trychroma.com/
Apache License 2.0

[Bug]: Chroma.add terminates flask without any error #2438

Open petergaoshan opened 4 months ago

petergaoshan commented 4 months ago

What happened?

I'm building an API with Flask. The client sends me a file, and I save it to a Chroma database on my side. Chroma.add terminates my program without raising any exception. Saving a smaller file works fine, but sending a larger file crashes the process. At first I thought it might be a memory problem, so I tested the same code in a Jupyter notebook outside Flask, and there it runs properly.

def save_w_chunking(self, docs: List[Document]) -> None:

        text_splitter = SemanticChunker(self._embeddings, breakpoint_threshold_type = "percentile", breakpoint_threshold_amount = 80, sentence_split_regex = r'(?<=[。?!])|(?<=\n)')

        docs = text_splitter.split_documents(docs)

        seen_docs = []
        temp_docs = []

        for d in docs:

            is_unique = d.page_content not in seen_docs
            has_content = len(d.page_content.strip().strip("\n")) > 0

            if is_unique and has_content:

                seen_docs.append(d.page_content)

                d.page_content =  d.metadata["filename"] + ":\n" + d.page_content

                temp_docs.append(d)

        docs = temp_docs
        docs = filter_complex_metadata(docs)

        if len(docs) == 0:
            return

        try:

            t = [d.page_content for d in docs]
            m = [d.metadata for d in docs]
            ids = [str(uuid.uuid4()) for _ in range(len(t))]

            self._ChromaDB.add(ids = ids,
                               documents = t,
                               metadatas = m)

        except Exception as e:
            print("caught exception: ", e)

@app.route('/ChromaEditor', methods = ['POST'])
def upload_file():

    result = {"msg" : "success"}

    file = request.files["file"]

    file_path = os.path.join("blink", file.filename)

    file.save(file_path)

    text_doc, table_doc = unstrucutured_to_Doc([file_path])

    print("successfully parsed " + file.filename)

    CE.save_w_chunking(text_doc)
    CE.save_wo_chunking(table_doc)

    print("successfully saved " + file.filename)

    return result

Versions

Python 3.12.3, chromadb 0.5.0, langchain-chroma 0.1.1

Relevant log output

* Debug mode: off
WARNING: This is a development server. Do not use it in a production deployment. Use a production WSGI server instead.
 * Running on all addresses (0.0.0.0)
Press CTRL+C to quit
successfully parsed hello.docx

(llm) C:\Users\Desktop>

tazarov commented 4 months ago

@petergaoshan, do you run your flask app in a container? It might get terminated if you run out of memory.
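
One way to tell an out-of-memory kill apart from other silent exits is to run the ingestion in a child process and inspect its exit code. The following is a minimal sketch using only the Python standard library; `run_and_describe` is a hypothetical helper name, and the command passed to it (for example a script that calls the saving code) is an assumption, not something from this report.

```python
import subprocess


def run_and_describe(cmd):
    """Run cmd in a child process and describe how it ended."""
    code = subprocess.run(cmd).returncode
    if code == 0:
        return "exited normally"
    if code < 0:
        # POSIX: a negative return code means the child was killed by that
        # signal number (9 is SIGKILL, typical of an OOM kill; 11 is SIGSEGV
        # on Linux).
        return f"killed by signal {-code}"
    if code == 0xC0000005:
        # Windows reports an access violation with this status code.
        return "Windows access violation (0xC0000005)"
    return f"exited with code {code}"
```

For example, `run_and_describe([sys.executable, "ingest.py"])` (with `ingest.py` standing in for a hypothetical script that performs the add) would report whether the process was killed by a signal or exited on its own.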

tazarov commented 3 months ago

@petergaoshan, a possible cause is chroma-hnswlib raising a segmentation fault, which is sometimes masked. See #2513.
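
A crash in native code bypasses `try`/`except`, which would match the reported symptom. Python's standard `faulthandler` module can dump the Python traceback when the process receives a fatal signal. This is a minimal sketch, assuming it is called once at Flask app startup; `enable_crash_tracebacks` and the log path are hypothetical names, not part of Chroma or Flask.

```python
import faulthandler
import sys


def enable_crash_tracebacks(log_path=None):
    """Dump Python tracebacks on fatal errors such as a segmentation fault.

    Returns the open log file (or None when writing to stderr). The file
    must stay open for as long as faulthandler is enabled.
    """
    if log_path is None:
        faulthandler.enable(file=sys.stderr, all_threads=True)
        return None
    log_file = open(log_path, "a")
    faulthandler.enable(file=log_file, all_threads=True)
    return log_file
```

Calling `enable_crash_tracebacks("crash.log")` before the first request should leave a traceback in `crash.log` if the process dies from a fatal signal during the add call. Running the app with `python -X faulthandler` has a similar effect without code changes.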