chroma-core / chroma

the AI-native open-source embedding database
https://www.trychroma.com/
Apache License 2.0

[Bug]: When I tried to import a large txt or a large number of txts into chroma, the program stopped without error. Why? #2749

Closed bill2766 closed 3 weeks ago

bill2766 commented 1 month ago

What happened?

When I tried to import a large txt file or a large number of txt files into Chroma, the program stopped without any error. Why?

This is the file structure: [image]

This is main.py:

import chromadb
import os
import shutil

with open('test2.txt', 'r') as file:
    content = file.read()

contents = []
ids = []

for i in range(200):
    contents.append(content)
    ids.append("id{}".format(i))
print(len(contents))

shutil.rmtree("chromadb", ignore_errors=True)
client = chromadb.PersistentClient(path="chromadb")

collection = client.create_collection(name="my_collection")
# collection = client.get_collection(name="my_collection")

collection.add(
    documents= contents,
    ids=ids
)

print(collection.get(ids=["id9"]))
# print(collection.get(ids=["id199"]))

test2.txt is 13.5 KB and contains only letters such as 'a' and 'b'.
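One way to see how far the import gets before the process stops is to add the documents in small batches and print progress after each batch. The sketch below is only a diagnostic aid, not a fix: `add_in_batches` and its `batch_size` default are hypothetical names chosen for illustration, and `add_fn` stands in for `collection.add`.

```python
from typing import Callable, List


def add_in_batches(
    add_fn: Callable[..., None],
    documents: List[str],
    ids: List[str],
    batch_size: int = 20,  # hypothetical default, not a Chroma setting
) -> int:
    """Call add_fn(documents=..., ids=...) once per batch and return the batch count."""
    if len(documents) != len(ids):
        raise ValueError("documents and ids must have the same length")
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")

    batches = 0
    for start in range(0, len(documents), batch_size):
        end = min(start + batch_size, len(documents))
        add_fn(documents=documents[start:end], ids=ids[start:end])
        batches += 1
        # Flush so the line is visible even if the process dies right afterwards.
        print(f"added {end}/{len(documents)}", flush=True)
    return batches
```

With the script above this would be called as `add_in_batches(collection.add, contents, ids)`; the last progress line printed shows the last batch that completed.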

Versions

Chroma v0.5.4, Python 3.10.14, Windows 11

Relevant log output

There were no errors.

bill2766 commented 1 month ago

In this case, the chroma.sqlite3-journal file is left behind.
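A leftover journal file usually suggests that a SQLite write transaction did not finish cleanly. As a rough sketch, the check below looks for that file in the persistence directory after a run. `has_leftover_journal` is a hypothetical helper, and the default file name is taken from the report above.

```python
from pathlib import Path


def has_leftover_journal(persist_dir: str, db_name: str = "chroma.sqlite3") -> bool:
    """Return True if '<db_name>-journal' exists inside persist_dir."""
    return (Path(persist_dir) / f"{db_name}-journal").is_file()
```

For the script in this issue, the call would be `has_leftover_journal("chromadb")`.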

tazarov commented 1 month ago

@bill2766, thanks for reporting this. Let me see if I get this right.

bill2766 commented 1 month ago

  1. Haha, that line is there to test whether the code runs that far. When I import a large txt file or a large number of txt files into Chroma, the program stops before reaching the code that follows (print(collection.get(ids=["id9"]))).
  2. Yes. The code is meant to reproduce the problem.

Example content of test2.txt:

aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
bbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbb
(repeat many times)

tazarov commented 3 weeks ago

@bill2766, I think the way you read the file may break the ONNX model. I tried your code verbatim, with the file generated like so:

import random

lines = ["aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa","bbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbb"]
with open('test2.txt', 'w') as file:
    for i in range(200):
        file.write(random.choice(lines))
        file.write("\n")

After some testing, I observed that the ONNX model hangs pretty badly (I will dig into that later). I think it is related to my being on macOS and using the default CoreML provider.
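To make the reading difference concrete: `file.read()` returns the whole file as one string, while `file.readline()` returns only the first line, so each document handed to `collection.add` is much shorter in the second case. The sketch below compares the two on any text file; `compare_reads` is a hypothetical helper used only for illustration.

```python
from pathlib import Path


def compare_reads(path):
    """Return (whole_file, first_line) for the text file at path."""
    with open(path, "r") as f:
        whole = f.read()  # entire file, newlines included
    with open(path, "r") as f:
        first = f.readline().strip()  # first line only, newline stripped
    return whole, first


if __name__ == "__main__":
    demo = Path("read_demo.txt")  # illustrative file name
    demo.write_text("aaaa\nbbbb\naaaa\n")
    whole, first = compare_reads(demo)
    print(len(whole), len(first))
    demo.unlink()
```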

Once I fixed the provider issue and changed the code to read the lines correctly, things seem to work just fine. Here's the final code:

import chromadb
import os
import shutil

from chromadb.utils.embedding_functions.onnx_mini_lm_l6_v2 import ONNXMiniLM_L6_V2

ef  = ONNXMiniLM_L6_V2(preferred_providers=["CPUExecutionProvider"])

with open('test2.txt', 'r') as file:
    content = file.readline()

contents = []
ids = []

for i in range(200):
    contents.append(content.strip())
    ids.append("id{}".format(i))
print(len(contents))

shutil.rmtree("chromadb-2749", ignore_errors=True)
client = chromadb.PersistentClient(path="chromadb-2749")

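# Pass ef explicitly; otherwise the collection falls back to the default embedding function.
collection = client.create_collection(name="my_collection", embedding_function=ef)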
# collection = client.get_collection(name="my_collection")

collection.add(
    documents= contents,
    ids=ids
)

print(collection.get(ids=["id9"]))
# print(collection.get(ids=["id199"]))

Resulting in the following output:

200
{'ids': ['id9'], 'embeddings': None, 'metadatas': [None], 'documents': ['aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa'], 'uris': None, 'data': None, 'included': ['metadatas', 'documents']}

bill2766 commented 3 weeks ago

I tested the code in my environment, but it doesn't work.

import chromadb
import os
import shutil

from chromadb.utils.embedding_functions.onnx_mini_lm_l6_v2 import ONNXMiniLM_L6_V2

ef  = ONNXMiniLM_L6_V2(preferred_providers=["CPUExecutionProvider"])

with open('test2.txt', 'r') as file:
    content = file.readline()

contents = []
ids = []

for i in range(200):
    contents.append(content.strip())
    ids.append("id{}".format(i))

shutil.rmtree("chromadb-2749", ignore_errors=True)
client = chromadb.PersistentClient(path="chromadb-2749")

collection = client.create_collection(name="my_collection")
# collection = client.get_collection(name="my_collection")

collection.add(
    documents= contents,
    ids=ids
)
print(len(contents))

print(collection.get(ids=["id9"]))
# print(collection.get(ids=["id199"]))

Result: nothing is printed, and the chroma.sqlite3-journal file remains.

tazarov commented 3 weeks ago

@bill2766, thanks for testing it out. Let me ask this: when you say "result", do you mean that nothing is returned, e.g. that the process crashes, or that the process hangs?
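One way to tell a crash from a hang is to run the script in a child process with a timeout and inspect the exit code. This is a minimal sketch using only the standard library; `classify_run`, its labels, and the 60-second default are illustrative assumptions.

```python
import subprocess
import sys


def classify_run(cmd, timeout_s=60):
    """Run cmd and return (label, returncode).

    label is "ok" for exit code 0, "nonzero-exit" for any other exit code,
    and "hung" if the process did not finish within timeout_s seconds.
    """
    try:
        proc = subprocess.run(cmd, capture_output=True, text=True, timeout=timeout_s)
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child before raising TimeoutExpired.
        return ("hung", None)
    if proc.returncode == 0:
        return ("ok", 0)
    # On Windows, a native crash typically surfaces as a large exit code such as
    # 3221225477 (0xC0000005, access violation). On POSIX, a negative return
    # code means the child was killed by a signal.
    return ("nonzero-exit", proc.returncode)
```

For the script in this issue, the call would be `classify_run([sys.executable, "main.py"])`.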

bill2766 commented 3 weeks ago

Yes, the program stopped without printing anything and without an error.
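When a process stops with no Python exception, the failure may be in native code. A minimal sketch, assuming the standard-library `faulthandler` module is enough to capture it: enabling it at the top of the script makes Python dump a traceback if the interpreter hits a fatal error such as a segmentation fault. The helper name `enable_crash_log` and the log file name are illustrative.

```python
import faulthandler


def enable_crash_log(path: str = "crash_trace.log"):
    """Enable faulthandler and write tracebacks for fatal errors to path.

    Returns the open file object, which must stay open for as long as
    faulthandler is enabled.
    """
    log_file = open(path, "w")
    faulthandler.enable(file=log_file, all_threads=True)
    return log_file
```

Calling `log_file = enable_crash_log()` before `import chromadb` would cover the whole run; if the process dies in native code, the log file should show which Python line was executing.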

tazarov commented 3 weeks ago

@bill2766, this looks like the silent HNSW crash we're experiencing on Windows with AMD processors (#2513). Can you downgrade to version 0.5.0 and try the script again?

pip install chromadb==0.5.0

bill2766 commented 3 weeks ago

@tazarov thanks, it works, although ONNXMiniLM_L6_V2 needs to be commented out. My virtual env is python=3.9, numpy=1.26, chromadb=0.5.0, on Windows with an Intel processor.
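To confirm which versions are actually active in a virtual environment like the one listed above, the installed package metadata can be queried directly. This is a small sketch using the standard-library `importlib.metadata` module (Python 3.8+); `installed_version` is a hypothetical helper name.

```python
import sys
from importlib import metadata
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of package, or None if it is not installed."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    print("python", sys.version.split()[0])
    for name in ("chromadb", "numpy"):
        print(name, installed_version(name))
```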