Closed: bill2766 closed this issue 3 weeks ago
In this case, the file chroma.sqlite3-journal would be retained.
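A leftover journal file is a useful signal here: in its default rollback-journal mode, SQLite normally deletes the -journal file when a transaction commits, so finding one after the script ends suggests the process stopped mid-write. A minimal sketch of that check follows; the helper name leftover_journal is hypothetical, and the directory name is the one used later in this thread.

```python
from pathlib import Path


def leftover_journal(persist_dir: str) -> bool:
    """Return True if chroma.sqlite3-journal is still present in persist_dir.

    Hypothetical helper for illustration; it only inspects the filesystem
    and does not open the database.
    """
    return (Path(persist_dir) / "chroma.sqlite3-journal").exists()


if __name__ == "__main__":
    # "chromadb-2749" is the persist directory used in the scripts in this thread.
    print(leftover_journal("chromadb-2749"))
```

Running this right after the script exits (and before starting it again) shows whether the last write was interrupted.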
@bill2766, thanks for reporting this. Let me see if I get this right.
Does the script stop at print(collection.get(ids=["id9"]))? Can you share test2.txt (or an example of the file)?
Example content of test2.txt:
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
bbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbb
(repeat many times)
@bill2766, I think the way you read the file may break the onnx model. I did try your verbatim code with the file being generated like so:
import random

lines = [
    "aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa",
    "bbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbb",
]
with open('test2.txt', 'w') as file:
    for i in range(200):
        file.write(random.choice(lines))
        file.write("\n")
The observation after some testing is that the onnx model hangs pretty badly (will dig into that later on). I think it is related to me being on macOS and using the default CoreML provider.
Once I fixed the provider issue and the code to read the lines correctly, things seem to work just fine. Here's the final code:
import chromadb
import os
import shutil
from chromadb.utils.embedding_functions.onnx_mini_lm_l6_v2 import ONNXMiniLM_L6_V2

# Embedding function restricted to the CPU execution provider
ef = ONNXMiniLM_L6_V2(preferred_providers=["CPUExecutionProvider"])

# Read one line from the file
with open('test2.txt', 'r') as file:
    content = file.readline()

# Build 200 documents and their ids
contents = []
ids = []
for i in range(200):
    contents.append(content.strip())
    ids.append("id{}".format(i))
print(len(contents))

# Start from a clean persist directory on every run
shutil.rmtree("chromadb-2749", ignore_errors=True)
client = chromadb.PersistentClient(path="chromadb-2749")
collection = client.create_collection(name="my_collection")
# collection = client.get_collection(name="my_collection")
collection.add(
    documents=contents,
    ids=ids
)
print(collection.get(ids=["id9"]))
# print(collection.get(ids=["id199"]))
Resulting in the following output:
200
{'ids': ['id9'], 'embeddings': None, 'metadatas': [None], 'documents': ['aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa'], 'uris': None, 'data': None, 'included': ['metadatas', 'documents']}
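Note that the script as posted calls readline() once, so every document is a copy of the first line of test2.txt. If the intent is one document per line of the file, a minimal sketch could look like the following. The helper name load_documents and its limit parameter are hypothetical; only the id format and the file name come from the script above.

```python
def load_documents(path, limit=200):
    """Read up to `limit` lines from `path`, one document per line.

    Returns (contents, ids), where ids follow the "id{n}" pattern used in
    the thread. Hypothetical helper, not part of chromadb.
    """
    contents = []
    ids = []
    with open(path, "r") as file:
        for i, line in enumerate(file):
            if i >= limit:
                break
            contents.append(line.strip())
            ids.append("id{}".format(i))
    return contents, ids


# Usage, assuming test2.txt exists in the working directory:
# contents, ids = load_documents("test2.txt")
# collection.add(documents=contents, ids=ids)
```

Iterating over the file object reads it line by line, so the loop also stops cleanly if the file has fewer lines than the limit.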
I tested the code in my environment, but it doesn't work.
import chromadb
import os
import shutil
from chromadb.utils.embedding_functions.onnx_mini_lm_l6_v2 import ONNXMiniLM_L6_V2

ef = ONNXMiniLM_L6_V2(preferred_providers=["CPUExecutionProvider"])
with open('test2.txt', 'r') as file:
    content = file.readline()
contents = []
ids = []
for i in range(200):
    contents.append(content.strip())
    ids.append("id{}".format(i))
shutil.rmtree("chromadb-2749", ignore_errors=True)
client = chromadb.PersistentClient(path="chromadb-2749")
collection = client.create_collection(name="my_collection")
# collection = client.get_collection(name="my_collection")
collection.add(
    documents=contents,
    ids=ids
)
print(len(contents))
print(collection.get(ids=["id9"]))
# print(collection.get(ids=["id199"]))
Result: no print, and the sqlite3-journal file exists.
@bill2766, thanks for testing it out. Let me ask this: when you say "result:", does it mean nothing returns, e.g. the process crashes, or does the process hang?
Yes, the program stopped without printing anything and without an error.
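One way to tell a silent crash from a hang is to run the script as a child process and look at its exit code. The sketch below assumes the script is saved as main.py, as in the original report; the helper names classify_run and classify_script and the 60-second timeout are hypothetical choices for illustration.

```python
import subprocess
import sys


def classify_run(cmd, timeout=60.0):
    """Run `cmd` and classify the outcome as "ok", "crash", or "hang".

    Returns (label, returncode). A nonzero exit code is reported as
    "crash"; hitting the timeout is reported as "hang" with returncode
    None (subprocess.run kills the child in that case).
    """
    try:
        proc = subprocess.run(cmd, capture_output=True, text=True, timeout=timeout)
    except subprocess.TimeoutExpired:
        return "hang", None
    if proc.returncode == 0:
        return "ok", 0
    return "crash", proc.returncode


def classify_script(path, timeout=60.0):
    """Run a Python script with the current interpreter and classify it."""
    return classify_run([sys.executable, path], timeout=timeout)


# Usage, assuming the script from this thread is saved as main.py:
# print(classify_script("main.py"))
```

A script that stops with no output but a nonzero exit code points to a crash in native code rather than a hang. Running it as python -X faulthandler main.py may also print a traceback if the interpreter receives a fatal signal.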
@bill2766, this looks like the HNSW silent crash we're experiencing on Windows with AMD processors (#2513). Can you downgrade to version 0.5.0 and try the script?
pip install chromadb==0.5.0
@tazarov thanks, it works, though ONNXMiniLM_L6_V2 needs to be commented out. My virtual env is python=3.9, numpy=1.26, chromadb=0.5.0, Windows with an Intel processor.
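Since the behavior differs between versions, it can help to capture the environment in one step when reporting. A minimal sketch using only the standard library follows; the helper name report_versions is hypothetical, and the default package list (chromadb, numpy, onnxruntime) is an assumption about which distributions matter here.

```python
import platform
from importlib import metadata


def report_versions(packages=("chromadb", "numpy", "onnxruntime")):
    """Collect the Python, OS, and installed package versions.

    Packages that are not installed are reported as None instead of
    raising. Hypothetical helper for illustration.
    """
    versions = {
        "python": platform.python_version(),
        "platform": platform.platform(),
    }
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


if __name__ == "__main__":
    for key, value in report_versions().items():
        print("{}: {}".format(key, value))
```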
What happened?
When I tried to import a large txt file or a large number of txt files into Chroma, the program stopped without any error. Why?
This is the file structure:
This is the main.py:
The size of "test2.txt" is 13.5KB, which only contains letters like 'a' and 'b'.
Versions
Chroma v0.5.4, Python 3.10.14, Windows 11
Relevant log output