chroma-core / chroma

the AI-native open-source embedding database
https://www.trychroma.com/
Apache License 2.0

[Bug]: peek() causes warning "Delete of nonexisting embedding ID" #969

Open andrewshvv opened 1 year ago

andrewshvv commented 1 year ago

What happened?

I tried to delete IDs that do not exist in the index. Since then, every peek() call produces the warning "Delete of nonexisting embedding ID". @HammadB mentioned that the warnings can be ignored, but peek() still shouldn't cause them. Related discussion on Discord.

Here is chroma.zip for reproduction.

import chromadb

# db_path is the directory of the persisted database
client = chromadb.PersistentClient(path=db_path)
posts = client.get_or_create_collection(
    name="posts",
    metadata={
        "hnsw:space": "cosine",
        "hnsw:M": 16,
        "hnsw:construction_ef": 200,
    }
)

Versions

Initially I used chromadb==0.4.2. Before filing this issue I switched to chromadb==0.4.5 to check whether the warnings persist, and they do.

python = "^3.9.17"

Relevant log output

self._post_collection.peek(limit=0)["ids"]
PyDev console: starting.

2023-08-11 21:59:25 [T:MainThread] WARNING:chromadb.segment.impl.vector.brute_force_index: Delete of nonexisting embedding ID: 29
2023-08-11 21:59:25 [T:MainThread] WARNING:chromadb.segment.impl.vector.brute_force_index: Delete of nonexisting embedding ID: 31
2023-08-11 21:59:25 [T:MainThread] WARNING:chromadb.segment.impl.vector.brute_force_index: Delete of nonexisting embedding ID: 32
2023-08-11 21:59:25 [T:MainThread] WARNING:chromadb.segment.impl.vector.brute_force_index: Delete of nonexisting embedding ID: 33
2023-08-11 21:59:25 [T:MainThread] WARNING:chromadb.segment.impl.vector.brute_force_index: Delete of nonexisting embedding ID: 34
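
Since the warnings are reported as safe to ignore, one way to hide them in the meantime is a standard-library logging filter. This is only a sketch: it assumes the logger name and message text shown in the output above, and the class name DropNonexistingDeleteWarning is made up for illustration. It is not an official fix.

```python
import logging

# Logger name copied from the log output above (an assumption that it
# stays the same across chromadb versions).
NOISY_LOGGER = "chromadb.segment.impl.vector.brute_force_index"


class DropNonexistingDeleteWarning(logging.Filter):
    """Hypothetical filter: drops only the 'Delete of nonexisting' records."""

    def filter(self, record: logging.LogRecord) -> bool:
        # Returning False discards the record; everything else passes through.
        return "Delete of nonexisting embedding ID" not in record.getMessage()


# Attach the filter to the logger that emits the warning.
logging.getLogger(NOISY_LOGGER).addFilter(DropNonexistingDeleteWarning())
```

Other warnings from the same logger still get through, so this hides the noise without masking unrelated problems.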

andrewshvv commented 1 year ago

I have been able to reproduce it. For some reason it doesn't happen right after the delete, only after a restart. For reference, in my actual code I call stop() when the program shuts down.

import chromadb

client = chromadb.PersistentClient(path="test")
try:
    client.delete_collection(name="test_collection")
except ValueError:
    pass

collection = client.get_or_create_collection(
    "test_collection",
    metadata={"hnsw:space": "cosine"}
)

collection.add(
    embeddings=[[1, 2, 3]],
    ids=["1"]
)
collection.delete(ids=["3", "4", "5"])

client.stop()  # improvised restart
client = chromadb.PersistentClient(path="test")
collection = client.get_or_create_collection(
    "test_collection",
    metadata={"hnsw:space": "cosine"}
)
print("peek")
collection.peek()["ids"]
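
As a stopgap, the delete call could be limited to IDs that are actually present, which should avoid queueing deletes for missing IDs in the first place. This is a sketch, not a confirmed fix: delete_existing_only is a hypothetical helper (not part of chromadb), and it assumes collection.get(ids=...) returns only the IDs that exist.

```python
from typing import Any, List, Sequence


def delete_existing_only(collection: Any, ids: Sequence[str]) -> List[str]:
    """Hypothetical helper: delete only the IDs the collection contains.

    `collection` is any object with chromadb-style get(ids=...) and
    delete(ids=...) methods. Returns the IDs that were actually deleted.
    """
    ids = list(ids)
    if not ids:
        # Nothing requested, so do not call get() or delete() at all.
        return []
    # Assumption: get() silently skips IDs that are not in the collection.
    existing = list(collection.get(ids=ids)["ids"])
    if existing:
        collection.delete(ids=existing)
    return existing
```

In the repro above, delete_existing_only(collection, ["3", "4", "5"]) would find no matching IDs and skip the delete entirely.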

andrewshvv commented 1 year ago

While trying to replicate the bug I encountered another interesting behavior. Let me know if I should create a separate issue for it.

import chromadb

client = chromadb.PersistentClient(path="test")
try:
    client.delete_collection(name="test_collection")
except ValueError:
    pass

collection = client.get_or_create_collection(
    "test_collection",
    metadata={"hnsw:space": "cosine"}
)

collection.delete(ids=["3", "4", "5"])

Delete of nonexisting embedding ID: 3
--- Logging error ---
Traceback (most recent call last):
  File "/Users/andrey/Library/Caches/pypoetry/virtualenvs/jobsearch-itTcmVTs-py3.9/lib/python3.9/site-packages/chromadb/db/mixins/embeddings_queue.py", line 263, in _notify_one
    sub.callback([embedding])
  File "/Users/andrey/Library/Caches/pypoetry/virtualenvs/jobsearch-itTcmVTs-py3.9/lib/python3.9/site-packages/chromadb/segment/impl/vector/local_persistent_hnsw.py", line 219, in _write_records
    ) is not None or self._brute_force_index.has_id(id)
AttributeError: 'NoneType' object has no attribute 'has_id'

HammadB commented 1 year ago

Yeah, can you please file another bug with just that minimal repro? Thanks. Will patch.

ttww commented 6 months ago

I had the same problem after terminating the embedding generation while debugging. The "solution" was to delete the index once, and the problem never appeared again...

Veenumittal commented 2 days ago

@ttww: How did you perform the index deletion?