chroma-core / chroma

the AI-native open-source embedding database
https://www.trychroma.com/
Apache License 2.0

[Bug]: Embeddings Deletion Causes "Delete of nonexisting embedding ID" #989

Open mickey-lyx opened 1 year ago

mickey-lyx commented 1 year ago

What happened?

Hi there, I tried to upload two PDF files to a persistent collection and then delete one of them, but I received warning messages: "Delete of nonexisting embedding ID". The warning only appears when I upload multiple files and delete one of them. Here are my test files and code.

alphabet-2023-q1-10q.pdf Apple Inc.-10K.pdf

from langchain.text_splitter import CharacterTextSplitter
from langchain.document_loaders import PyPDFLoader
import chromadb
from chromadb.utils.embedding_functions import OpenAIEmbeddingFunction

def main():
    # create collection
    client = chromadb.PersistentClient(path="./chroma_db")
    collection = client.get_or_create_collection(name="test", embedding_function=OpenAIEmbeddingFunction())
    text_splitter = CharacterTextSplitter(chunk_size=1500, chunk_overlap=100)

    # load document_1
    loader_1 = PyPDFLoader("alphabet-2023-q1-10q.pdf")
    documents1 = loader_1.load()
    docs_1 = text_splitter.split_documents(documents1)
    ids_1 = [str(i) for i in range(1, len(docs_1) + 1)]
    texts_1 = [split.page_content for split in docs_1]
    metadatas_1 = [split.metadata for split in docs_1]
    collection.add(ids=ids_1, metadatas=metadatas_1, documents=texts_1)

    # load document_2
    loader_2 = PyPDFLoader("Apple Inc.-10K.pdf")
    documents_2 = loader_2.load()
    docs_2 = text_splitter.split_documents(documents_2)
    ids_2 = [str(i) for i in range(47, len(docs_2) + 47)]
    texts_2 = [split.page_content for split in docs_2]
    metadatas_2 = [split.metadata for split in docs_2]
    collection.add(ids=ids_2, metadatas=metadatas_2, documents=texts_2)

    print(f"ids_1: {ids_1}")
    print(f"ids_2: {ids_2}")

    print("count before", collection.count())
    # delete document_1
    collection.delete(ids_1)
    print("count after", collection.count())

if __name__ == '__main__':
    main()

Versions

chromadb==0.4.5
langchain==0.0.264
python==3.10.12
MacOS==13.3.1

Relevant log output

ids_1: ['1', '2', '3', '4', '5', '6', '7', '8', '9', '10', '11', '12', '13', '14', '15', '16', '17', '18', '19', '20', '21', '22', '23', '24', '25', '26', '27', '28', '29', '30', '31', '32', '33', '34', '35', '36', '37', '38', '39', '40', '41', '42', '43', '44', '45', '46']
ids_2: ['47', '48', '49', '50', '51', '52', '53', '54', '55', '56', '57', '58', '59', '60', '61', '62', '63', '64', '65', '66', '67', '68', '69', '70', '71', '72', '73', '74', '75', '76', '77', '78', '79', '80', '81', '82', '83', '84', '85', '86', '87', '88', '89', '90', '91', '92', '93', '94', '95', '96', '97', '98', '99', '100', '101', '102', '103', '104', '105', '106', '107']
count before 107
Delete of nonexisting embedding ID: 1
Delete of nonexisting embedding ID: 2
Delete of nonexisting embedding ID: 3
Delete of nonexisting embedding ID: 4
Delete of nonexisting embedding ID: 5
Delete of nonexisting embedding ID: 6
Delete of nonexisting embedding ID: 7
Delete of nonexisting embedding ID: 8
Delete of nonexisting embedding ID: 9
Delete of nonexisting embedding ID: 10
Delete of nonexisting embedding ID: 11
Delete of nonexisting embedding ID: 12
Delete of nonexisting embedding ID: 13
Delete of nonexisting embedding ID: 14
Delete of nonexisting embedding ID: 15
Delete of nonexisting embedding ID: 16
Delete of nonexisting embedding ID: 17
Delete of nonexisting embedding ID: 18
Delete of nonexisting embedding ID: 19
Delete of nonexisting embedding ID: 20
Delete of nonexisting embedding ID: 21
Delete of nonexisting embedding ID: 22
Delete of nonexisting embedding ID: 23
Delete of nonexisting embedding ID: 24
Delete of nonexisting embedding ID: 25
Delete of nonexisting embedding ID: 26
Delete of nonexisting embedding ID: 27
Delete of nonexisting embedding ID: 28
Delete of nonexisting embedding ID: 29
Delete of nonexisting embedding ID: 30
Delete of nonexisting embedding ID: 31
Delete of nonexisting embedding ID: 32
Delete of nonexisting embedding ID: 33
Delete of nonexisting embedding ID: 34
Delete of nonexisting embedding ID: 35
Delete of nonexisting embedding ID: 36
Delete of nonexisting embedding ID: 37
Delete of nonexisting embedding ID: 38
Delete of nonexisting embedding ID: 39
Delete of nonexisting embedding ID: 40
Delete of nonexisting embedding ID: 41
Delete of nonexisting embedding ID: 42
Delete of nonexisting embedding ID: 43
Delete of nonexisting embedding ID: 44
Delete of nonexisting embedding ID: 45
Delete of nonexisting embedding ID: 46
count after 61

Process finished with exit code 0

qyzhizi commented 1 year ago

I have the problem too

mickey-lyx commented 1 year ago

@tazarov Hi, could you please look at this problem? Thank you for your time!

tazarov commented 1 year ago

@mickey-lyx, thanks for reporting this. I'll take a look at this soon. At a glance, the code looks fine, and the actual result seems to be fine: you have 61 docs once you remove 46 from the starting 107. All in all, this seems like a warning, not an actual bug. I will have a look and let you know.

mickey-lyx commented 1 year ago

@tazarov Really appreciate it. The result is right. I'm just wondering why there are warnings about deleting nonexisting embeddings. Is it because the embeddings were deleted multiple times?

guyko81 commented 1 year ago

I have the same issue, and running queries on the db triggers this warning every time. What I did was select items with a where clause (no IDs were given) and remove them one by one:

my_collection.delete(
    where={"file_id": str(file_id)}
)

Since then the warning is shown every time I query it.

becklabs commented 1 year ago

I'm having the same issue. This seems to occur even when an empty list is passed as ids to Collection.delete.
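
If the empty-list case is what triggers the warning in a given application, one possible client-side workaround is simply not to issue the call. The sketch below is only that: `delete_if_any` is a hypothetical helper, not part of chromadb, and it assumes nothing about the collection beyond the `delete(ids=...)` call already used in this thread.

```python
from typing import Any, Sequence


def delete_if_any(collection: Any, ids: Sequence[str]) -> bool:
    """Call collection.delete(ids=...) only when ids is non-empty.

    Hypothetical helper, not part of chromadb. Returns True if a delete
    was actually issued, False if the call was skipped.
    """
    ids = list(ids)
    if not ids:
        # Nothing to delete, so do not touch the collection at all.
        return False
    collection.delete(ids=ids)
    return True
```

This does not address the underlying cause; it only avoids sending a delete that cannot remove anything.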

jeffchuber commented 1 year ago

We'd love to get this fixed - is anyone able to help post a minimal repro?

mickey-lyx commented 1 year ago

@jeffchuber

import chromadb
from chromadb.utils.embedding_functions import OpenAIEmbeddingFunction

def main():
    client = chromadb.PersistentClient(path="./chroma_db")
    collection = client.get_or_create_collection(name="test", embedding_function=OpenAIEmbeddingFunction())

    num_1 = 47
    num_2 = 70

    texts_1 = [f"text_1.{i}" for i in range(num_1)]
    ids_1 = [f"1.{i}" for i in range(num_1)]
    texts_2 = [f"text_2.{i}" for i in range(num_2)]
    ids_2 = [f"2.{i}" for i in range(num_2)]

    collection.add(ids=ids_1, documents=texts_1)
    collection.add(ids=ids_2, documents=texts_2)

    print("count before", collection.count())
    collection.delete(ids_1)
    print("count after", collection.count())

if __name__ == '__main__':
    main()

timothymugayi commented 12 months ago

I'm seeing similar warnings, but I'm unsure whether I should be concerned since it's only a warning. It would be good to get some insight into why this occurs: it started after uploading a few PDF files, and the server keeps logging it even while the FastAPI app is idle.

112-49d5-a776-2c02c03897e8:77661df1-86bc-4f33-9119-a90d77f7c24e
chroma               | 2023-09-16 15:22:31 WARNING  chromadb.segment.impl.vector.brute_force_index Delete of nonexisting embedding ID: c441314d-7112-49d5-a776-2c02c03897e8:484a228b-de38-4674-8f14-078f4f218afd
chroma               | 2023-09-16 15:22:31 WARNING  chromadb.segment.impl.vector.brute_force_index Delete of nonexisting embedding ID: c441314d-7112-49d5-a776-2c02c03897e8:51c75801-6ecd-4490-941e-8ee6f2229476
chroma               | 2023-09-16 15:22:31 WARNING  chromadb.segment.impl.vector.brute_force_index Delete of nonexisting embedding ID: c441314d-7112-49d5-a776-2c02c03897e8:282cb350-257b-49ef-ae55-ab3997099d58
chroma               | 2023-09-16 15:22:31 WARNING  chromadb.segment.impl.vector.brute_force_index Delete of nonexisting embedding ID: c441314d-7112-49d5-a776-2c02c03897e8:fe9d8119-b72a-44c1-9bc5-f5c173621a4b
chroma               | 2023-09-16 15:22:31 WARNING  chromadb.segment.impl.vector.brute_force_index Delete of nonexisting embedding ID: c441314d-7112-49d5-a776-2c02c03897e8:c92f759d-f0e7-46e9-9156-e5c47e917de7
chroma               | 2023-09-16 15:22:31 WARNING  chromadb.segment.impl.vector.brute_force_index Delete of nonexisting embedding ID: c441314d-7112-49d5-a776-2c02c03897e8:5be4bf1c-7c02-4815-9c25-de4463b0231f
chroma               | 2023-09-16 15:22:31 WARNING  chromadb.segment.impl.vector.brute_force_index Delete of nonexisting embedding ID: c441314d-7112-49d5-a776-2c02c03897e8:32500766-ceb7-4b12-8e8d-04b34306f30f
chroma               | 2023-09-16 15:22:31 WARNING  chromadb.segment.impl.vector.brute_force_index Delete of nonexisting embedding ID: c441314d-7112-49d5-a776-2c02c03897e8:7e5d60fd-cb8a-4ecf-adf3-8d86694458e8
chroma               | 2023-09-16 15:22:31 WARNING  chromadb.segment.impl.vector.brute_force_index Delete of nonexisting embedding ID: c441314d-7112-49d5-a776-2c02c03897e8:5cfbdc44-cc08-4749-8d5d-d628f6aa4676
chroma               | 2023-09-16 15:22:31 WARNING  chromadb.segment.impl.vector.brute_force_index Delete of nonexisting embedding ID: c441314d-7112-49d5-
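
If these records turn out to be harmless in a given setup, one way to keep them out of the log is a standard `logging.Filter`. This is a minimal sketch using only the Python standard library. It assumes that the name printed in the lines above, `chromadb.segment.impl.vector.brute_force_index`, is the logger name, and that the code runs in the same process that emits the warning (for the Docker server that means the server process, not the client). `DropNonexistingDelete` and `quiet_delete_warnings` are hypothetical names introduced for illustration.

```python
import logging

# Logger name copied from the log lines above; assumed to be the name the
# emitting module passes to logging.getLogger.
NOISY_LOGGER = "chromadb.segment.impl.vector.brute_force_index"


class DropNonexistingDelete(logging.Filter):
    """Drop only the 'Delete of nonexisting embedding ID' records."""

    def filter(self, record: logging.LogRecord) -> bool:
        # Returning False tells the logging machinery to discard the record.
        return "Delete of nonexisting embedding ID" not in record.getMessage()


def quiet_delete_warnings(logger_name: str = NOISY_LOGGER) -> logging.Filter:
    """Attach the filter to the named logger and return it for later removal."""
    flt = DropNonexistingDelete()
    logging.getLogger(logger_name).addFilter(flt)
    return flt
```

Filtering on the message rather than raising the logger's level keeps any other warnings from that module visible.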

package versions

chromadb==0.4.10
langchain==0.0.225

Running chroma client server with the latest Docker version

  chroma:
    container_name: chroma
    image: ghcr.io/chroma-core/chroma:latest
    volumes:
      - index_data:/chroma/chroma
    environment:
      - IS_PERSISTENT=true
      - CHROMA_SERVER_HTTP_PORT=8000
    restart: unless-stopped
    ports:
      - '8000:8000'
    networks:
      - mynetwork

chrispangg commented 12 months ago

I have the same issue, and running queries on the db triggers this warning every time. What I did was select items with a where clause (no IDs were given) and remove them one by one:

my_collection.delete(
    where={"file_id": str(file_id)}
)

Since then the warning is shown every time I query it.

I am having this exact issue too

tazarov commented 12 months ago

@jeffchuber, @chrispangg, @timothymugayi, @mickey-lyx, As I mentioned above, the issue is benign. Chroma maintains a temporary index of embeddings and flushes it to disk once it reaches a certain threshold. In your example the threshold (100) is reached, so the temporary index is flushed and cleared, and subsequent entries are appended to it. When a delete comes right after the add, Chroma attempts to remove every requested embedding from the temporary index, including IDs it no longer holds, which leads to the warning you see. I have made a fix that checks whether the IDs to be removed are part of the temporary index; if they are not, Chroma will not attempt the deletion.

PR's on the way.
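
The mechanism described in this comment can be illustrated with a toy model. This is a sketch only and not Chroma's actual implementation: `ToyBufferedIndex`, `run_scenario`, and the `check_membership` flag are hypothetical names, and flushing the moment the buffer reaches the threshold is an assumption. The threshold of 100 and the ID layout (46 IDs, then 61 more) come from this thread.

```python
from typing import Iterable, List, Set, Tuple


class ToyBufferedIndex:
    """Toy model of a buffered index; NOT Chroma's actual implementation."""

    def __init__(self, batch_size: int = 100, check_membership: bool = False) -> None:
        self.batch_size = batch_size              # flush threshold (100 in the thread)
        self.check_membership = check_membership  # True models the described fix
        self.temp: Set[str] = set()               # temporary in-memory index
        self.persisted: Set[str] = set()          # stands in for the on-disk index
        self.warnings: List[str] = []

    def add(self, ids: Iterable[str]) -> None:
        for id_ in ids:
            self.temp.add(id_)
            if len(self.temp) >= self.batch_size:
                # Threshold reached: flush to "disk" and clear the buffer.
                self.persisted |= self.temp
                self.temp.clear()

    def delete(self, ids: Iterable[str]) -> None:
        for id_ in ids:
            if id_ in self.temp:
                self.temp.remove(id_)
            elif not self.check_membership:
                # Unfixed behaviour: the temporary index is asked to drop an
                # ID it no longer holds, so it records a warning.
                self.warnings.append(f"Delete of nonexisting embedding ID: {id_}")
            # The delete still reaches the persisted data either way.
            self.persisted.discard(id_)

    def count(self) -> int:
        return len(self.temp) + len(self.persisted)


def run_scenario(check_membership: bool) -> Tuple[int, int, int]:
    """Replay the ID layout from the original report (46 + 61 IDs)."""
    index = ToyBufferedIndex(check_membership=check_membership)
    ids_1 = [str(i) for i in range(1, 47)]
    ids_2 = [str(i) for i in range(47, 108)]
    index.add(ids_1)
    index.add(ids_2)
    before = index.count()
    index.delete(ids_1)
    return before, index.count(), len(index.warnings)
```

In this model `run_scenario(False)` returns counts of 107 and 61 with 46 warnings, the same shape as the log in the original report, while `run_scenario(True)` returns the same counts with no warnings. The data ends up identical in both cases, which is consistent with the warning being benign.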

tazarov commented 10 months ago

@HammadB I think we can close this now.

s-peryt commented 5 months ago

I think this issue is still present. I've just stumbled upon it in my application. I'm using the latest version of Chroma (0.4.24), so the fix from #1150 should probably already be merged.

running-frog commented 4 months ago

I updated to chromadb==0.5.0, but I still have this problem. I do the update using threading:

t=threading.Thread(target=mydb.add_collection_from_file,args=[local_f],daemon=True)
t.start()

tazarov commented 4 months ago

@running-frog, @s-peryt, we have a bug in the HNSW binary index that, under certain conditions, can result in the above errors. There is a PR, #2062, that should resolve this.