chroma-core / chroma

the AI-native open-source embedding database
https://www.trychroma.com/
Apache License 2.0

How can I avoid writing duplicate data #2588

Open Amphetaminewei opened 1 month ago

Amphetaminewei commented 1 month ago

In my scenario, I extract the tags of a file and store each tag's vector. Different files may share tags, and we don't want to store the same tag twice. At present, we use the same ID for the same tag so that its vector is not stored repeatedly, but calling add() often prints: Add of existing embedding ID: *. This makes my log file very large and makes me wonder whether my usage is wrong. Is there a better way to avoid storing duplicate vectors? Or is there a way to eliminate this log message?

tazarov commented 1 month ago

@Amphetaminewei, can you tell me what Chroma version you are using? The warning log entry you encounter is triggered by the add() operation; it is expected to appear if the ID already exists. To avoid this, you can first check whether the ID exists with get(ids=["my_tag_id"],include=[]) and call add() only if it does not. As an alternative, you can use upsert(), but that will update the record in Chroma, which means that your tag's embedding will be regenerated and inserted again (probably not optimal).

As a last resort, you can suppress the warning messages:

import logging

logger = logging.getLogger('chromadb.segment.impl.vector.local_hnsw')
logger1 = logging.getLogger('chromadb.segment.impl.metadata.sqlite')

logger.setLevel(logging.ERROR)
logger1.setLevel(logging.ERROR)
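
A narrower alternative to raising the log level, sketched under the assumption that the message is emitted through Python's logging module on the logger names used above: a logging.Filter that drops only records starting with "Add of existing embedding ID" and lets every other warning through. The class name DropDuplicateIdWarnings is hypothetical, and the logger names may differ depending on Chroma version and client type.

```python
import logging


class DropDuplicateIdWarnings(logging.Filter):
    """Hypothetical filter: hide only the duplicate-ID warning, keep everything else."""

    PREFIX = "Add of existing embedding ID"

    def filter(self, record: logging.LogRecord) -> bool:
        # Returning False drops the record; True lets it through.
        return not record.getMessage().startswith(self.PREFIX)


# Logger names are taken from the snippet above; whether they match your
# Chroma version is an assumption worth checking.
for name in (
    "chromadb.segment.impl.vector.local_hnsw",
    "chromadb.segment.impl.metadata.sqlite",
):
    logging.getLogger(name).addFilter(DropDuplicateIdWarnings())
```

A filter attached to a logger only sees records logged directly on that logger, so it has to be attached to the logger that actually emits the message.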
Amphetaminewei commented 1 month ago

@Amphetaminewei, Can you tell me what Chroma version you are using?

Sorry, I forgot to provide version information. I'm using Chroma 0.5.0. If I use get() or upsert(), will the performance be worse than if I used add() directly? Turning off the warning messages is risky, and if there is no other way, we may have to keep the status quo.

tazarov commented 1 month ago

hey @Amphetaminewei, let's examine the "costs" associated with each:

Overall, I'd say the get() (batch it over many IDs if possible) is a sensible approach, especially if you have many requests that end up in this situation. You can even cache things on the client side to avoid the roundtrip and the SQLite query altogether.
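
A minimal sketch of the batched get() plus client-side cache idea, assuming the collection object exposes get(ids=..., include=[]) returning a dict with an "ids" list, as used earlier in this thread. The class KnownIdCache and its method names are hypothetical, not part of Chroma.

```python
from typing import Any, Iterable, List, Set


class KnownIdCache:
    """Hypothetical client-side cache of IDs already known to be in the collection."""

    def __init__(self, collection: Any) -> None:
        self._collection = collection
        self._known: Set[str] = set()

    def missing(self, ids: Iterable[str]) -> List[str]:
        """Return the IDs from `ids` that are not stored yet, in first-seen order."""
        unique = list(dict.fromkeys(ids))  # de-duplicate, keep order
        unknown = [i for i in unique if i not in self._known]
        if unknown:
            # One batched lookup for all uncached IDs; include=[] asks for IDs only.
            found = self._collection.get(ids=unknown, include=[])["ids"]
            self._known.update(found)
        return [i for i in unique if i not in self._known]

    def mark_added(self, ids: Iterable[str]) -> None:
        """Record IDs after a successful add() so later checks skip the roundtrip."""
        self._known.update(ids)
```

In use, one would call missing() on the candidate tag IDs, pass only the returned IDs to collection.add(), and then call mark_added() with them. The cache can go stale if another process deletes records, so this only fits a setup with a single writer or no deletions.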

Another thing about logging is that you can rotate the logs thus keeping the size low.
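
As a rough illustration of log rotation with the standard library, assuming the application writes its logs through Python's logging module: RotatingFileHandler rolls the file over once it reaches a size limit and keeps a fixed number of old files. The helper name attach_rotating_log, the file path, and the size limits are placeholders.

```python
import logging
from logging.handlers import RotatingFileHandler


def attach_rotating_log(path: str, max_bytes: int = 5_000_000, backups: int = 3) -> RotatingFileHandler:
    """Send root-logger output to `path`, rolling over once it reaches `max_bytes`."""
    handler = RotatingFileHandler(path, maxBytes=max_bytes, backupCount=backups)
    handler.setFormatter(logging.Formatter("%(asctime)s %(name)s %(levelname)s %(message)s"))
    # Attach to the root logger so records propagated from library loggers land here too.
    logging.getLogger().addHandler(handler)
    return handler
```

With backups=3 the disk usage stays bounded at roughly four times max_bytes. Output that reaches the file through shell redirection of stdout or stderr is not covered by this and would need an external rotation tool instead.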

Amphetaminewei commented 1 month ago

@tazarov thank you. I think get() with a cache is the better choice for me; I'll give it a try in my program.

Amphetaminewei commented 1 month ago

hey @tazarov, I tested the methods mentioned above and found a problem. I added a non-existent ID to the collection, and the log still shows "Add of existing embedding ID:". Here's my program and output:

import chromadb

client = chromadb.PersistentClient(path="/home/wangweinan/.local/kylin-ai-business-framework/datamanagement/database/search")
collection = client.get_collection(name="files-tags")
# reply = collection.get()
# print(reply)

vector = [0] * 1024
collection.add(ids=["1000"], documents=["999"], metadatas=[{"tags": "999"}], embeddings=[vector])

output:

(python-env) wangweinan@wangweinan-xiaoxinpro14imh9:~/prj/test-onnxruntime$ python3 ./testchroma.py 
Add of existing embedding ID: 14
Add of existing embedding ID: 15
Add of existing embedding ID: 16
Add of existing embedding ID: 9
Add of existing embedding ID: 30
Add of existing embedding ID: 32
Add of existing embedding ID: 30
Add of existing embedding ID: 32
Add of existing embedding ID: 44
Add of existing embedding ID: 45
Add of existing embedding ID: 1
Add of existing embedding ID: 1
Add of existing embedding ID: 2
Add of existing embedding ID: 1
Add of existing embedding ID: 1
Add of existing embedding ID: 26
Add of existing embedding ID: 27
Add of existing embedding ID: 28
Add of existing embedding ID: 50
Add of existing embedding ID: 64
Add of existing embedding ID: 68
Add of existing embedding ID: 69
Add of existing embedding ID: 1
Add of existing embedding ID: 27
Add of existing embedding ID: 68
Add of existing embedding ID: 88
Add of existing embedding ID: 3
Add of existing embedding ID: 4
Add of existing embedding ID: 6
Add of existing embedding ID: 7
Add of existing embedding ID: 8
Add of existing embedding ID: 84
Add of existing embedding ID: 9
Add of existing embedding ID: 14
Add of existing embedding ID: 15
Add of existing embedding ID: 16
Add of existing embedding ID: 20
Add of existing embedding ID: 21
Add of existing embedding ID: 22
Add of existing embedding ID: 9
Add of existing embedding ID: 1
Add of existing embedding ID: 26
Add of existing embedding ID: 27
Add of existing embedding ID: 28
Add of existing embedding ID: 50
Add of existing embedding ID: 64
Add of existing embedding ID: 68
Add of existing embedding ID: 69
Add of existing embedding ID: 84
Add of existing embedding ID: 88
Add of existing embedding ID: 27
Add of existing embedding ID: 28
Add of existing embedding ID: 50
Add of existing embedding ID: 30
Add of existing embedding ID: 32
Add of existing embedding ID: 42
Add of existing embedding ID: 43
Add of existing embedding ID: 44
Add of existing embedding ID: 45
Add of existing embedding ID: 46
Add of existing embedding ID: 47
Add of existing embedding ID: 50
Add of existing embedding ID: 51
Add of existing embedding ID: 30
Add of existing embedding ID: 32
Add of existing embedding ID: 44
Add of existing embedding ID: 45
Add of existing embedding ID: 60
Add of existing embedding ID: 61
Add of existing embedding ID: 1
Add of existing embedding ID: 1
Add of existing embedding ID: 64
Add of existing embedding ID: 1
Add of existing embedding ID: 2
Add of existing embedding ID: 71
Add of existing embedding ID: 72
Add of existing embedding ID: 73
Add of existing embedding ID: 74
Add of existing embedding ID: 1
Add of existing embedding ID: 26
Add of existing embedding ID: 27
Add of existing embedding ID: 28
Add of existing embedding ID: 50
Add of existing embedding ID: 64
Add of existing embedding ID: 68
Add of existing embedding ID: 69
Add of existing embedding ID: 84
Add of existing embedding ID: 88
Add of existing embedding ID: 1
Add of existing embedding ID: 80
Add of existing embedding ID: 81
Add of existing embedding ID: 82
Add of existing embedding ID: 1
Add of existing embedding ID: 26
Add of existing embedding ID: 27
Add of existing embedding ID: 28
Add of existing embedding ID: 50
Add of existing embedding ID: 64
Add of existing embedding ID: 68
Add of existing embedding ID: 69
Add of existing embedding ID: 84
Add of existing embedding ID: 88
Add of existing embedding ID: 93
Add of existing embedding ID: 95
Add of existing embedding ID: 1
Add of existing embedding ID: 68
Add of existing embedding ID: 84
Add of existing embedding ID: 88
Add of existing embedding ID: 10
Add of existing embedding ID: 11
Add of existing embedding ID: 12
Add of existing embedding ID: 13
Add of existing embedding ID: 14
Add of existing embedding ID: 15
Add of existing embedding ID: 16
Add of existing embedding ID: 9
Add of existing embedding ID: 84
Add of existing embedding ID: 1
Add of existing embedding ID: 26
Add of existing embedding ID: 27
Add of existing embedding ID: 28
Add of existing embedding ID: 50
Add of existing embedding ID: 64
Add of existing embedding ID: 68
Add of existing embedding ID: 69
Add of existing embedding ID: 84
Add of existing embedding ID: 88
Add of existing embedding ID: 129
Add of existing embedding ID: 130
Add of existing embedding ID: 131
Add of existing embedding ID: 132
Add of existing embedding ID: 133
Add of existing embedding ID: 134
Add of existing embedding ID: 30
Add of existing embedding ID: 32
Add of existing embedding ID: 36
Add of existing embedding ID: 149
Add of existing embedding ID: 150
Add of existing embedding ID: 30
Add of existing embedding ID: 32
Add of existing embedding ID: 42
Add of existing embedding ID: 43
Add of existing embedding ID: 44
Add of existing embedding ID: 45
Add of existing embedding ID: 46
Add of existing embedding ID: 47
Add of existing embedding ID: 50
Add of existing embedding ID: 51
Add of existing embedding ID: 30
Add of existing embedding ID: 44
Add of existing embedding ID: 45
Add of existing embedding ID: 54
Add of existing embedding ID: 55
Add of existing embedding ID: 56
Add of existing embedding ID: 57
Add of existing embedding ID: 60
Add of existing embedding ID: 61
Add of existing embedding ID: 164
Add of existing embedding ID: 165
Add of existing embedding ID: 1
Add of existing embedding ID: 64
Add of existing embedding ID: 1
Add of existing embedding ID: 2
Add of existing embedding ID: 68
Add of existing embedding ID: 69
Add of existing embedding ID: 70
Add of existing embedding ID: 75
Add of existing embedding ID: 76
Add of existing embedding ID: 78
Add of existing embedding ID: 1
Add of existing embedding ID: 80
Add of existing embedding ID: 82
Add of existing embedding ID: 1
Add of existing embedding ID: 26
Add of existing embedding ID: 27
Add of existing embedding ID: 28
Add of existing embedding ID: 50
Add of existing embedding ID: 64
Add of existing embedding ID: 68
Add of existing embedding ID: 69
Add of existing embedding ID: 84
Add of existing embedding ID: 88
Add of existing embedding ID: 200
Add of existing embedding ID: 201
Add of existing embedding ID: 93
Add of existing embedding ID: 95
Add of existing embedding ID: 99
Add of existing embedding ID: 1
Add of existing embedding ID: 26
Add of existing embedding ID: 27
Add of existing embedding ID: 28
Add of existing embedding ID: 50
Add of existing embedding ID: 64
Add of existing embedding ID: 68
Add of existing embedding ID: 69
Add of existing embedding ID: 84
Add of existing embedding ID: 88
Add of existing embedding ID: 3
Add of existing embedding ID: 10
Add of existing embedding ID: 11
Add of existing embedding ID: 12
Add of existing embedding ID: 13
Add of existing embedding ID: 14
Add of existing embedding ID: 15
Add of existing embedding ID: 16
Add of existing embedding ID: 9
Add of existing embedding ID: 84
Add of existing embedding ID: 9
Add of existing embedding ID: 26
Add of existing embedding ID: 84
Add of existing embedding ID: 27
Add of existing embedding ID: 129
Add of existing embedding ID: 130
Add of existing embedding ID: 131
Add of existing embedding ID: 132
Add of existing embedding ID: 133
Add of existing embedding ID: 134
Add of existing embedding ID: 135
Add of existing embedding ID: 137
Add of existing embedding ID: 138
Add of existing embedding ID: 242
Add of existing embedding ID: 30
Add of existing embedding ID: 31
Add of existing embedding ID: 32
Add of existing embedding ID: 33
Add of existing embedding ID: 34
Add of existing embedding ID: 35
Add of existing embedding ID: 36
Add of existing embedding ID: 60
Add of existing embedding ID: 149
Add of existing embedding ID: 150
Add of existing embedding ID: 30
Add of existing embedding ID: 32
Add of existing embedding ID: 42
Add of existing embedding ID: 43
Add of existing embedding ID: 44
Add of existing embedding ID: 45
Add of existing embedding ID: 46
Add of existing embedding ID: 47
Add of existing embedding ID: 50
Add of existing embedding ID: 51
Add of existing embedding ID: 28
Add of existing embedding ID: 1
Add of existing embedding ID: 64
Add of existing embedding ID: 1
Add of existing embedding ID: 2
Add of existing embedding ID: 282
Add of existing embedding ID: 283
Add of existing embedding ID: 68
Add of existing embedding ID: 69
Add of existing embedding ID: 70
Add of existing embedding ID: 71
Add of existing embedding ID: 72
Add of existing embedding ID: 73
Add of existing embedding ID: 75
Add of existing embedding ID: 1
Add of existing embedding ID: 288
Add of existing embedding ID: 80
Add of existing embedding ID: 82
Add of existing embedding ID: 1
Add of existing embedding ID: 26
Add of existing embedding ID: 27
Add of existing embedding ID: 28
Add of existing embedding ID: 50
Add of existing embedding ID: 64
Add of existing embedding ID: 68
Add of existing embedding ID: 69
Add of existing embedding ID: 84
Add of existing embedding ID: 88
Add of existing embedding ID: 200
Add of existing embedding ID: 201
Add of existing embedding ID: 93
Add of existing embedding ID: 95
Add of existing embedding ID: 3
Add of existing embedding ID: 4
Add of existing embedding ID: 6
Add of existing embedding ID: 7
Add of existing embedding ID: 8
Add of existing embedding ID: 10
Add of existing embedding ID: 11
Add of existing embedding ID: 12
Add of existing embedding ID: 13
Add of existing embedding ID: 14
Add of existing embedding ID: 17
Add of existing embedding ID: 18
Add of existing embedding ID: 9
Add of existing embedding ID: 84
Add of existing embedding ID: 1
Add of existing embedding ID: 26
Add of existing embedding ID: 27
Add of existing embedding ID: 28
Add of existing embedding ID: 50
Add of existing embedding ID: 64
Add of existing embedding ID: 68
Add of existing embedding ID: 69
Add of existing embedding ID: 84
Add of existing embedding ID: 88
Add of existing embedding ID: 27
Add of existing embedding ID: 129
Add of existing embedding ID: 130
Add of existing embedding ID: 131
Add of existing embedding ID: 132
Add of existing embedding ID: 133
Add of existing embedding ID: 134
Add of existing embedding ID: 135
Add of existing embedding ID: 137
Add of existing embedding ID: 138
Add of existing embedding ID: 341
Add of existing embedding ID: 30
Add of existing embedding ID: 36
Add of existing embedding ID: 28
Add of existing embedding ID: 30
Add of existing embedding ID: 50
Add of existing embedding ID: 50
Add of existing embedding ID: 51
Add of existing embedding ID: 30
Add of existing embedding ID: 32
Add of existing embedding ID: 44
Add of existing embedding ID: 45
Add of existing embedding ID: 54
Add of existing embedding ID: 55
Add of existing embedding ID: 56
Add of existing embedding ID: 57
Add of existing embedding ID: 60
Add of existing embedding ID: 61
Add of existing embedding ID: 1
Add of existing embedding ID: 164
Add of existing embedding ID: 270
Add of existing embedding ID: 271
Add of existing embedding ID: 1
Add of existing embedding ID: 64
Add of existing embedding ID: 1
Add of existing embedding ID: 2
Add of existing embedding ID: 371
Add of existing embedding ID: 372
Add of existing embedding ID: 281
Add of existing embedding ID: 68
Add of existing embedding ID: 69
Add of existing embedding ID: 70
Add of existing embedding ID: 71
Add of existing embedding ID: 72
Add of existing embedding ID: 73
Add of existing embedding ID: 74
Add of existing embedding ID: 284
Add of existing embedding ID: 78
Add of existing embedding ID: 1
Add of existing embedding ID: 288
Add of existing embedding ID: 80
Add of existing embedding ID: 82
Add of existing embedding ID: 200
Add of existing embedding ID: 201
Add of existing embedding ID: 93
Add of existing embedding ID: 203
Add of existing embedding ID: 1
Add of existing embedding ID: 88
Add of existing embedding ID: 1000

chroma version info is:

(python-env) wangweinan@wangweinan-xiaoxinpro14imh9:~/prj/test-onnxruntime$ pip3 show chromadb
Name: chromadb
Version: 0.5.0
Summary: Chroma.
Home-page: 
Author: 
Author-email: Jeff Huber <jeff@trychroma.com>, Anton Troynikov <anton@trychroma.com>
License: 
Location: /usr/share/kylin-ai-business-framework/python-env/lib/python3.12/site-packages
Requires: bcrypt, build, chroma-hnswlib, fastapi, grpcio, importlib-resources, kubernetes, mmh3, numpy, onnxruntime, opentelemetry-api, opentelemetry-exporter-otlp-proto-grpc, opentelemetry-instrumentation-fastapi, opentelemetry-sdk, orjson, overrides, posthog, pydantic, pypika, PyYAML, requests, tenacity, tokenizers, tqdm, typer, typing-extensions, uvicorn
Required-by: 
tazarov commented 1 month ago

@Amphetaminewei, thanks for sharing. I'll have a look and share a sample of my suggestions for you to try out.

Amphetaminewei commented 1 month ago

@tazarov, I modified my sample slightly and found a similar situation:

import chromadb

client = chromadb.PersistentClient(path="/home/wangweinan/.local/kylin-ai-business-framework/datamanagement/database/search")
collection = client.get_collection(name="files-tags")
# reply = collection.get()
# print(reply)

vector = [0] * 1024
# collection.add(ids=["1111"], documents=["999"], metadatas=[{"tags": "999"}], embeddings=[vector])
collection.query(query_embeddings=[vector], n_results=10)

output:

(python-env) wangweinan@wangweinan-xiaoxinpro14imh9:~/prj/test-onnxruntime$ python3 ./testchroma.py 
Add of existing embedding ID: 14
Add of existing embedding ID: 15
Add of existing embedding ID: 16
Add of existing embedding ID: 9
Add of existing embedding ID: 30
Add of existing embedding ID: 32
Add of existing embedding ID: 30
Add of existing embedding ID: 32
Add of existing embedding ID: 44
Add of existing embedding ID: 45
Add of existing embedding ID: 1
Add of existing embedding ID: 1
Add of existing embedding ID: 2
Add of existing embedding ID: 1
Add of existing embedding ID: 1
Add of existing embedding ID: 26
Add of existing embedding ID: 27
Add of existing embedding ID: 28
Add of existing embedding ID: 50
Add of existing embedding ID: 64
Add of existing embedding ID: 68
Add of existing embedding ID: 69
Add of existing embedding ID: 1
Add of existing embedding ID: 27
Add of existing embedding ID: 68
Add of existing embedding ID: 88
Add of existing embedding ID: 3
Add of existing embedding ID: 4
Add of existing embedding ID: 6
Add of existing embedding ID: 7
Add of existing embedding ID: 8
Add of existing embedding ID: 84
Add of existing embedding ID: 9
Add of existing embedding ID: 14
Add of existing embedding ID: 15
Add of existing embedding ID: 16
Add of existing embedding ID: 20
Add of existing embedding ID: 21
Add of existing embedding ID: 22
Add of existing embedding ID: 9
Add of existing embedding ID: 1
Add of existing embedding ID: 26
Add of existing embedding ID: 27
Add of existing embedding ID: 28
Add of existing embedding ID: 50
Add of existing embedding ID: 64
Add of existing embedding ID: 68
Add of existing embedding ID: 69
Add of existing embedding ID: 84
Add of existing embedding ID: 88
Add of existing embedding ID: 27
Add of existing embedding ID: 28
Add of existing embedding ID: 50
Add of existing embedding ID: 30
Add of existing embedding ID: 32
Add of existing embedding ID: 42
Add of existing embedding ID: 43
Add of existing embedding ID: 44
Add of existing embedding ID: 45
Add of existing embedding ID: 46
Add of existing embedding ID: 47
Add of existing embedding ID: 50
Add of existing embedding ID: 51
Add of existing embedding ID: 30
Add of existing embedding ID: 32
Add of existing embedding ID: 44
Add of existing embedding ID: 45
Add of existing embedding ID: 60
Add of existing embedding ID: 61
Add of existing embedding ID: 1
Add of existing embedding ID: 1
Add of existing embedding ID: 64
Add of existing embedding ID: 1
Add of existing embedding ID: 2
Add of existing embedding ID: 71
Add of existing embedding ID: 72
Add of existing embedding ID: 73
Add of existing embedding ID: 74
Add of existing embedding ID: 1
Add of existing embedding ID: 26
Add of existing embedding ID: 27
Add of existing embedding ID: 28
Add of existing embedding ID: 50
Add of existing embedding ID: 64
Add of existing embedding ID: 68
Add of existing embedding ID: 69
Add of existing embedding ID: 84
Add of existing embedding ID: 88
Add of existing embedding ID: 1
Add of existing embedding ID: 80
Add of existing embedding ID: 81
Add of existing embedding ID: 82
Add of existing embedding ID: 1
Add of existing embedding ID: 26
Add of existing embedding ID: 27
Add of existing embedding ID: 28
Add of existing embedding ID: 50
Add of existing embedding ID: 64
Add of existing embedding ID: 68
Add of existing embedding ID: 69
Add of existing embedding ID: 84
Add of existing embedding ID: 88
Add of existing embedding ID: 93
Add of existing embedding ID: 95
Add of existing embedding ID: 1
Add of existing embedding ID: 68
Add of existing embedding ID: 84
Add of existing embedding ID: 88
Add of existing embedding ID: 10
Add of existing embedding ID: 11
Add of existing embedding ID: 12
Add of existing embedding ID: 13
Add of existing embedding ID: 14
Add of existing embedding ID: 15
Add of existing embedding ID: 16
Add of existing embedding ID: 9
Add of existing embedding ID: 84
Add of existing embedding ID: 1
Add of existing embedding ID: 26
Add of existing embedding ID: 27
Add of existing embedding ID: 28
Add of existing embedding ID: 50
Add of existing embedding ID: 64
Add of existing embedding ID: 68
Add of existing embedding ID: 69
Add of existing embedding ID: 84
Add of existing embedding ID: 88
Add of existing embedding ID: 129
Add of existing embedding ID: 130
Add of existing embedding ID: 131
Add of existing embedding ID: 132
Add of existing embedding ID: 133
Add of existing embedding ID: 134
Add of existing embedding ID: 30
Add of existing embedding ID: 32
Add of existing embedding ID: 36
Add of existing embedding ID: 149
Add of existing embedding ID: 150
Add of existing embedding ID: 30
Add of existing embedding ID: 32
Add of existing embedding ID: 42
Add of existing embedding ID: 43
Add of existing embedding ID: 44
Add of existing embedding ID: 45
Add of existing embedding ID: 46
Add of existing embedding ID: 47
Add of existing embedding ID: 50
Add of existing embedding ID: 51
Add of existing embedding ID: 30
Add of existing embedding ID: 44
Add of existing embedding ID: 45
Add of existing embedding ID: 54
Add of existing embedding ID: 55
Add of existing embedding ID: 56
Add of existing embedding ID: 57
Add of existing embedding ID: 60
Add of existing embedding ID: 61
Add of existing embedding ID: 164
Add of existing embedding ID: 165
Add of existing embedding ID: 1
Add of existing embedding ID: 64
Add of existing embedding ID: 1
Add of existing embedding ID: 2
Add of existing embedding ID: 68
Add of existing embedding ID: 69
Add of existing embedding ID: 70
Add of existing embedding ID: 75
Add of existing embedding ID: 76
Add of existing embedding ID: 78
Add of existing embedding ID: 1
Add of existing embedding ID: 80
Add of existing embedding ID: 82
Add of existing embedding ID: 1
Add of existing embedding ID: 26
Add of existing embedding ID: 27
Add of existing embedding ID: 28
Add of existing embedding ID: 50
Add of existing embedding ID: 64
Add of existing embedding ID: 68
Add of existing embedding ID: 69
Add of existing embedding ID: 84
Add of existing embedding ID: 88
Add of existing embedding ID: 200
Add of existing embedding ID: 201
Add of existing embedding ID: 93
Add of existing embedding ID: 95
Add of existing embedding ID: 99
Add of existing embedding ID: 1
Add of existing embedding ID: 26
Add of existing embedding ID: 27
Add of existing embedding ID: 28
Add of existing embedding ID: 50
Add of existing embedding ID: 64
Add of existing embedding ID: 68
Add of existing embedding ID: 69
Add of existing embedding ID: 84
Add of existing embedding ID: 88
Add of existing embedding ID: 3
Add of existing embedding ID: 10
Add of existing embedding ID: 11
Add of existing embedding ID: 12
Add of existing embedding ID: 13
Add of existing embedding ID: 14
Add of existing embedding ID: 15
Add of existing embedding ID: 16
Add of existing embedding ID: 9
Add of existing embedding ID: 84
Add of existing embedding ID: 9
Add of existing embedding ID: 26
Add of existing embedding ID: 84
Add of existing embedding ID: 27
Add of existing embedding ID: 129
Add of existing embedding ID: 130
Add of existing embedding ID: 131
Add of existing embedding ID: 132
Add of existing embedding ID: 133
Add of existing embedding ID: 134
Add of existing embedding ID: 135
Add of existing embedding ID: 137
Add of existing embedding ID: 138
Add of existing embedding ID: 242
Add of existing embedding ID: 30
Add of existing embedding ID: 31
Add of existing embedding ID: 32
Add of existing embedding ID: 33
Add of existing embedding ID: 34
Add of existing embedding ID: 35
Add of existing embedding ID: 36
Add of existing embedding ID: 60
Add of existing embedding ID: 149
Add of existing embedding ID: 150
Add of existing embedding ID: 30
Add of existing embedding ID: 32
Add of existing embedding ID: 42
Add of existing embedding ID: 43
Add of existing embedding ID: 44
Add of existing embedding ID: 45
Add of existing embedding ID: 46
Add of existing embedding ID: 47
Add of existing embedding ID: 50
Add of existing embedding ID: 51
Add of existing embedding ID: 28
Add of existing embedding ID: 1
Add of existing embedding ID: 64
Add of existing embedding ID: 1
Add of existing embedding ID: 2
Add of existing embedding ID: 282
Add of existing embedding ID: 283
Add of existing embedding ID: 68
Add of existing embedding ID: 69
Add of existing embedding ID: 70
Add of existing embedding ID: 71
Add of existing embedding ID: 72
Add of existing embedding ID: 73
Add of existing embedding ID: 75
Add of existing embedding ID: 1
Add of existing embedding ID: 288
Add of existing embedding ID: 80
Add of existing embedding ID: 82
Add of existing embedding ID: 1
Add of existing embedding ID: 26
Add of existing embedding ID: 27
Add of existing embedding ID: 28
Add of existing embedding ID: 50
Add of existing embedding ID: 64
Add of existing embedding ID: 68
Add of existing embedding ID: 69
Add of existing embedding ID: 84
Add of existing embedding ID: 88
Add of existing embedding ID: 200
Add of existing embedding ID: 201
Add of existing embedding ID: 93
Add of existing embedding ID: 95
Add of existing embedding ID: 3
Add of existing embedding ID: 4
Add of existing embedding ID: 6
Add of existing embedding ID: 7
Add of existing embedding ID: 8
Add of existing embedding ID: 10
Add of existing embedding ID: 11
Add of existing embedding ID: 12
Add of existing embedding ID: 13
Add of existing embedding ID: 14
Add of existing embedding ID: 17
Add of existing embedding ID: 18
Add of existing embedding ID: 9
Add of existing embedding ID: 84
Add of existing embedding ID: 1
Add of existing embedding ID: 26
Add of existing embedding ID: 27
Add of existing embedding ID: 28
Add of existing embedding ID: 50
Add of existing embedding ID: 64
Add of existing embedding ID: 68
Add of existing embedding ID: 69
Add of existing embedding ID: 84
Add of existing embedding ID: 88
Add of existing embedding ID: 27
Add of existing embedding ID: 129
Add of existing embedding ID: 130
Add of existing embedding ID: 131
Add of existing embedding ID: 132
Add of existing embedding ID: 133
Add of existing embedding ID: 134
Add of existing embedding ID: 135
Add of existing embedding ID: 137
Add of existing embedding ID: 138
Add of existing embedding ID: 341
Add of existing embedding ID: 30
Add of existing embedding ID: 36
Add of existing embedding ID: 28
Add of existing embedding ID: 30
Add of existing embedding ID: 50
Add of existing embedding ID: 50
Add of existing embedding ID: 51
Add of existing embedding ID: 30
Add of existing embedding ID: 32
Add of existing embedding ID: 44
Add of existing embedding ID: 45
Add of existing embedding ID: 54
Add of existing embedding ID: 55
Add of existing embedding ID: 56
Add of existing embedding ID: 57
Add of existing embedding ID: 60
Add of existing embedding ID: 61
Add of existing embedding ID: 1
Add of existing embedding ID: 164
Add of existing embedding ID: 270
Add of existing embedding ID: 271
Add of existing embedding ID: 1
Add of existing embedding ID: 64
Add of existing embedding ID: 1
Add of existing embedding ID: 2
Add of existing embedding ID: 371
Add of existing embedding ID: 372
Add of existing embedding ID: 281
Add of existing embedding ID: 68
Add of existing embedding ID: 69
Add of existing embedding ID: 70
Add of existing embedding ID: 71
Add of existing embedding ID: 72
Add of existing embedding ID: 73
Add of existing embedding ID: 74
Add of existing embedding ID: 284
Add of existing embedding ID: 78
Add of existing embedding ID: 1
Add of existing embedding ID: 288
Add of existing embedding ID: 80
Add of existing embedding ID: 82
Add of existing embedding ID: 200
Add of existing embedding ID: 201
Add of existing embedding ID: 93
Add of existing embedding ID: 203
Add of existing embedding ID: 1
Add of existing embedding ID: 88
Add of existing embedding ID: 1000

I suspect this has something to do with the IDs I'm using; in other collections I use UUIDs and have not seen this issue. I will experiment further.

tazarov commented 1 month ago

@Amphetaminewei, you are right that IDs must be unique. Using UUIDs will most certainly generate unique IDs, thus avoiding the warning message above.
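
One distinction that may help here, sketched with the standard uuid module: a random UUID (uuid4) is different on every call, so the same tag would get a new ID each time, whereas a name-based UUID (uuid5) is derived from its input and is therefore the same for the same tag. The namespace string and the helper tag_id below are hypothetical.

```python
import uuid

# Hypothetical namespace for this application's tags; any fixed UUID works,
# as long as it never changes between runs.
TAG_NAMESPACE = uuid.uuid5(uuid.NAMESPACE_DNS, "tags.example.invalid")


def tag_id(tag: str) -> str:
    """Derive a stable ID from the tag text: same tag in, same ID out."""
    return str(uuid.uuid5(TAG_NAMESPACE, tag))
```

A stable ID like this keeps the "same ID for the same tag" property from the original question while still having the UUID format.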

Let me step back for a second and try to grasp your problem domain. Looking at this:

collection.add(ids=["1111"], documents=["999"], metadatas=[{"tags": "999"}], embeddings=[vector])

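Are the following assumptions correct?

You don't care about the ID (which may be why you used UUID in the past)

You want the tag "999" (metadatas=[{"tags": "999"}]) to be unique? (by extension, the vector and the document are also unique, correct?)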

A clarifying question: Is your tag a single ID like "999" or can there be more (e.g. metadata=[{"tags":"999,1000..."}])?

Amphetaminewei commented 1 month ago

@tazarov, first of all, to answer your questions:

You don't care about the ID (which may be why you used UUID in the past)

I used potentially duplicate IDs to avoid adding duplicate embeddings; for embeddings that couldn't be duplicated, I used UUIDs.

You want the tag "999" (metadatas=[{"tags": "999"}]) to be unique? (by extension, the vector and the document are also unique, correct?)

yep

Is your tag a single ID like "999" or can there be more (e.g. metadata=[{"tags":"999,1000..."}])?

I only have one metadatas=[{"tag_id":"999"}] entry for each embedding. In fact, this ID was added to avoid adding duplicate embeddings: I get all the tag_id values in the current collection via get() and check whether the tag_id already exists in the collection before invoking add().

I think the problem was that I wanted to avoid writing duplicate vectors by ID; later I thought it would be better to use get() with metadata. What intrigues me is that, in my case, "Add of existing embedding ID" appears both when I add() the non-existent ID "999" and when querying, and when querying, the log does not point to the ID I queried. Is this related to the caching of error logs?

Ao-Last commented 1 month ago

I have not read the whole thread carefully, but I guess using a separate record manager is one way to avoid duplicate additions. LangChain's indexing API is an example.

tazarov commented 1 month ago

@Ao-Last, LC's indexing looks like an interesting proposition. As with everything, there are trade-offs:

After trying to understand the problem domain I feel this can be solved quite easily with a simple get() prior to adding.

@Amphetaminewei, use fixed IDs to ensure you get a warning and the addition is ignored by Chroma, then do the following:


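tag_to_add = "999"

# Look up the tag by metadata first; include=[] returns only IDs, which keeps the lookup cheap.
results = collection.get(where={"tags": tag_to_add}, include=[])
if len(results["ids"]) == 0:
    # Tag not stored yet: add it, using the tag itself as the fixed ID.
    collection.add(ids=[tag_to_add], documents=[tag_to_add], metadatas=[{"tags": tag_to_add}], embeddings=[vector])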

The get() operation is relatively inexpensive and also quite fast.

Amphetaminewei commented 1 month ago

I'm not using LC, and adding more Python packages would make my deployment more complicated. I think I know what to do. Thank you, @tazarov.