Describe the problem
HNSW index growth is unbounded (which is by design, as graph removals are expensive). Chroma currently uses the default configuration for allow_replace_deleted, which is set to False.
A frequently changing index (e.g., many add/delete operations, possibly also updates) causes the HNSW index to grow. The growth is not limited to the storage footprint: it also affects memory, as the whole index (including deleted IDs) is loaded into memory.
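As a minimal illustration of the underlying behaviour, here is a sketch against hnswlib directly (the library Chroma's binary index builds on); the dimensions and parameters below are arbitrary:

import hnswlib
import numpy as np

dim = 16
index = hnswlib.Index(space="l2", dim=dim)
index.init_index(max_elements=1000, ef_construction=100, M=16)

# Add 500 vectors, then mark all of them deleted.
data = np.random.rand(500, dim).astype(np.float32)
index.add_items(data, np.arange(500))
for label in range(500):
    index.mark_deleted(label)

# mark_deleted only tombstones entries: they are filtered out at query
# time but keep occupying graph slots, memory, and disk space.
print(index.get_current_count())  # still 500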
Describe the proposed solution
Three options with varying difficulty:
1. Enable allow_replace_deleted (low complexity; also backward compatible if we add it to the default collection config) - note: the performance of this approach should be measured; see the first sketch after this list.
2. Rebuild the index (medium complexity) - add a new CLI command to rebuild the binary index by copying all active vectors and labels to a new index - note: slow and memory-consuming (at most 2x), but might (to be tested) offer better performance; see the second sketch after this list.
3. Add pruning functionality to the HNSW lib directly (high complexity).
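For reference, this is what option 1 looks like at the hnswlib level (a sketch only; Chroma does not currently expose allow_replace_deleted, which is the point of this request):

import hnswlib
import numpy as np

dim = 16
index = hnswlib.Index(space="l2", dim=dim)
# The index must be created with allow_replace_deleted=True for slot
# reuse to be possible later.
index.init_index(max_elements=1000, ef_construction=100, M=16,
                 allow_replace_deleted=True)

index.add_items(np.random.rand(500, dim).astype(np.float32), np.arange(500))
for label in range(500):
    index.mark_deleted(label)

# With replace_deleted=True, new vectors reuse the slots of deleted
# entries instead of growing the index.
index.add_items(np.random.rand(500, dim).astype(np.float32),
                np.arange(500, 1000), replace_deleted=True)
print(index.get_current_count())  # 500, not 1000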
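And a rough sketch of option 2, again at the hnswlib level, assuming the set of active (non-deleted) labels can be enumerated (e.g., from the SQLite metadata); index parameters are simplified to keyword arguments here:

import hnswlib
import numpy as np

def rebuild_index(old_index, active_labels, ef_construction=100, M=16):
    # Copy only the active vectors and labels into a fresh index; the
    # old graph, including all tombstoned entries, is dropped. Peak
    # memory is roughly old index + new index, i.e. at most ~2x.
    vectors = np.asarray(old_index.get_items(active_labels), dtype=np.float32)
    new_index = hnswlib.Index(space=old_index.space, dim=old_index.dim)
    new_index.init_index(max_elements=len(active_labels),
                         ef_construction=ef_construction, M=M)
    new_index.add_items(vectors, np.asarray(active_labels))
    return new_index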
Alternatives considered
No response
Importance
I cannot use Chroma without it
Additional Information
import os
import uuid

import psutil

import chromadb

client = chromadb.PersistentClient("hnsw_leak")

def get_col():
    return client.get_or_create_collection("hnsw_leak1")

def add_items(ids, documents):
    col = get_col()
    col.add(ids=ids, documents=documents)

def delete_items(ids: list):
    col = get_col()
    col.delete(ids=ids)

def query_items(query):
    col = get_col()
    return col.query(query_texts=[query], n_results=10)

def get_process_memory_usage() -> float:
    # Resident set size of this process, in MiB.
    return psutil.Process(os.getpid()).memory_info().rss / 1024 / 1024

# Each iteration adds 500 new documents, queries, deletes them, and
# re-adds them. Because deleted entries are never reclaimed, every
# iteration consumes ~1000 index slots while only 500 documents stay live.
for _ in range(100):
    items = {
        "ids": [f"{uuid.uuid4()}" for _ in range(500)],
        "documents": [f"this document is about {uuid.uuid4()}" for _ in range(500)],
    }
    add_items(ids=items["ids"], documents=items["documents"])
    print("After add", get_process_memory_usage())
    query_items("this document is about 4")
    print("After query", get_process_memory_usage())
    delete_items(items["ids"])
    print("After delete", get_process_memory_usage())
    add_items(ids=items["ids"], documents=items["documents"])
At this point, measuring the size of the HNSW index directory will reveal that it is much larger than that of a similar collection with only 500 added embeddings.
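For reference, a simple way to measure that footprint (standard library only; "hnsw_leak" is the persist directory passed to PersistentClient above):

import os

def dir_size_mb(path: str) -> float:
    # Total size of all files under `path`, in MiB.
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            total += os.path.getsize(os.path.join(root, name))
    return total / 1024 / 1024

print(dir_size_mb("hnsw_leak"))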