chroma-core / chroma

the AI-native open-source embedding database
https://www.trychroma.com/
Apache License 2.0

[Feature Request]: HNSW index pruning #2594

Open tazarov opened 3 months ago

tazarov commented 3 months ago

Describe the problem

HNSW index growth is unbounded (by design, as graph removals are expensive). Chroma currently uses the default configuration for allow_replace_deleted, which is set to False.

A frequently changing index (e.g., many adds/deletes, possibly also updates) causes the size of the HNSW index to grow, and the growth is not limited to the storage footprint: it also affects memory, as the whole index (including deleted IDs) is loaded into memory.
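For reference, upstream hnswlib already supports reusing the slots of deleted elements when allow_replace_deleted is enabled at index creation. A minimal sketch against plain hnswlib (not Chroma's binding; the dimension and parameters are illustrative):

import hnswlib
import numpy as np

dim = 16
index = hnswlib.Index(space="l2", dim=dim)
# allow_replace_deleted=True lets add_items reuse slots of deleted elements.
index.init_index(max_elements=100, ef_construction=200, M=16,
                 allow_replace_deleted=True)

data = np.float32(np.random.rand(50, dim))
index.add_items(data, np.arange(50))

# Soft-delete the first 25 elements: they stay in the graph but are
# filtered out of query results.
for label in range(25):
    index.mark_deleted(label)

# With replace_deleted=True the new items overwrite the deleted slots
# instead of growing the index.
new_data = np.float32(np.random.rand(25, dim))
index.add_items(new_data, np.arange(50, 75), replace_deleted=True)

print(index.get_current_count())  # still 50: deleted slots were reused

With replace_deleted=True the element count stays flat across add/delete churn instead of growing with every cycle; with the default allow_replace_deleted=False, that behavior is unavailable.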

Describe the proposed solution

Three options with varying difficulty:

Alternatives considered

No response

Importance

I cannot use Chroma without it

Additional Information

import os
import uuid

import psutil
import chromadb

client = chromadb.PersistentClient("hnsw_leak")

def get_col():
    return client.get_or_create_collection("hnsw_leak1")

def add_items(ids: list, documents: list):
    col = get_col()
    col.add(ids=ids, documents=documents)

def delete_items(ids: list):
    col = get_col()
    col.delete(ids=ids)

def query_items(query: str):
    col = get_col()
    return col.query(query_texts=[query], n_results=10)

def get_process_memory_usage() -> float:
    # Resident set size of the current process, in MiB.
    return psutil.Process(os.getpid()).memory_info().rss / 1024 / 1024

# Repeatedly add, query, and delete 500 documents; process RSS keeps
# climbing because deleted elements remain in the HNSW graph.
for i in range(100):
    items = {
        "ids": [f"{uuid.uuid4()}" for _ in range(500)],
        "documents": [f"this document is about {uuid.uuid4()}" for _ in range(500)]
    }
    add_items(ids=items["ids"], documents=items["documents"])
    print("After add", get_process_memory_usage())
    query_items("this document is about 4")
    print("After query", get_process_memory_usage())
    delete_items(items["ids"])
    print("After delete", get_process_memory_usage())

# Add one final batch so the persisted index holds 500 live embeddings.
add_items(ids=items["ids"], documents=items["documents"])

At this point, measuring the size of the HNSW index directory reveals that it is much larger than that of a similar collection with only 500 embeddings added once.
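One way to check the on-disk footprint (a small sketch; hnsw_leak is the persist directory from the script above):

import os

def dir_size_mb(path: str) -> float:
    # Sum the sizes of all files under the persist directory.
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            total += os.path.getsize(os.path.join(root, name))
    return total / 1024 / 1024

print(f"{dir_size_mb('hnsw_leak'):.2f} MiB")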

import os

import psutil
import hnswlib

def get_process_memory_usage() -> float:
    # Resident set size of the current process, in MiB.
    return psutil.Process(os.getpid()).memory_info().rss / 1024 / 1024

print("Before loading", get_process_memory_usage())
index = hnswlib.Index(space='l2', dim=384)
# is_persistent_index is specific to Chroma's hnswlib fork (chroma-hnswlib);
# the directory name is the collection's segment UUID and will differ per run.
index.load_index(
    "hnsw_leak/83d9c2ed-5d41-46db-95fb-ad5b1bdcb780/",
    is_persistent_index=True,
    max_elements=501,
)
print("After loading", get_process_memory_usage())
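As a stopgap until pruning lands, the index can be compacted by rebuilding the collection from its live records. A rough sketch, assuming the repro's client and collection names (hnsw_leak1_compact is a hypothetical target); carrying the stored embeddings over avoids re-embedding:

import chromadb

client = chromadb.PersistentClient("hnsw_leak")
old = client.get_or_create_collection("hnsw_leak1")

# get() without ids returns only live records; deleted IDs are not included.
live = old.get(include=["documents", "embeddings"])

# Adding into a fresh collection builds a compact HNSW index.
new = client.get_or_create_collection("hnsw_leak1_compact")
if live["ids"]:
    new.add(
        ids=live["ids"],
        embeddings=live["embeddings"],
        documents=live["documents"],
    )

# Drop the bloated collection once the copy is verified.
client.delete_collection("hnsw_leak1")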
tazarov commented 3 months ago

Refs: