chroma-core / chroma

the AI-native open-source embedding database
https://www.trychroma.com/
Apache License 2.0

[Bug]: Inserting embeddings is really slow #932

Closed: kmilacic closed this issue 1 year ago

kmilacic commented 1 year ago

What happened?

I am connecting to a Chroma 0.4.3 server through the langchain library. Before this I used v0.3.22 and the speed was okay; the only problems were with Clickhouse and occasional errors. I read your blog post about the new version and decided to test it. However, it behaves strangely: after a couple of insert calls, the insert operation becomes really slow. Here are some speed-related logs:

index document with embedding model: distiluse-base-multilingual-cased-v1
Time elapsed for creating embeddings (total 3602): 128.56343865394592s
Time elapsed for inserting: 885.6111407279968s

index document with embedding model: distiluse-base-multilingual-cased-v1
Time elapsed for creating embeddings (total 260): 30.178582191467285s
Time elapsed for inserting: 63.60845708847046s

index document with embedding model: distiluse-base-multilingual-cased-v1
Time elapsed for creating embeddings (total 4167): 124.1262264251709s
Time elapsed for inserting: 1249.1344158649445s

index document with embedding model: distiluse-base-multilingual-cased-v1
Time elapsed for creating embeddings (total 326): 29.45905041694641s
Time elapsed for inserting: 91.80317664146423s

index document with embedding model: distiluse-base-multilingual-cased-v1
Time elapsed for creating embeddings (total 54): 7.797564506530762s
Time elapsed for inserting: 16.45245385169983s

I am creating the embeddings in my app and then sending them to the Chroma server. Both Chroma and my app run on the same server, in Docker. Waiting 15-20 minutes to insert 3-4 thousand embeddings definitely seems too long. Do you have an idea what the problem could be?
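For context, the insert path on my side looks roughly like this (a simplified sketch, not my exact code; the host, port, and collection name are placeholders):

```python
import time

import chromadb

# Connect to the Chroma 0.4.x server running in Docker on the same host
# (host/port and collection name are placeholders).
client = chromadb.HttpClient(host="localhost", port=8000)
collection = client.get_or_create_collection(name="docs")

def insert_batch(ids, embeddings, documents):
    """Send precomputed embeddings to the server and time the call."""
    start = time.time()
    collection.add(ids=ids, embeddings=embeddings, documents=documents)
    print(f"Time elapsed for inserting: {time.time() - start}s")
```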

Edit: I also noticed that the docker volume doesn't keep the data after the container is destroyed. The line `index_data:/index_data` seems to be the problem. I used the docker-compose.yml from your repository.

Versions

langchain v0.0.240, chromadb v0.4.3, Ubuntu 20.04, Python 3.9

Relevant log output

No response

HammadB commented 1 year ago

Can you share the configuration of your server? (RAM/CPU, etc.)

grungert commented 1 year ago

Hi, I'm a colleague of @kmilacic. The server configuration is:

RAM: 8 GB

    *-memory
        description: System Memory
        physical id: 1000
        size: 8GiB
        capabilities: ecc
        configuration: errordetection=multi-bit-ecc

CPU: 4 cores

    Architecture:          x86_64
    CPU op-mode(s):        32-bit, 64-bit
    Address sizes:         40 bits physical, 48 bits virtual
    Byte Order:            Little Endian
    CPU(s):                4
    On-line CPU(s) list:   0-3
    Vendor ID:             GenuineIntel
    BIOS Vendor ID:        QEMU
    Model name:            DO-Regular
    BIOS Model name:       pc-i440fx-6.1 CPU @ 2.0GHz
    BIOS CPU family:       1
    CPU family:            6
    Model:                 63
    Thread(s) per core:    1
    Core(s) per socket:    4
    Socket(s):             1
    Stepping:              2
    BogoMIPS:              4589.21

ompanda commented 1 year ago

I observed similar behavior with a CSV file of 100,000 records (about 30 MB). It was taking too long, and I was unable to load the complete file.
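For reference, this is roughly the chunked load I am trying (a sketch; the file name, column names, and chunk size are placeholders, and the collection's default embedding function is used server-client side):

```python
import pandas as pd
import chromadb

client = chromadb.HttpClient(host="localhost", port=8000)
collection = client.get_or_create_collection(name="csv_records")  # placeholder

# Stream the CSV in chunks instead of materializing all 100k rows at once;
# "id" and "text" are hypothetical column names.
for chunk in pd.read_csv("records.csv", chunksize=1000):
    collection.add(
        ids=chunk["id"].astype(str).tolist(),
        documents=chunk["text"].tolist(),
    )
```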

jzombie commented 1 year ago

I wanted to point out that @ompanda has raised some related issues in other repositories as well.

Additionally, I'm pondering whether the observed slowness might be linked to swap-file usage due to limited RAM. Running ChromaDB on my M1 Pro (without Docker) with 16 GB RAM, alongside transformers models and spaCy, I've seen significant swap-file activity. After a few queries on a nearly empty database, memory consumption appears to spike considerably. I haven't pinpointed this directly to the database, but it's something to consider.

Given these observations, I'd venture to say that an 8GB RAM setup might not be sufficient for optimal performance, especially if other processes are also drawing on those resources. And if ChromaDB is operating within a Docker container on a virtualized environment, like a Linux container on a Mac, I'd anticipate even more pronounced performance challenges.
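If anyone wants to rule swap in or out, here is a minimal sketch (assuming `psutil` is installed) that samples memory pressure in a separate process while the inserts run:

```python
import time

import psutil

# Sample RAM and swap usage every few seconds while the inserts run.
while True:
    vm = psutil.virtual_memory()
    sw = psutil.swap_memory()
    print(f"RAM: {vm.percent}% of {vm.total / 2**30:.1f} GiB | "
          f"swap: {sw.percent}% ({sw.used / 2**30:.2f} GiB used)")
    time.sleep(5)
```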

jzombie commented 1 year ago

Furthermore, the CPU mentioned here, if this analysis is correct, is from around 2013, and I'm curious how that would also impact performance.

------ [ChatGPT] ------

From the details provided, some key information to identify the release year of a CPU would typically be the "Vendor ID", "Model name", "CPU family", and "Model".

Given that:

- The model number 63 under CPU family 6 corresponds to Intel's Haswell microarchitecture.
- Intel's Haswell processors were introduced in 2013.

Therefore, based on the details provided, the CPU likely belongs to Intel's Haswell generation, which was released in 2013. However, it's important to note that the exact model or specifics might vary, and the data provided also suggests it's running in a virtual environment (e.g., QEMU), so this is a virtual CPU rather than a physical one.

HammadB commented 1 year ago

Hi all, please try again with the release of v0.4.6, which comes with several large performance improvements.

Offek commented 10 months ago

@HammadB Experiencing a similar issue here on version 0.4.18. I'm trying to add many embedding items (in the millions) in batches of 40k. Every run is slower than the previous one, until it becomes really, really slow and simply takes too long to insert. I tried multiple embedding sizes and the issue persists. The machine I'm testing on has a 13th Gen Intel CPU, plenty of RAM, and very fast M.2 storage. I think the issue comes from sqlite or the indexing. I see it is using a single process, and the indexing might be too expensive for so many items.
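For reproduction, this is roughly the benchmark loop I am running (a sketch with placeholder names, a smaller total count, and random vectors instead of real embeddings):

```python
import time

import numpy as np
import chromadb

client = chromadb.HttpClient(host="localhost", port=8000)
collection = client.get_or_create_collection(name="bench")  # placeholder name

DIM = 384        # one of the embedding sizes I tried
BATCH = 40_000   # same batch size as in my real run

for batch_idx in range(25):  # 25 x 40k = 1M items
    embeddings = np.random.rand(BATCH, DIM).astype(np.float32).tolist()
    ids = [f"{batch_idx}-{i}" for i in range(BATCH)]
    start = time.time()
    collection.add(ids=ids, embeddings=embeddings)
    # Per-batch wall time keeps growing as the collection fills up.
    print(f"batch {batch_idx}: {time.time() - start:.1f}s")
```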