amikos-tech / chromadb-data-pipes

ChromaDB Data Pipes 🖇️ - The easiest way to get data into and out of ChromaDB
https://datapipes.chromadb.dev/
MIT License

cdp export only returns 100 documents #147

Open busbaby opened 1 month ago

busbaby commented 1 month ago

According to your documentation, the following command should "Count the number of documents in a collection":

cdp export "http://localhost:8000/chroma-qna" | wc -l

However, when I try this on my collection, which has over 1,000,000 documents, I only ever get a count of 100.

tazarov commented 1 month ago

@busbaby, let me have a look at this.

tazarov commented 1 month ago

@busbaby, I'm able to reproduce this:

(chromadb-hfds-py3.9) [chroma-hfds] cdp export file://testds-100k/chroma-data/test | wc -l
     100

Let me fix it.

tazarov commented 1 month ago

@busbaby, release 0.0.10 should fix the issue.

pip install -U chromadb-data-pipes

or

pip install chromadb-data-pipes==0.0.10

busbaby commented 1 month ago

Thank you for the quick response and reaction. I can confirm this fix worked:

time cdp export "http://localhost:8000/cases" | wc -l
1053957

real    59m51.970s
user    9m22.754s
sys     0m32.127s

Our team was hoping to use your tool for extraction and backup purposes. However, it does take quite some time to complete. Would it be possible to speed up export? I can open a separate enhancement request if you like.

tazarov commented 1 month ago

@busbaby, can you also try with this:

time cdp export "http://localhost:8000/cases" --batch-size 10000 | wc -l

The --batch-size option controls how large a batch the tool fetches from Chroma in each request. Note that larger batches each take longer to serialize and deserialize, but are more efficient overall.
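
To illustrate why the batch size matters, here is a minimal sketch of a paged export loop. It is not the actual cdp implementation: `export_in_batches`, `fake_fetch`, and the in-memory `RECORDS` list are hypothetical stand-ins, and the default of 100 is purely illustrative. Against a real collection, `fetch_page` could plausibly wrap a call such as chromadb's `collection.get(limit=..., offset=...)`, but that wiring is an assumption here.

```python
from typing import Callable, Dict, Iterator, List, Tuple


def export_in_batches(
    fetch_page: Callable[[int, int], List[Dict]],
    batch_size: int = 100,  # illustrative default, not necessarily cdp's
) -> Iterator[Dict]:
    """Yield every record, requesting `batch_size` records per call."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    offset = 0
    while True:
        page = fetch_page(offset, batch_size)
        yield from page
        # A short (or empty) page means the collection is exhausted.
        if len(page) < batch_size:
            break
        offset += len(page)


# Hypothetical in-memory stand-in for a collection of 1,050 documents.
RECORDS: List[Dict] = [{"id": str(i)} for i in range(1050)]
CALLS: List[Tuple[int, int]] = []  # records every (offset, limit) request


def fake_fetch(offset: int, limit: int) -> List[Dict]:
    CALLS.append((offset, limit))
    return RECORDS[offset:offset + limit]


exported = list(export_in_batches(fake_fetch, batch_size=100))
print(len(exported), "records in", len(CALLS), "requests")
```

The loop keeps paging until a short page comes back, so every record is exported regardless of batch size. A larger batch size only reduces the number of round trips, at the cost of bigger payloads per request.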

tazarov commented 1 month ago

Here's how it looks on my system (with a local DB; the HTTP client will add a bit of overhead):

(chromadb-hfds-py3.9) [chroma-hfds] time cdp export file://testds-5M/chroma-data/test --batch-size 10000| wc -l
 5000000
cdp export file://testds-5M/chroma-data/test --batch-size 10000  416.18s user 62.70s system 72% cpu 11:01.79 total
wc -l  1.58s user 5.37s system 1% cpu 11:01.79 total

That's 11 minutes for a 5M-document DB. Still not extremely fast, but the next level of optimization is to read the underlying indices directly (sqlite3 for metadata and HNSW for vectors).
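
For a rough sense of scale, the two timings reported in this thread can be converted into documents per second. This is only back-of-the-envelope arithmetic on the wall-clock figures above; `docs_per_second` is a helper named here for illustration. The two runs differ in more than batch size (HTTP server versus local file, different datasets and machines), so the ratio is not a like-for-like comparison.

```python
def docs_per_second(docs: int, minutes: int, seconds: float) -> float:
    """Average throughput from a document count and a wall-clock time."""
    return docs / (minutes * 60 + seconds)


# 1,053,957 documents over HTTP in 59m51.970s (wall-clock "real" time).
http_rate = docs_per_second(1_053_957, 59, 51.970)

# 5,000,000 documents from a local DB in 11:01.79 with --batch-size 10000.
local_rate = docs_per_second(5_000_000, 11, 1.79)

print(f"HTTP export:  {http_rate:.0f} docs/s")
print(f"Local export: {local_rate:.0f} docs/s")
```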

busbaby commented 1 month ago

Interesting. I'll play around with the batch size to see if it can perform within the time window we need. Just a thought: would there be a performance boost from offloading each batch fetch to a thread pool, with the main thread aggregating the results? That would be a top-level change without having to get under the hood.

tazarov commented 1 month ago

@busbaby, I have thread pooling only for imports:

https://github.com/amikos-tech/chromadb-data-pipes/blob/4c47bdb56ef76a8121331bb1f0dec82f59b9dc2b/chroma_dp/chroma/chroma_import.py#L111

I can add it for exports too.
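
The thread-pool idea discussed above could look roughly like the following sketch. It is not the code in `chroma_import.py` or any planned cdp change: `parallel_export`, `fake_fetch`, and `RECORDS` are hypothetical names, and the sketch assumes the total document count is known up front (for example from a count call on the collection). Whether threads actually help depends on whether the export is bound by I/O or by CPU-side serialization.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, Iterator, List


def parallel_export(
    fetch_page: Callable[[int, int], List[Dict]],
    total: int,
    batch_size: int,
    max_workers: int = 4,
) -> Iterator[Dict]:
    """Fetch fixed-size pages on worker threads and yield records in order."""
    offsets = range(0, total, batch_size)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Executor.map yields results in the order of `offsets`, so the
        # output order matches a sequential export even if pages finish
        # out of order.
        pages = pool.map(lambda off: fetch_page(off, batch_size), offsets)
        for page in pages:
            yield from page


# Hypothetical in-memory stand-in for a collection of 1,050 documents.
RECORDS: List[Dict] = [{"id": str(i)} for i in range(1050)]


def fake_fetch(offset: int, limit: int) -> List[Dict]:
    return RECORDS[offset:offset + limit]


exported = list(parallel_export(fake_fetch, total=len(RECORDS), batch_size=100))
print(len(exported), "records exported")
```

One design caveat: `Executor.map` submits all page requests immediately, so a slow consumer could leave many fetched pages buffered in memory. A bounded queue or submitting pages in windows would limit that.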