[Open] busbaby opened this issue 1 month ago
@busbaby, let me have a look at this.
@busbaby, I'm able to reproduce this:
(chromadb-hfds-py3.9) [chroma-hfds]cdp export file://testds-100k/chroma-data/test | wc -l
100
let me fix it.
@busbaby, release 0.0.10
should fix the issue.
pip install -U chromadb-data-pipes
or
pip install chromadb-data-pipes==0.0.10
Thank you for the quick response. I can confirm this fix worked:
time cdp export "http://localhost:8000/cases" | wc -l
1053957
real 59m51.970s
user 9m22.754s
sys 0m32.127s
Our team was hoping to use your tool for extraction and backup purposes. However, it takes quite some time to complete. Would it be possible to speed up the export? I can open a separate enhancement request if you like.
@busbaby, can you also try with this:
time cdp export "http://localhost:8000/cases" --batch-size 10000 | wc -l
The --batch-size option controls how large the chunks fetched from Chroma are. Larger chunks take longer to serialize/deserialize, but they reduce the number of round trips and are more efficient overall.
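For intuition, the export can be pictured as a simple offset/limit paging loop, where a larger batch size means fewer round trips. This is a minimal sketch, not cdp's actual implementation; `fetch` is a hypothetical stand-in for whatever client call retrieves a slice of the collection (in Chroma terms, roughly `collection.get(limit=..., offset=...)`):

```python
# Minimal sketch of offset/limit paging (illustration only, not cdp code).
# `fetch(offset, limit)` is a hypothetical callable that returns up to
# `limit` records starting at `offset`.
def export_in_batches(fetch, batch_size):
    offset = 0
    while True:
        batch = fetch(offset, batch_size)
        if not batch:          # empty batch -> collection exhausted
            break
        yield from batch       # stream records to the caller as they arrive
        offset += len(batch)
```

With this shape, the per-request overhead is paid once per batch, which is why a larger `--batch-size` can shorten the total export time.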
Here's how it looks on my system (with a local DB; the HTTP client will add a bit of overhead):
(chromadb-hfds-py3.9) [chroma-hfds] time cdp export file://testds-5M/chroma-data/test --batch-size 10000 | wc -l
5000000
cdp export file://testds-5M/chroma-data/test --batch-size 10000 416.18s user 62.70s system 72% cpu 11:01.79 total
wc -l 1.58s user 5.37s system 1% cpu 11:01.79 total
11 minutes for a 5M-document DB is still not extremely fast, but the next level of optimization is to read the underlying indices directly (sqlite3 for metadata and HNSW for vectors).
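To illustrate the "go to the actual indices" idea, here is a heavily hedged sketch of reading Chroma's persisted sqlite3 file directly with the standard library, bypassing the API entirely. The `embeddings` table name is an assumption about Chroma's internal schema, which is an implementation detail and may change between versions:

```python
# Sketch: counting documents by querying Chroma's on-disk sqlite3 file
# directly (illustration only). ASSUMPTION: the persist directory holds a
# chroma.sqlite3 file containing an `embeddings` table; this is internal
# to Chroma and not a stable public interface.
import sqlite3

def count_documents(db_path: str) -> int:
    """Return the row count of the (assumed) `embeddings` table."""
    con = sqlite3.connect(db_path)
    try:
        (n,) = con.execute("SELECT COUNT(*) FROM embeddings").fetchone()
        return n
    finally:
        con.close()
```

A full export along these lines would also need to join metadata and pull vectors out of the HNSW index files, which is considerably more involved.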
Interesting. I'll play around with the batch size to see if it can perform within the time window we need. Just a thought: would there be a performance boost from offloading each batch fetch to a thread pool, with the main thread aggregating the results? That would be a top-level change without having to get under the hood.
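The thread-pool suggestion could be sketched like this (an illustration only, not cdp code). `fetch(offset, limit)` is a hypothetical callable returning one batch, and `total` would come from something like `collection.count()`:

```python
# Sketch of a thread-pooled export (illustration only, not cdp code).
# `fetch(offset, limit)` is a hypothetical callable returning one batch;
# `total` is the collection size.
from concurrent.futures import ThreadPoolExecutor

def parallel_export(fetch, total, batch_size, workers=4):
    offsets = range(0, total, batch_size)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map preserves submission order, so the main thread can
        # aggregate results in order while several fetches are in flight.
        for batch in pool.map(lambda o: fetch(o, batch_size), offsets):
            yield from batch
```

Because the workers mostly wait on network I/O, threads (rather than processes) are usually enough to overlap the fetches.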
@busbaby, thread-pooling is currently implemented only for imports; I can add it for exports too.
According to your documentation, the following command should "Count the number of documents in a collection":
However, when I try this on my collection, which has over 1,000,000 documents, I only ever get a count of 100.