amikos-tech / chromadb-data-pipes

ChromaDB Data Pipes 🖇️ - The easiest way to get data into and out of ChromaDB
MIT License

ChromaDB Data Pipes is a collection of tools to build data pipelines for Chroma DB, inspired by the Unix philosophy of "do one thing and do it well".



Install with pip:

pip install chromadb-data-pipes


Get help:

cdp --help

Example Use Cases

The following short list of use cases can help you evaluate whether this is the right tool for your needs:


Import data from HuggingFace Datasets to a .jsonl file:

cdp ds-get "hf://tazarov/chroma-qna?split=train" > chroma-qna.jsonl

Import data from HuggingFace Datasets to Chroma DB:

The command below imports the train split of the given dataset into the chroma-qna collection in Chroma. The collection is created if it does not exist, and documents are upserted.

cdp ds-get "hf://tazarov/chroma-qna?split=train" | cdp import "http://localhost:8000/chroma-qna" --upsert --create

Importing from a directory with PDF files into Local Persisted Chroma DB:

cdp imp pdf sample-data/papers/ | grep "2401.02412.pdf" | head -1 | cdp chunk -s 500 | cdp embed --ef default | cdp import "file://chroma-data/my-pdfs" --upsert --create

Note: The above command reads the PDF files in the sample-data/papers/ directory, keeps the first output record that matches 2401.02412.pdf, chunks it into 500 word chunks, embeds each chunk with the default embedding function, and imports the chunks into the my-pdfs collection in Chroma DB.
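The grep and head -1 steps in the middle of that pipeline are ordinary line filters on a JSONL stream. As a minimal sketch of what they do, the snippet below runs the same two commands on three made-up records. The source and page field names are assumptions for illustration and may not match what cdp imp pdf actually emits.

```shell
# Three hypothetical JSONL records standing in for the output of `cdp imp pdf`.
# The field names are illustrative, not the tool's actual schema.
records='{"source": "papers/2401.02412.pdf", "page": 1}
{"source": "papers/2401.02412.pdf", "page": 2}
{"source": "papers/2312.00001.pdf", "page": 1}'

# grep keeps only the lines that mention the file name;
# head -1 keeps the first of those and drops the rest.
first=$(printf '%s\n' "$records" | grep "2401.02412.pdf" | head -1)
echo "$first"
```

Because . matches any character in a regular expression, grep -F "2401.02412.pdf" is the stricter choice when an exact file name match matters.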


Export data from Local Persisted Chroma DB to a .jsonl file:

The command below exports the first 10 documents from the chroma-qna collection to the chroma-qna.jsonl file.

cdp export "file://chroma-data/chroma-qna" --limit 10 > chroma-qna.jsonl

Export data from Local Persisted Chroma DB to a .jsonl file with a filter:

The command below exports data from a local persisted Chroma DB to a .jsonl file, using a where filter on document metadata to select the documents to export (here, those whose document_id equals "123").

cdp export "file://chroma-data/chroma-qna" --where '{"document_id": "123"}' > chroma-qna.jsonl
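When the filter value comes from a shell variable, the JSON has to be assembled so that its double quotes survive shell quoting. A minimal sketch, assuming the value contains no quotes or backslashes (doc_id is a hypothetical variable; document_id is the field used in the example above):

```shell
# Hypothetical value to filter on.
doc_id="123"

# Build the JSON filter; printf substitutes the value between the inner quotes.
# No JSON escaping is performed, so this is only safe for simple values.
where=$(printf '{"document_id": "%s"}' "$doc_id")
echo "$where"

# The result can then be passed to the export command shown above:
# cdp export "file://chroma-data/chroma-qna" --where "$where" > chroma-qna.jsonl
```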

Export data from Chroma DB to HuggingFace Datasets:

The command below exports 10 documents, starting at offset 10, from the chroma-qna collection to the HuggingFace Datasets dataset tazarov/chroma-qna-modified. The dataset is uploaded to HF.

HF Auth and Privacy: Make sure the HF_TOKEN=hf_.... environment variable is set. To make the dataset private, add the --private flag to the cdp ds-put command.

cdp export "http://localhost:8000/chroma-qna" --limit 10 --offset 10 | cdp ds-put "hf://tazarov/chroma-qna-modified"
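Since the upload depends on HF_TOKEN, it can help to fail early when the variable is missing. The sketch below defines require_hf_token, a hypothetical helper that is not part of cdp, and shows in a comment how it might guard the command above.

```shell
# Hypothetical helper (not part of cdp): succeeds when HF_TOKEN is set
# to a non-empty value, otherwise prints a message and returns 1.
require_hf_token() {
  if [ -z "${HF_TOKEN:-}" ]; then
    echo "HF_TOKEN is not set; export it before running cdp ds-put" >&2
    return 1
  fi
}

# Possible usage: the export pipeline runs only if the check passes.
# require_hf_token && cdp export "http://localhost:8000/chroma-qna" --limit 10 --offset 10 | cdp ds-put "hf://tazarov/chroma-qna-modified"
```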

To export a dataset to a local file instead, use a URI with the file:// prefix:

cdp export "http://localhost:8000/chroma-qna" --limit 10 --offset 10 | cdp ds-put "file://chroma-qna"

File Location: The file path is relative to the current working directory.


Copy documents from one Chroma collection to another and re-embed them:

cdp export "http://localhost:8000/chroma-qna" | cdp embed --ef default | cdp import "http://localhost:8000/chroma-qna-def-emb" --upsert --create

Note: See Embedding Processors for more info about supported embedding functions.

Import dataset from HF to Local Persisted Chroma and embed the documents:

cdp ds-get "hf://tazarov/ds2?split=train" | cdp embed --ef default | cdp import "file://chroma-data/chroma-qna-def-emb-hf" --upsert --create

Chunk Large Documents:

cdp imp pdf sample-data/papers/ | grep "2401.02412.pdf" | head -1 | cdp chunk -s 500


Count the number of documents in a collection:

cdp export "http://localhost:8000/chroma-qna" | wc -l
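Counting lines works here on the assumption that the export writes well-formed JSONL, that is, one JSON record per line, so the line count equals the document count. A small sketch of the same idea on three made-up records (the id field is illustrative only):

```shell
# Three hypothetical records standing in for `cdp export` output.
# wc -l counts the lines; tr strips the padding some wc implementations add.
count=$(printf '%s\n' '{"id": "1"}' '{"id": "2"}' '{"id": "3"}' | wc -l | tr -d ' ')
echo "$count"
```

Newlines inside document text are escaped as \n in JSON strings, so multi-line documents still occupy a single line each.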