chroma-core / chroma

the AI-native open-source embedding database
https://www.trychroma.com/
Apache License 2.0

[Feature Request]: easy chunking, and token limit based retrieval #430

Open swyxio opened 1 year ago

swyxio commented 1 year ago

EDIT: I understand that both of the following are "merely" nice-to-have features, but I think that part of Chroma's appeal is being able to go pretty far out of the box. What's at stake here is deciding what counts as "table stakes" for an open-source vector database with great developer experience in 2023, and, if these features are easy to implement, making them work out of the box in Chroma.

I think the following are both easy and high value.

These look like two feature requests, but I have combined them into one because they are the two most common pain points in the read and write experience with Chroma right now: each either causes errors down the line or necessitates wrapper libraries.


Proposal 1: Easy/Default Chunking

Describe the problem

People are inadvisably throwing entire documents into embedding functions because they don't know about chunking, which leads to bad out-of-the-box results.

Each embedding function is known-good for a range of chunk sizes (e.g. 128 to 1024 tokens).

Describe the proposed solution

Just like we offer sentence-transformer embeddings out of the box, we could ship some basic chunking strategies that are either:
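As an illustration of the simplest such strategy, here is a minimal sketch of fixed-size chunking with overlap. Everything in it is an assumption for illustration: the function name, the default sizes, and the use of whitespace splitting as a stand-in for real tokenization (a production version would count tokens with the embedding function's own tokenizer).

```python
def chunk_text(text, chunk_size=256, overlap=32):
    """Split text into overlapping chunks of roughly chunk_size words.

    Word-based splitting is a crude proxy for tokens; it only sketches
    the windowing logic, not a real tokenizer.
    """
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        piece = words[start:start + chunk_size]
        if piece:
            chunks.append(" ".join(piece))
        # Stop once this window has reached the end of the document.
        if start + chunk_size >= len(words):
            break
    return chunks


# Usage: chunk a long document before handing pieces to an embedding function.
docs = chunk_text("some long document text " * 400, chunk_size=128, overlap=16)
```

Each chunk shares `overlap` words with its neighbor so that sentences straddling a boundary still appear intact in at least one chunk.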


Proposal 2: Token limit based retrieval

Describe the problem

LLM APIs have limited context windows (text-davinci-003: 4,097 tokens). Many people will happily query docs with `n_results` limits and then dump the results straight into the prompt.

Describe the proposed solution

```python
collection.query(query_texts=[query], max_tokens=3000)
```

Of course, there's nuance in how you truncate and distribute these, but we win by making good default decisions for people.
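Since `max_tokens` is not an existing Chroma parameter, here is a hedged sketch of what a user-side wrapper could do today: query normally, then greedily keep the closest documents until a token budget is exhausted. The function name, the `n_results` over-fetch of 20, and the word-count token estimate are all illustrative assumptions; a real version would count tokens with the target LLM's tokenizer (e.g. tiktoken).

```python
def query_with_token_limit(collection, query_text, max_tokens=3000, n_results=20):
    """Query a Chroma collection, then trim results to a token budget.

    Over-fetches n_results candidates and keeps them in returned order
    (closest first) until max_tokens is spent. Word count stands in for
    a real token count here.
    """
    res = collection.query(query_texts=[query_text], n_results=n_results)
    kept, budget = [], max_tokens
    for doc in res["documents"][0]:  # results come back ordered by distance
        cost = len(doc.split())
        if cost > budget:
            break
        kept.append(doc)
        budget -= cost
    return kept
```

Dropping the first document that overflows (rather than truncating it mid-text) is one of the "nuance" decisions mentioned above; truncating the last document to exactly fill the budget would be the other reasonable default.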

tazarov commented 1 year ago

Hey @swyxio,

Regarding Proposal 1: I think this is something the team is actively considering. However, until "chunking" arrives, I think we can still have some tiny but effective strategies to let developers know about the max input size of an EF:

Note: this is not a silver bullet, but it still gives users some indication. I acknowledge that for HF models there are innumerable EFs, each with its own input size, but let's leave that aside for a moment.
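One possible shape for such a strategy is sketched below: wrap an embedding function with its known max input size so that oversized documents raise a warning instead of silently degrading. This is not Chroma's API; the wrapper class, the `max_words` limit, and the word-based counting are all illustrative assumptions.

```python
import warnings


class MaxSizeAwareEF:
    """Wrap an embedding function with a known max input size.

    Oversized inputs emit a warning and are truncated before being
    passed to the wrapped function. Word count is a crude stand-in
    for the model's real token limit.
    """

    def __init__(self, ef, max_words=256):
        self._ef = ef
        self._max = max_words

    def __call__(self, texts):
        checked = []
        for t in texts:
            words = t.split()
            if len(words) > self._max:
                warnings.warn(
                    f"input of {len(words)} words exceeds this embedding "
                    f"function's max of {self._max}; truncating"
                )
                t = " ".join(words[: self._max])
            checked.append(t)
        return self._ef(checked)
```

Even just the warning, without the truncation, would surface the problem Proposal 1 describes: users embedding whole documents without realizing the model's effective input window.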

Regarding your second proposal: I feel this is more of a problem for LangChain and LlamaIndex than for Chroma. But the point still stands that good docs, warnings, etc. could go a long way for DX.

It is important to note that Chroma has no concept of "tokens" at the document level. That said, there is possibly room for some tooling that achieves what you're asking for.

jeffchuber commented 1 year ago

Will be addressed by the Pipelines CIP and its descendants.