Status: Open · swyxio opened this issue 1 year ago
Hey @swyxio,
Regarding Proposal 1: I think this is something the team is actively considering. However, until "chunking" arrives, I think we can still have some tiny but effective strategies to let developers know about the max input size of an EF:
`max_input_size() -> int`
that will return the maximum input size of each EF. Note: this is not a silver bullet, but it still gives users some indication. I acknowledge that for HF models there are "innumerable" EFs, each with its own input size, but let's leave that aside for a moment.
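Purely as an illustration of the idea (this is a sketch, not Chroma's actual `EmbeddingFunction` interface, and `max_input_size` does not exist today):

```python
# Hypothetical sketch: exposing a max input size on an embedding function.
# The class names and the max_input_size() method are illustrative only.
from abc import ABC, abstractmethod


class EmbeddingFunction(ABC):
    @abstractmethod
    def __call__(self, texts: list[str]) -> list[list[float]]:
        ...

    def max_input_size(self) -> int:
        """Maximum input size (in tokens) the underlying model can safely embed."""
        raise NotImplementedError


class SentenceTransformerEF(EmbeddingFunction):
    def __call__(self, texts: list[str]) -> list[list[float]]:
        ...  # delegate to sentence-transformers

    def max_input_size(self) -> int:
        # e.g. all-MiniLM-L6-v2 truncates inputs at 256 tokens
        return 256
```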
Regarding your second proposal: I feel this is more of a problem for LangChain and LlamaIndex than for Chroma. But the point still stands that good docs, warnings, etc. could go a long way for DX.
It is important to note that in Chroma, at the document level, there is no concept of "tokens". That said, there is possibly room for some tooling that can achieve what you're asking for.
Will be addressed by the Pipelines CIP and its descendants.
EDIT: I understand that both of the following are "merely" nice-to-have features, but I think that part of Chroma's appeal is being able to go pretty far out of the box. What's at stake here is deciding what counts as "table stakes" for an open source vector database with great developer experience in 2023; if these features are easy to implement, they should work out of the box in Chroma.
I think the following are both easy and high value.
These look like two feature requests, but I have put them in one issue because they are the two most common pain points of the read and write experience with Chroma right now, each of which either causes errors down the line or necessitates wrapper libraries.
Proposal 1: Easy/Default Chunking
Describe the problem
People are inadvisably throwing entire documents into embedding functions because they don't know about chunking. This leads to bad out-of-the-box results.
Each embedding function is known to work well for a range of chunk sizes (e.g. 128 to 1024 tokens).
Describe the proposed solution
Just like we offer sentence transformer embeddings out of the box, we could ship some basic chunking strategies that are either:
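Whatever form those strategies take, a minimal sketch of the general idea might be a naive fixed-size chunking helper layered on top of a Chroma collection; `chunk_text` and `add_document_chunked` below are hypothetical names, not part of the Chroma API:

```python
# Hypothetical sketch: split long documents into overlapping chunks before add().
import chromadb


def chunk_text(text: str, chunk_size: int = 1000, overlap: int = 100) -> list[str]:
    """Split text into overlapping character-based chunks."""
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        start += chunk_size - overlap
    return chunks


client = chromadb.Client()
collection = client.create_collection("docs")


def add_document_chunked(doc_id: str, text: str) -> None:
    """Add a long document as multiple chunks instead of one oversized entry."""
    chunks = chunk_text(text)
    collection.add(
        ids=[f"{doc_id}-{i}" for i in range(len(chunks))],
        documents=chunks,
        metadatas=[{"source_doc": doc_id, "chunk": i} for i in range(len(chunks))],
    )
```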
Proposal 2: Token limit based retrieval
Describe the problem
LLM APIs have limited context length (text-davinci-003: 4,097 tokens). Many people will happily query docs with n_results limits and then dump the results right into the prompt.
Describe the proposed solution
Of course, there's nuance in how you truncate and distribute these, but we win by making good default decisions for people.
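A minimal sketch of what token-budgeted retrieval could look like today as a client-side wrapper; `query_within_token_budget` is a hypothetical helper (not a Chroma API), and tiktoken is used only as an example tokenizer:

```python
# Hypothetical wrapper: query as usual, then keep only as many results as fit
# within a token budget intended for the downstream prompt.
import tiktoken

encoding = tiktoken.get_encoding("cl100k_base")


def query_within_token_budget(collection, query_text: str,
                              max_tokens: int = 3000, n_results: int = 20):
    """Return the top documents whose combined token count stays under max_tokens."""
    results = collection.query(query_texts=[query_text], n_results=n_results)
    selected, used = [], 0
    for doc in results["documents"][0]:
        cost = len(encoding.encode(doc))
        if used + cost > max_tokens:
            break  # could also truncate the last document instead of dropping it
        selected.append(doc)
        used += cost
    return selected
```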