Open adithya-s-k opened 2 days ago
Hey @adithya-s-k ~!
Thanks for submitting the issue 😄
Here are some of my comments (in no particular order):
- Have a sub-directory named `embedding` and a `BaseEmbedding` class that would implement the `encode` function. The `BaseEmbedding` should also store the `tokenizer` for the embedding -- which would be useful when we have embeddings that don't come with their own tokenizers. It should store the `token-encodings` as well, since we need those for `LateChunking`, which we would be adding in the future.

Hey @bhavnicksm ~! Thanks for the detailed feedback! Here's my take on the points you raised:
**Embedding Provider Structure:** Agreed, creating a structured setup with a sub-directory named `embedding` makes sense. The `BaseEmbedding` class could define an `encode` method and store the tokenizer if applicable. This approach would keep the logic modular, making it easy to extend support for other embedding models in the future.
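For concreteness, the sub-directory could be laid out roughly like this (the file names are illustrative, not from the thread):

```
chonkie/
└── embeddings/
    ├── __init__.py
    ├── base.py                   # BaseEmbedding abstract class
    └── sentence_transformer.py   # first concrete implementation
```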
**Benchmarks for FastEmbed:** I'll run some initial benchmarks comparing FastEmbed against an optimized SentenceTransformer, focusing on single-instance encoding to get a pure performance view. FastEmbed does seem broadly applicable, but I'll review any specific limitations (e.g., model restrictions or API specifics) that might impact compatibility with `LateChunking`.
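Not part of the thread itself, but a minimal stdlib-only harness for that single-instance comparison could look like the sketch below; `dummy_encode` is a placeholder for either backend's encode call:

```python
import time
from statistics import median

def bench(encode, texts, repeats=5):
    """Return the median wall-clock seconds for one encode(texts) call."""
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        encode(texts)
        samples.append(time.perf_counter() - start)
    return median(samples)

# Stand-in encoder; a real benchmark would pass the
# SentenceTransformer or FastEmbed encode function here.
def dummy_encode(texts):
    return [[float(len(t))] for t in texts]
```

Using the median rather than the mean keeps one-off warm-up or GC pauses from skewing the comparison.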
**Flexible Input Support:** Allowing `SemanticChunkers` to accept `BaseEmbeddings` instances, or to simply initialize with a string identifier, sounds practical. This will provide flexibility for on-the-fly instantiation, especially beneficial when users dynamically switch between different embedding providers.
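One common way to support both inputs is a small registry plus a resolver; this is a hypothetical sketch (none of these names are from the thread), not Chonkie's actual API:

```python
from typing import Callable, Dict, Union

class BaseEmbeddings:
    """Stand-in for the real base class."""

# Hypothetical registry mapping string identifiers to factories.
EMBEDDING_REGISTRY: Dict[str, Callable[[], BaseEmbeddings]] = {}

def register(name: str):
    """Class decorator that records a backend under a string identifier."""
    def wrap(cls):
        EMBEDDING_REGISTRY[name] = cls
        return cls
    return wrap

@register("dummy")
class DummyEmbeddings(BaseEmbeddings):
    pass

def resolve_embeddings(model: Union[str, BaseEmbeddings]) -> BaseEmbeddings:
    """Accept either a ready-made instance or a string identifier."""
    if isinstance(model, BaseEmbeddings):
        return model
    if isinstance(model, str):
        return EMBEDDING_REGISTRY[model]()
    raise TypeError(f"Unsupported embeddings spec: {model!r}")
```

A chunker's `__init__` would then call `resolve_embeddings` once, so the rest of its code only ever deals with `BaseEmbeddings` instances.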
**FastEmbed as Default:** We can decide on making FastEmbed the default once we verify its performance and compatibility for token encoding, as you mentioned. If it delivers the token-encoding functionality we need for `LateChunking`, it could be a strong candidate.
I’ll move ahead with the structure you proposed and keep you updated on benchmarks and potential limitations. Let me know if there are any other considerations I should factor in as I start on this!
@adithya-s-k,
Yes! LGTM!
I have an initial `BaseEmbedding` in mind, which I plan to add in the next couple of hours, if it doesn't need any refinement.
This is roughly what it would be like:
```python
from abc import ABC, abstractmethod
from typing import List, Union


class BaseEmbeddings(ABC):
    """Base class for all embedding implementations"""

    @abstractmethod
    def encode(self, texts: Union[str, List[str]]):
        """Encode text into embeddings"""
        raise NotImplementedError

    @abstractmethod
    def get_token_count(self, text: Union[str, List[str]]):
        """Get token count for text"""
        raise NotImplementedError

    @property
    @abstractmethod
    def dimension(self) -> int:
        """Get embedding dimension"""
        raise NotImplementedError

    @classmethod
    def is_available(cls) -> bool:
        """
        Check if this embeddings implementation is available (dependencies installed).
        Override this method to add custom dependency checks.
        """
        return True
```
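To make the contract concrete, here is a toy backend satisfying every member of the interface above using only the standard library. It is shown standalone for brevity (in practice it would subclass `BaseEmbeddings`), and the class name and vector scheme are made up for illustration:

```python
from typing import List, Union

class CharCountEmbeddings:
    """Illustrative backend: fixed-size character-frequency vectors."""

    @property
    def dimension(self) -> int:
        return 26  # one slot per letter a-z

    def encode(self, texts: Union[str, List[str]]) -> List[List[float]]:
        if isinstance(texts, str):
            texts = [texts]
        vectors = []
        for text in texts:
            vec = [0.0] * self.dimension
            for ch in text.lower():
                if "a" <= ch <= "z":
                    vec[ord(ch) - ord("a")] += 1.0
            vectors.append(vec)
        return vectors

    def get_token_count(self, text: Union[str, List[str]]) -> int:
        if isinstance(text, str):
            text = [text]
        return sum(len(t.split()) for t in text)

    @classmethod
    def is_available(cls) -> bool:
        return True  # no third-party dependencies
```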
Added `get_token_count` instead of a tokenizer function for now; I'm still considering whether I should go with `tokenizer` or `token_counters`, since most chunking methods beyond `TokenChunker` do not use the tokenizer for encoding or decoding but only to count tokens.
Plus, the `token_counter` namespace has the added benefit of allowing the user to supply arbitrary functions of the type `def custom_token_counter(text: str) -> int` as a replacement for tokenizers.
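The token-counter idea amounts to accepting any `Callable[[str], int]`. A quick sketch (the helper names here are made up, not Chonkie's API):

```python
from typing import Callable

TokenCounter = Callable[[str], int]

def whitespace_token_counter(text: str) -> int:
    """Crude stand-in tokenizer: counts whitespace-separated words."""
    return len(text.split())

def chunk_fits(text: str, max_tokens: int, counter: TokenCounter) -> bool:
    """Hypothetical helper: check a chunk against a token budget
    using whatever counter the user supplied."""
    return counter(text) <= max_tokens
```

Because only a count is needed, a chunker can stay agnostic about whether the callable wraps a real tokenizer or a cheap heuristic like the one above.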
What are your thoughts on the `BaseEmbeddings`? Do you think we need to add or remove something here?
Hey @adithya-s-k!
Added `BaseEmbeddings` in #24, have a look!
Currently, Chonkie uses sentence-transformers for generating embeddings in semantic chunking. While this works well, FastEmbed offers several advantages that could enhance Chonkie's capabilities:
**Proposed Changes**

**Implementation Details**

The changes will:

```
pip install chonkie[fastembed]
```

**Benefits**

**Questions**

**Tasks**