bhavnicksm / chonkie

🦛 CHONK your texts with Chonkie ✨ - The no-nonsense RAG chunking library
https://pypi.org/project/chonkie/
MIT License

Add FastEmbed Support for Embedding Generation/Inference #19

Open · adithya-s-k opened 2 days ago

adithya-s-k commented 2 days ago

Currently, Chonkie uses sentence-transformers for generating embeddings in semantic chunking. While this works well, FastEmbed offers several advantages that could enhance Chonkie's capabilities:

  1. Broader Model Support: FastEmbed supports more embedding models out of the box (basic usage is sketched after this list)
  2. Better Performance: FastEmbed runs quantized ONNX models via ONNX Runtime for faster embedding generation
  3. Additional Features: Automatic batching, caching, and GPU support come built-in
  4. Lightweight: FastEmbed has minimal dependencies and is optimized for production use
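
For context, basic embedding generation with FastEmbed looks roughly like this (a minimal sketch based on fastembed's TextEmbedding API in recent releases; the model name is just an example):

from fastembed import TextEmbedding

# Downloads and caches the ONNX model weights on first use
model = TextEmbedding(model_name="BAAI/bge-small-en-v1.5")

texts = ["Chonkie chunks text.", "FastEmbed embeds it."]
# embed() returns a generator of numpy arrays, one vector per input text
embeddings = list(model.embed(texts))
print(len(embeddings), embeddings[0].shape)  # 2 (384,)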

Proposed Changes

Implementation Details

The changes will:

  1. Create an EmbeddingProvider abstract class
  2. Implement concrete providers for both sentence-transformers and FastEmbed (a rough sketch follows this list)
  3. Update chunker classes to use the provider abstraction
  4. Add installation option: pip install chonkie[fastembed]
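
As a rough illustration of items 1-3, the provider abstraction could look something like this (hypothetical names, not the final design; backend imports are deferred so each dependency stays optional):

from abc import ABC, abstractmethod
from typing import List


class EmbeddingProvider(ABC):
    """Hypothetical common interface over embedding backends."""

    @abstractmethod
    def embed(self, texts: List[str]):
        """Return one embedding vector per input text."""
        raise NotImplementedError


class SentenceTransformerProvider(EmbeddingProvider):
    def __init__(self, model_name: str):
        # Import inside __init__ so the dependency stays optional
        from sentence_transformers import SentenceTransformer
        self.model = SentenceTransformer(model_name)

    def embed(self, texts: List[str]):
        return list(self.model.encode(texts))


class FastEmbedProvider(EmbeddingProvider):
    def __init__(self, model_name: str):
        from fastembed import TextEmbedding
        self.model = TextEmbedding(model_name=model_name)

    def embed(self, texts: List[str]):
        return list(self.model.embed(texts))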

Benefits

Questions

Tasks

bhavnicksm commented 2 days ago

Hey @adithya-s-k ~!

Thanks for submitting the issue 😄

Here are some of my comments (in no particular order):

adithya-s-k commented 18 hours ago

Hey @bhavnicksm ~! Thanks for the detailed feedback! Here’s my take on the points you raised:

  1. Embedding Provider Structure: Agreed, creating a structured setup with a sub-directory named embedding makes sense. The BaseEmbedding class could define an encode method and store the tokenizer if applicable. This approach would allow us to keep the logic modular, making it easy to extend support for other embedding models in the future.

  2. Benchmarks for FastEmbed: I’ll run some initial benchmarks comparing FastEmbed against an optimized SentenceTransformer, focusing on single-instance encoding to get a pure performance view. FastEmbed does seem broadly applicable, but I’ll review any specific limitations (e.g., model restrictions or API specifics) that might impact compatibility with LateChunking.

  3. Flexible Input Support: Allowing SemanticChunker to accept either a BaseEmbeddings instance or a plain string identifier sounds practical (see the sketch after this list). This will provide flexibility for on-the-fly instantiation, which is especially useful when users switch between embedding providers dynamically.

  4. FastEmbed as Default: We can decide on making FastEmbed the default once we verify its performance and compatibility for token encoding, as you mentioned. If it delivers the token-encoding functionality we need for LateChunking, it could be a strong candidate.
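
For illustration, point 3 could look roughly like this (a sketch only; FastEmbedEmbeddings is a hypothetical class name):

from typing import Union


class SemanticChunker:
    """Sketch of flexible input support; body trimmed to the relevant part."""

    def __init__(self, embedding_model: Union[str, "BaseEmbeddings"]):
        if isinstance(embedding_model, str):
            # Hypothetical on-the-fly instantiation from a string identifier;
            # a small registry could route names to the right backend here
            embedding_model = FastEmbedEmbeddings(embedding_model)  # hypothetical class
        self.embeddings = embedding_model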

I’ll move ahead with the structure you proposed and keep you updated on benchmarks and potential limitations. Let me know if there are any other considerations I should factor in as I start on this!

bhavnicksm commented 18 hours ago

@adithya-s-k,

Yes! LGTM!

I have an initial BaseEmbeddings in mind, which I plan to add in the next couple of hours, assuming it doesn't need further refinement.

This is roughly what it would be like:

from abc import ABC, abstractmethod
from typing import List, Union


class BaseEmbeddings(ABC):
    """Base class for all embedding implementations"""

    @abstractmethod
    def encode(self, texts: Union[str, List[str]]):
        """Encode text into embeddings"""
        raise NotImplementedError

    @abstractmethod
    def get_token_count(self, text: Union[str, List[str]]):
        """Get token count for text"""
        raise NotImplementedError

    @property
    @abstractmethod
    def dimension(self) -> int:
        """Get embedding dimension"""
        raise NotImplementedError

    @classmethod
    def is_available(cls) -> bool:
        """
        Check if this embeddings implementation is available (dependencies installed).
        Override this method to add custom dependency checks.
        """
        return True

I added get_token_count instead of a tokenizer function for now. I'm still deciding between tokenizer and token_counter, since most chunking methods beyond TokenChunker don't use the tokenizer for encoding or decoding, only for counting tokens.

Plus, the token_counter namespace has the added benefit of letting users plug in arbitrary functions of the form def custom_token_counter(text: str) -> int as a replacement for full tokenizers.
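
For example, any callable of that shape would do (illustrative only):

def custom_token_counter(text: str) -> int:
    # Crude whitespace approximation, just to show the contract:
    # a token_counter only needs to map str -> int
    return len(text.split())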

What are your thoughts on the BaseEmbeddings? Do you think we need to add or remove something here?
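
As a rough illustration (not the actual implementation), a FastEmbed-backed subclass of this interface might look like the following; it assumes fastembed's TextEmbedding API and fetches a matching tokenizer.json via the tokenizers package for counting, and the default model with its 384-dim output is just an example:

from typing import List, Union


class FastEmbedEmbeddings(BaseEmbeddings):
    """Hypothetical FastEmbed implementation of BaseEmbeddings."""

    def __init__(self, model_name: str = "BAAI/bge-small-en-v1.5"):
        from fastembed import TextEmbedding
        from tokenizers import Tokenizer
        self.model = TextEmbedding(model_name=model_name)
        # Assumes the model name is also a HF repo with a tokenizer.json
        self.tokenizer = Tokenizer.from_pretrained(model_name)
        self._dimension = 384  # bge-small-en-v1.5 produces 384-dim vectors

    def encode(self, texts: Union[str, List[str]]):
        if isinstance(texts, str):
            texts = [texts]
        # embed() yields numpy arrays lazily; materialize for convenience
        return list(self.model.embed(texts))

    def get_token_count(self, text: Union[str, List[str]]):
        if isinstance(text, str):
            return len(self.tokenizer.encode(text).ids)
        return [len(enc.ids) for enc in self.tokenizer.encode_batch(text)]

    @property
    def dimension(self) -> int:
        return self._dimension

    @classmethod
    def is_available(cls) -> bool:
        # Both optional dependencies must be importable
        import importlib.util
        return (importlib.util.find_spec("fastembed") is not None
                and importlib.util.find_spec("tokenizers") is not None)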

bhavnicksm commented 15 hours ago

Hey @adithya-s-k!

Added BaseEmbeddings in #24, have a look!