bhavnicksm / chonkie

🦛 CHONK your texts with Chonkie ✨ - The no-nonsense RAG chunking library
https://pypi.org/project/chonkie/
MIT License

Add FastEmbed Support for Embedding Generation/Inference #19

Open · adithya-s-k opened 2 days ago

adithya-s-k commented 2 days ago

Currently, Chonkie uses sentence-transformers for generating embeddings in semantic chunking. While this works well, FastEmbed offers several advantages that could enhance Chonkie's capabilities:

  1. Broader Model Support: FastEmbed supports more embedding models out of the box (basic usage is sketched after this list)
  2. Better Performance: FastEmbed runs quantized ONNX models via ONNX Runtime for faster embedding generation
  3. Additional Features: Automatic batching, caching, and GPU support come built-in
  4. Lightweight: FastEmbed has minimal dependencies and is optimized for production use
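
For context, basic embedding generation with FastEmbed looks roughly like this (a minimal sketch based on fastembed's TextEmbedding API in recent releases; the model name is just an example):

from fastembed import TextEmbedding

# Downloads and caches the ONNX model weights on first use
model = TextEmbedding(model_name="BAAI/bge-small-en-v1.5")

texts = ["Chonkie chunks text.", "FastEmbed embeds it."]
# embed() returns a generator of numpy arrays, one vector per input text
embeddings = list(model.embed(texts))
print(len(embeddings), embeddings[0].shape)  # 2 (384,)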

Proposed Changes

Implementation Details

The changes will:

  1. Create an EmbeddingProvider abstract class
  2. Implement concrete providers for both sentence-transformers and FastEmbed (a rough sketch follows this list)
  3. Update chunker classes to use the provider abstraction
  4. Add installation option: pip install chonkie[fastembed]
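
As a rough illustration of items 1-3, the provider abstraction could look something like this (hypothetical names, not the final design; backend imports are deferred so each dependency stays optional):

from abc import ABC, abstractmethod
from typing import List


class EmbeddingProvider(ABC):
    """Hypothetical common interface over embedding backends."""

    @abstractmethod
    def embed(self, texts: List[str]):
        """Return one embedding vector per input text."""
        raise NotImplementedError


class SentenceTransformerProvider(EmbeddingProvider):
    def __init__(self, model_name: str):
        # Import inside __init__ so the dependency stays optional
        from sentence_transformers import SentenceTransformer
        self.model = SentenceTransformer(model_name)

    def embed(self, texts: List[str]):
        return list(self.model.encode(texts))


class FastEmbedProvider(EmbeddingProvider):
    def __init__(self, model_name: str):
        from fastembed import TextEmbedding
        self.model = TextEmbedding(model_name=model_name)

    def embed(self, texts: List[str]):
        return list(self.model.embed(texts))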

Benefits

Questions

Tasks

bhavnicksm commented 2 days ago

Hey @adithya-s-k ~!

Thanks for submitting the issue 😄

Here are some of my comments (in no particular order):

adithya-s-k commented 18 hours ago

Hey @bhavnicksm ~! Thanks for the detailed feedback! Here’s my take on the points you raised:

  1. Embedding Provider Structure: Agreed, creating a structured setup with a sub-directory named embedding makes sense. The BaseEmbedding class could define an encode method and store the tokenizer if applicable. This approach would allow us to keep the logic modular, making it easy to extend support for other embedding models in the future.

  2. Benchmarks for FastEmbed: I’ll run some initial benchmarks comparing FastEmbed against an optimized SentenceTransformer, focusing on single-instance encoding to get a pure performance view. FastEmbed does seem broadly applicable, but I’ll review any specific limitations (e.g., model restrictions or API specifics) that might impact compatibility with LateChunking.

  3. Flexible Input Support: Allowing SemanticChunker to accept either a BaseEmbeddings instance or a plain string identifier sounds practical (see the sketch after this list). This will provide flexibility for on-the-fly instantiation, which is especially useful when users switch between embedding providers dynamically.

  4. FastEmbed as Default: We can decide on making FastEmbed the default once we verify its performance and compatibility for token encoding, as you mentioned. If it delivers the token-encoding functionality we need for LateChunking, it could be a strong candidate.
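
For illustration, point 3 could look roughly like this (a sketch only; FastEmbedEmbeddings is a hypothetical class name):

from typing import Union


class SemanticChunker:
    """Sketch of flexible input support; body trimmed to the relevant part."""

    def __init__(self, embedding_model: Union[str, "BaseEmbeddings"]):
        if isinstance(embedding_model, str):
            # Hypothetical on-the-fly instantiation from a string identifier;
            # a small registry could route names to the right backend here
            embedding_model = FastEmbedEmbeddings(embedding_model)  # hypothetical class
        self.embeddings = embedding_model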

I’ll move ahead with the structure you proposed and keep you updated on benchmarks and potential limitations. Let me know if there are any other considerations I should factor in as I start on this!

bhavnicksm commented 18 hours ago

@adithya-s-k,

Yes! LGTM!

I have an initial BaseEmbeddings in mind, which I plan to add in the next couple of hours, assuming it doesn't need further refinement.

This is roughly what it would be like:

from abc import ABC, abstractmethod
from typing import List, Union


class BaseEmbeddings(ABC):
    """Base class for all embedding implementations"""

    @abstractmethod
    def encode(self, texts: Union[str, List[str]]):
        """Encode text into embeddings"""
        raise NotImplementedError

    @abstractmethod
    def get_token_count(self, text: Union[str, List[str]]):
        """Get token count for text"""
        raise NotImplementedError

    @property
    @abstractmethod
    def dimension(self) -> int:
        """Get embedding dimension"""
        raise NotImplementedError

    @classmethod
    def is_available(cls) -> bool:
        """
        Check if this embeddings implementation is available (dependencies installed).
        Override this method to add custom dependency checks.
        """
        return True

I added get_token_count instead of a tokenizer function for now. I'm still deciding between tokenizer and token_counter, since most chunking methods beyond TokenChunker don't use the tokenizer for encoding or decoding, only for counting tokens.

Plus, the token_counter namespace has the added benefit of letting users plug in arbitrary functions of the form def custom_token_counter(text: str) -> int as a replacement for full tokenizers.
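
For example, any callable of that shape would do (illustrative only):

def custom_token_counter(text: str) -> int:
    # Crude whitespace approximation, just to show the contract:
    # a token_counter only needs to map str -> int
    return len(text.split())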

What are your thoughts on the BaseEmbeddings? Do you think we need to add or remove something here?
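
As a rough illustration (not the actual implementation), a FastEmbed-backed subclass of this interface might look like the following; it assumes fastembed's TextEmbedding API and fetches a matching tokenizer.json via the tokenizers package for counting, and the default model with its 384-dim output is just an example:

from typing import List, Union


class FastEmbedEmbeddings(BaseEmbeddings):
    """Hypothetical FastEmbed implementation of BaseEmbeddings."""

    def __init__(self, model_name: str = "BAAI/bge-small-en-v1.5"):
        from fastembed import TextEmbedding
        from tokenizers import Tokenizer
        self.model = TextEmbedding(model_name=model_name)
        # Assumes the model name is also a HF repo with a tokenizer.json
        self.tokenizer = Tokenizer.from_pretrained(model_name)
        self._dimension = 384  # bge-small-en-v1.5 produces 384-dim vectors

    def encode(self, texts: Union[str, List[str]]):
        if isinstance(texts, str):
            texts = [texts]
        # embed() yields numpy arrays lazily; materialize for convenience
        return list(self.model.embed(texts))

    def get_token_count(self, text: Union[str, List[str]]):
        if isinstance(text, str):
            return len(self.tokenizer.encode(text).ids)
        return [len(enc.ids) for enc in self.tokenizer.encode_batch(text)]

    @property
    def dimension(self) -> int:
        return self._dimension

    @classmethod
    def is_available(cls) -> bool:
        # Both optional dependencies must be importable
        import importlib.util
        return (importlib.util.find_spec("fastembed") is not None
                and importlib.util.find_spec("tokenizers") is not None)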

bhavnicksm commented 15 hours ago

Hey @adithya-s-k!

Added BaseEmbeddings in #24, have a look!