This pull request makes the chunking and embedding models in the chonkie package more flexible. The main changes: the BaseChunker class now supports token counters, the SemanticChunker uses the new embedding model interface, and the tests are updated to reflect these changes.
Enhancements to BaseChunker:
src/chonkie/chunker/base.py: Updated the BaseChunker class to accept a callable tokenizer or token counter, added methods to count tokens and batch count tokens, and adjusted the initialization logic to handle different types of tokenizers. [1][2][3][4]
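To illustrate the idea of accepting either a tokenizer or a callable token counter, here is a minimal sketch. It is not the code from this PR: the class name TokenCounterSketch and the method names count_tokens and count_tokens_batch are hypothetical, and the sketch assumes a tokenizer is any object with an encode(text) method that returns a sequence of tokens.

```python
from typing import Any, Callable, List, Union


class TokenCounterSketch:
    """Hypothetical stand-in for illustration; not chonkie's BaseChunker."""

    def __init__(
        self, tokenizer_or_token_counter: Union[Callable[[str], int], Any]
    ) -> None:
        candidate = tokenizer_or_token_counter
        # Check for encode() first: some tokenizer objects are also callable,
        # so testing callable() first would misclassify them as counters.
        if hasattr(candidate, "encode"):
            self._count = lambda text: len(candidate.encode(text))
        elif callable(candidate):
            # A plain function that maps text to a token count.
            self._count = candidate
        else:
            raise TypeError(
                "Expected an object with encode() or a callable token counter"
            )

    def count_tokens(self, text: str) -> int:
        """Return the number of tokens in a single text."""
        return self._count(text)

    def count_tokens_batch(self, texts: List[str]) -> List[int]:
        """Return the token count for each text, in input order."""
        return [self._count(text) for text in texts]
```

With this shape, a whitespace counter such as `lambda text: len(text.split())` can be passed in place of a full tokenizer.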
Improvements to SemanticChunker:
src/chonkie/chunker/semantic.py: Modified the SemanticChunker to use the new BaseEmbeddings interface, removed redundant import statements, and updated the initialization to use AutoEmbeddings for loading embedding models. [1][2][3][4]
Updates to embedding models:
src/chonkie/embeddings/auto.py: Enhanced the AutoEmbeddings class to support different types of embedding models and updated the get_embeddings method to handle various model types. [1][2][3]
src/chonkie/embeddings/base.py: Added a method to get the tokenizer or token counter and implemented cosine similarity in the BaseEmbeddings class. [1][2][3]
src/chonkie/embeddings/sentence_transformer.py: Implemented the get_tokenizer_or_token_counter method in the SentenceTransformerEmbeddings class.
Test updates:
tests/chunker/test_sdpm_chunker.py and tests/chunker/test_semantic_chunker.py: Updated tests to use SentenceTransformerEmbeddings instead of SentenceTransformer. [1][2][3][4]
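For reference, the cosine similarity mentioned for BaseEmbeddings can be sketched in plain Python as below. This is an assumption about the general technique only; the function name cosine_similarity and its signature are illustrative, and the implementation in the PR may differ (for example, it may operate on arrays rather than Python sequences).

```python
import math
from typing import Sequence


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length, non-zero vectors."""
    if len(u) != len(v):
        raise ValueError("Vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        # The angle is undefined for a zero vector.
        raise ValueError("Cosine similarity is undefined for zero vectors")
    return dot / (norm_u * norm_v)
```

Values near 1 indicate embeddings pointing in nearly the same direction, which is the signal a semantic chunker can use to decide whether adjacent sentences belong together.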