This pull request introduces several updates to improve the flexibility and robustness of the chunking system by supporting multiple tokenizer backends and refining the import mechanisms for external libraries. The most significant changes include adding a dynamic tokenizer loading mechanism, updating the initialization of various chunkers to accept different tokenizer types, and restructuring the import logic for external libraries like spaCy and sentence-transformers.
**Tokenizer Support Enhancements:**
- Added a `_load_tokenizer` method to dynamically load tokenizers from different libraries (`tiktoken`, `autotiktokenizer`, `tokenizers`, `transformers`) based on availability. (src/chonkie/chunker/base.py) [1][2][3]
- Introduced `_decode` and `_decode_batch` methods to handle token decoding for the different backends. (src/chonkie/chunker/base.py) A sketch of these helpers follows this list.
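A minimal sketch of how such availability-based loading and the decode helpers could fit together. The method names `_load_tokenizer`, `_decode`, and `_decode_batch` come from the PR, but the probing order, the `_tokenizer_backend` bookkeeping, and the omission of the `autotiktokenizer` branch are simplifications made here for illustration:

```python
import importlib.util
from typing import Any, List, Union


class BaseChunker:
    def _load_tokenizer(self, tokenizer: Union[str, Any]) -> Any:
        """Resolve a tokenizer name against whichever backend is installed.

        Non-string inputs are assumed to already be tokenizer objects and
        are returned unchanged. The probing order below is illustrative.
        """
        if not isinstance(tokenizer, str):
            self._tokenizer_backend = "custom"
            return tokenizer
        if importlib.util.find_spec("tiktoken") is not None:
            import tiktoken
            self._tokenizer_backend = "tiktoken"
            return tiktoken.get_encoding(tokenizer)
        if importlib.util.find_spec("tokenizers") is not None:
            from tokenizers import Tokenizer
            self._tokenizer_backend = "tokenizers"
            return Tokenizer.from_pretrained(tokenizer)
        if importlib.util.find_spec("transformers") is not None:
            from transformers import AutoTokenizer
            self._tokenizer_backend = "transformers"
            return AutoTokenizer.from_pretrained(tokenizer)
        raise ImportError(
            "No supported tokenizer backend found; install tiktoken, "
            "tokenizers, or transformers."
        )

    def _decode(self, tokens: List[int]) -> str:
        # self.tokenizer is expected to be set by the chunker's __init__
        # via _load_tokenizer.
        return self.tokenizer.decode(tokens)

    def _decode_batch(self, token_lists: List[List[int]]) -> List[str]:
        # tiktoken exposes a native batch decode; other backends fall back
        # to decoding one sequence at a time.
        if self._tokenizer_backend == "tiktoken":
            return self.tokenizer.decode_batch(token_lists)
        return [self._decode(tokens) for tokens in token_lists]
```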
**Chunker Initialization Updates:**
- Updated `SDPMChunker`, `SemanticChunker`, `SentenceChunker`, and `TokenChunker` to accept `Union[str, Any]` for the `tokenizer` parameter, allowing for more flexible tokenizer initialization (see the sketch below). (src/chonkie/chunker/sdpm.py) [1] (src/chonkie/chunker/semantic.py) [2][3] (src/chonkie/chunker/sentence.py) [4][5] (src/chonkie/chunker/token.py) [6][7]
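Continuing from the `BaseChunker` sketch above, one chunker's constructor might look roughly like this; the default values and the parameters other than `tokenizer` are placeholders rather than the PR's actual signature:

```python
from typing import Any, Union


class TokenChunker(BaseChunker):
    def __init__(
        self,
        tokenizer: Union[str, Any] = "gpt2",  # illustrative default
        chunk_size: int = 512,                # illustrative default
        chunk_overlap: int = 128,             # illustrative default
    ) -> None:
        # A string is resolved through the dynamic loader in BaseChunker;
        # an already constructed tokenizer object is passed straight through.
        self.tokenizer = self._load_tokenizer(tokenizer)
        self.chunk_size = chunk_size
        self.chunk_overlap = chunk_overlap


# Both call styles are now valid:
chunker_by_name = TokenChunker(tokenizer="gpt2")
# chunker_by_object = TokenChunker(tokenizer=tiktoken.get_encoding("gpt2"))
```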
**External Library Import Improvements:**
- Refactored the import logic for spaCy and sentence-transformers so these libraries are imported dynamically and only when needed, improving startup time and handling import errors gracefully (see the sketch below). (src/chonkie/chunker/semantic.py) [1][2] (src/chonkie/chunker/sentence.py) [3][4]
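A sketch of the lazy-import pattern this describes; the helper name, its signature, and the error message are assumptions, not the PR's exact code:

```python
import importlib
from types import ModuleType


def _lazy_import(module_name: str, pip_name: str) -> ModuleType:
    """Import an optional dependency only at the point of use, raising a
    helpful error message when it is missing."""
    try:
        return importlib.import_module(module_name)
    except ImportError as error:
        raise ImportError(
            f"{module_name} is required for this feature; install it with "
            f"`pip install {pip_name}`."
        ) from error


# Called only inside the code paths that actually need these libraries, e.g.:
# spacy = _lazy_import("spacy", "spacy")
# sentence_transformers = _lazy_import("sentence_transformers", "sentence-transformers")
```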
**Minor Adjustments:**
- Added a space to words in `_get_word_list_token_counts` to ensure consistent token splitting (see the sketch below). (src/chonkie/chunker/word.py)
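One plausible reading of this change, sketched below: prepending a space makes subword tokenizers count each word as it appears mid-sentence rather than at the start of a string. The standalone function and its `encode` parameter are illustrative only, not the PR's exact code:

```python
from typing import Callable, List


def get_word_list_token_counts(
    words: List[str], encode: Callable[[str], List[int]]
) -> List[int]:
    """Count tokens per word, prepending a space so BPE-style tokenizers see
    each word in its usual mid-sentence form."""
    return [len(encode(" " + word)) for word in words]
```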