This pull request introduces several updates to improve the flexibility and robustness of the chunking system by supporting multiple tokenizer backends and refining the import mechanisms for external libraries. The most significant changes include adding a dynamic tokenizer loading mechanism, updating the initialization of various chunkers to accept different tokenizer types, and restructuring the import logic for external libraries like spaCy and sentence-transformers.
**Tokenizer Support Enhancements:**
- Added a `_load_tokenizer` method to dynamically load tokenizers from different libraries (`tiktoken`, `autotiktokenizer`, `tokenizers`, `transformers`) based on availability. (src/chonkie/chunker/base.py) [1][2][3]
- Introduced `_decode` and `_decode_batch` methods to handle token decoding for the different backends. (src/chonkie/chunker/base.py) A sketch of these helpers follows this list.
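A minimal sketch of how such availability-based loading and the decode helpers could fit together. The method names `_load_tokenizer`, `_decode`, and `_decode_batch` come from the PR, but the probing order, the `_tokenizer_backend` bookkeeping, and the omission of the `autotiktokenizer` branch are simplifications made here for illustration:

```python
import importlib.util
from typing import Any, List, Union


class BaseChunker:
    def _load_tokenizer(self, tokenizer: Union[str, Any]) -> Any:
        """Resolve a tokenizer name against whichever backend is installed.

        Non-string inputs are assumed to already be tokenizer objects and
        are returned unchanged. The probing order below is illustrative.
        """
        if not isinstance(tokenizer, str):
            self._tokenizer_backend = "custom"
            return tokenizer
        if importlib.util.find_spec("tiktoken") is not None:
            import tiktoken
            self._tokenizer_backend = "tiktoken"
            return tiktoken.get_encoding(tokenizer)
        if importlib.util.find_spec("tokenizers") is not None:
            from tokenizers import Tokenizer
            self._tokenizer_backend = "tokenizers"
            return Tokenizer.from_pretrained(tokenizer)
        if importlib.util.find_spec("transformers") is not None:
            from transformers import AutoTokenizer
            self._tokenizer_backend = "transformers"
            return AutoTokenizer.from_pretrained(tokenizer)
        raise ImportError(
            "No supported tokenizer backend found; install tiktoken, "
            "tokenizers, or transformers."
        )

    def _decode(self, tokens: List[int]) -> str:
        # self.tokenizer is expected to be set by the chunker's __init__
        # via _load_tokenizer.
        return self.tokenizer.decode(tokens)

    def _decode_batch(self, token_lists: List[List[int]]) -> List[str]:
        # tiktoken exposes a native batch decode; other backends fall back
        # to decoding one sequence at a time.
        if self._tokenizer_backend == "tiktoken":
            return self.tokenizer.decode_batch(token_lists)
        return [self._decode(tokens) for tokens in token_lists]
```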
**Chunker Initialization Updates:**
- Updated `SDPMChunker`, `SemanticChunker`, `SentenceChunker`, and `TokenChunker` to accept `Union[str, Any]` for the `tokenizer` parameter, allowing for more flexible tokenizer initialization (see the sketch below). (src/chonkie/chunker/sdpm.py) [1] (src/chonkie/chunker/semantic.py) [2][3] (src/chonkie/chunker/sentence.py) [4][5] (src/chonkie/chunker/token.py) [6][7]
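Continuing from the `BaseChunker` sketch above, one chunker's constructor might look roughly like this; the default values and the parameters other than `tokenizer` are placeholders rather than the PR's actual signature:

```python
from typing import Any, Union


class TokenChunker(BaseChunker):
    def __init__(
        self,
        tokenizer: Union[str, Any] = "gpt2",  # illustrative default
        chunk_size: int = 512,                # illustrative default
        chunk_overlap: int = 128,             # illustrative default
    ) -> None:
        # A string is resolved through the dynamic loader in BaseChunker;
        # an already constructed tokenizer object is passed straight through.
        self.tokenizer = self._load_tokenizer(tokenizer)
        self.chunk_size = chunk_size
        self.chunk_overlap = chunk_overlap


# Both call styles are now valid:
chunker_by_name = TokenChunker(tokenizer="gpt2")
# chunker_by_object = TokenChunker(tokenizer=tiktoken.get_encoding("gpt2"))
```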
**External Library Import Improvements:**
- Refactored the import logic for spaCy and sentence-transformers so these libraries are imported dynamically and only when needed, improving startup time and handling import errors gracefully (see the sketch below). (src/chonkie/chunker/semantic.py) [1][2] (src/chonkie/chunker/sentence.py) [3][4]
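A sketch of the lazy-import pattern this describes; the helper name, its signature, and the error message are assumptions, not the PR's exact code:

```python
import importlib
from types import ModuleType


def _lazy_import(module_name: str, pip_name: str) -> ModuleType:
    """Import an optional dependency only at the point of use, raising a
    helpful error message when it is missing."""
    try:
        return importlib.import_module(module_name)
    except ImportError as error:
        raise ImportError(
            f"{module_name} is required for this feature; install it with "
            f"`pip install {pip_name}`."
        ) from error


# Called only inside the code paths that actually need these libraries, e.g.:
# spacy = _lazy_import("spacy", "spacy")
# sentence_transformers = _lazy_import("sentence_transformers", "sentence-transformers")
```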
**Minor Adjustments:**
- Added a space to words in `_get_word_list_token_counts` to ensure consistent token splitting (see the sketch below). (src/chonkie/chunker/word.py)
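One plausible reading of this change, sketched below: prepending a space makes subword tokenizers count each word as it appears mid-sentence rather than at the start of a string. The standalone function and its `encode` parameter are illustrative only, not the PR's exact code:

```python
from typing import Callable, List


def get_word_list_token_counts(
    words: List[str], encode: Callable[[str], List[int]]
) -> List[int]:
    """Count tokens per word, prepending a space so BPE-style tokenizers see
    each word in its usual mid-sentence form."""
    return [len(encode(" " + word)) for word in words]
```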