This pull request includes significant updates to the chonkie package, primarily focusing on removing the dependency on tokenizers and enhancing the chunking and embeddings functionalities. The most important changes include the removal of the tokenizer from the chunkers, the addition of a new base embeddings class, and updates to the documentation and tests to reflect these changes.
Removal of Tokenizer Dependency:
src/chonkie/chunker/sdpm.py: Removed the tokenizer parameter from the SDPMChunker class and its initialization. Updated the class docstring and method arguments accordingly. [1] [2]
src/chonkie/chunker/semantic.py: Removed the tokenizer parameter from the SemanticChunker class and its initialization. Updated the class docstring and method arguments accordingly. [1] [2] [3] [4]
tests/chunker/test_sdpm_chunker.py: Removed the tokenizer fixture and updated the tests to reflect the removal of the tokenizer parameter. [1] [2] [3] [4] [5]
tests/chunker/test_semantic_chunker.py: Removed the tokenizer fixture and updated the tests to reflect the removal of the tokenizer parameter. [1] [2] [3]
Enhancements to Embeddings:
src/chonkie/embeddings/__init__.py: Added the BaseEmbeddings class to the module exports.
src/chonkie/embeddings/base.py: Introduced the BaseEmbeddings abstract base class to standardize the embedding functionality across different embedding implementations.
Documentation Updates:
DOCS.md: Updated the documentation to reflect the removal of the tokenizer parameter from the chunkers and added details about the max_chunk_size parameter for the SemanticChunker. [1] [2] [3]
Configuration Changes:
pyproject.toml: Updated the package configuration to include the new chonkie.embeddings module.
These changes streamline the chunking process by removing unnecessary dependencies and introduce a new abstraction for embeddings, making the codebase more modular and easier to maintain.
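For reviewers unfamiliar with the pattern, the following is a minimal sketch of what an abstract base class for embeddings can look like. It is not the code in src/chonkie/embeddings/base.py: the method names (embed, embed_batch), the dimension property, and the HashEmbeddings toy subclass are all assumptions made for illustration, and only the class name BaseEmbeddings comes from this description.

```python
from abc import ABC, abstractmethod
from typing import List
import hashlib


class BaseEmbeddings(ABC):
    """Illustrative interface only; the method names here are assumptions,
    not the actual API introduced by this pull request."""

    @property
    @abstractmethod
    def dimension(self) -> int:
        """Length of every vector returned by embed()."""

    @abstractmethod
    def embed(self, text: str) -> List[float]:
        """Return one embedding vector for a single text."""

    def embed_batch(self, texts: List[str]) -> List[List[float]]:
        """Default batch behavior: embed each text in turn.
        Subclasses may override this with a faster implementation."""
        return [self.embed(text) for text in texts]


class HashEmbeddings(BaseEmbeddings):
    """Hypothetical toy implementation: deterministic vectors derived from
    a SHA-256 digest. Carries no semantic meaning; for demonstration only."""

    def __init__(self, dimension: int = 8) -> None:
        # A SHA-256 digest has 32 bytes, so at most 32 components are available.
        if not 1 <= dimension <= 32:
            raise ValueError("dimension must be between 1 and 32")
        self._dimension = dimension

    @property
    def dimension(self) -> int:
        return self._dimension

    def embed(self, text: str) -> List[float]:
        digest = hashlib.sha256(text.encode("utf-8")).digest()
        # Scale each byte into the range [0.0, 1.0].
        return [digest[i] / 255.0 for i in range(self._dimension)]
```

With an interface of this shape, a chunker can accept any object that subclasses the base class and call embed or embed_batch on it without knowing which embedding backend is in use. That is the kind of modularity the summary above refers to.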