benbrandt/text-splitter (text-splitter)
### [`v0.14.0`](https://togithub.com/benbrandt/text-splitter/blob/HEAD/CHANGELOG.md#v0140)
[Compare Source](https://togithub.com/benbrandt/text-splitter/compare/v0.13.3...v0.14.0)
##### What's New
**Performance fixes for large documents.** The worst-case performance for certain documents was abysmal, leading to documents [that ran forever](https://togithub.com/benbrandt/text-splitter/issues/184). This release ensures that, even in the worst case, the splitter no longer binary searches over the entire document, as it did before. That search was prohibitively expensive, especially for the tokenizer implementations; the search space should now always have a safe upper bound.
On the "happy path", this new approach also led to big speed gains in the `CodeSplitter` (50%+ speed increase in some cases), marginal regressions in the `MarkdownSplitter`, and not much difference in the `TextSplitter`. Overall, performance should be more consistent across documents, since previously it wasn't uncommon for a document with certain formatting to hit the worst-case scenario.
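To illustrate what bounding the search space can look like, here is a minimal sketch using exponential (galloping) search followed by a binary search inside the window it finds. This is not confirmed to be the crate's actual algorithm; the function `last_fitting`, its parameters, and the assumption that the size of a prefix never shrinks as the prefix grows are all hypothetical and chosen for illustration only.

```rust
/// Returns the index of the last candidate offset whose prefix of `text`
/// fits within `capacity`, or `None` if even the first candidate is too big.
///
/// Assumptions (not taken from the crate): `offsets` is sorted ascending,
/// every offset lies on a char boundary, and `size` is monotonic in the
/// prefix length.
fn last_fitting(
    text: &str,
    offsets: &[usize],
    capacity: usize,
    size: impl Fn(&str) -> usize,
) -> Option<usize> {
    let fits = |i: usize| size(&text[..offsets[i]]) <= capacity;
    if offsets.is_empty() || !fits(0) {
        return None;
    }

    // Gallop: grow the probe distance until a prefix no longer fits.
    // Invariant: the prefix ending at `offsets[lo]` fits.
    let mut lo = 0;
    let mut step = 1;
    let mut hi = loop {
        let probe = lo + step;
        if probe >= offsets.len() {
            break offsets.len();
        }
        if !fits(probe) {
            break probe;
        }
        lo = probe;
        step *= 2;
    };

    // Binary search only inside the window (lo, hi) found above,
    // instead of over every candidate in the document.
    while hi - lo > 1 {
        let mid = lo + (hi - lo) / 2;
        if fits(mid) {
            lo = mid;
        } else {
            hi = mid;
        }
    }
    Some(lo)
}
```

With this shape, the number of `size` calls depends on how far the fitting prefix extends rather than on the total document length, which matters most when `size` is an expensive tokenizer call.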
##### Breaking Changes
- Chunk output may be slightly different because of the changes to the search optimizations. The previous optimization occasionally caused the splitter to stop too soon. In most cases, you will see no difference. The effect was most pronounced in the `MarkdownSplitter` at very small sizes, and in any splitter using `RustTokenizers`, because of its offset behavior.
##### Rust
- `ChunkSize` has been removed. This was a holdover from a previous internal optimization, which turned out not to be very accurate anyway.
- This makes implementing a custom `ChunkSizer` much easier, as you now only need to return the size of the chunk as a `usize`. Previously, tokenization implementations often had to do extra work to calculate the size, which is no longer necessary.
##### Before
```rust
pub trait ChunkSizer {
    // Required method
    fn chunk_size(&self, chunk: &str, capacity: &ChunkCapacity) -> ChunkSize;
}
```
##### After
```rust
pub trait ChunkSizer {
    // Required method
    fn size(&self, chunk: &str) -> usize;
}
```
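As a rough sketch of what a custom sizer might look like against the new signature: the trait below is a local copy of the definition shown above so the snippet compiles on its own, and `WordCount` is a hypothetical type invented for illustration, not something the crate provides.

```rust
// Local copy of the trait as shown in the release notes, so this sketch
// is self-contained. In real code, import the trait from the crate instead.
pub trait ChunkSizer {
    // Required method
    fn size(&self, chunk: &str) -> usize;
}

/// Hypothetical sizer that measures a chunk by its number of
/// whitespace-separated words.
pub struct WordCount;

impl ChunkSizer for WordCount {
    fn size(&self, chunk: &str) -> usize {
        // Only a plain `usize` is returned; no capacity argument or
        // `ChunkSize` wrapper is involved anymore.
        chunk.split_whitespace().count()
    }
}
```

Code migrating from 0.13 would likely drop the `capacity` parameter and return the raw count where it previously built a `ChunkSize`.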
### Configuration
📅 Schedule: Branch creation - "after 1am every 3 weeks on Saturday" in timezone America/Los_Angeles, Automerge - At any time (no schedule defined).
🚦 Automerge: Disabled by config. Please merge this manually once you are satisfied.
♻ Rebasing: Whenever PR becomes conflicted, or you tick the rebase/retry checkbox.
🔕 Ignore: Close this PR and you won't be reminded about this update again.
[ ] If you want to rebase/retry this PR, check this box
This PR has been generated by Mend Renovate. View repository job log here.
This PR contains the following updates: 0.13 -> 0.14