Abraxas-365 / langchain-rust

🦜️🔗LangChain for Rust, the easiest way to write LLM-based programs in Rust
MIT License
491 stars 63 forks

fix(deps): update rust crate text-splitter to 0.14 #173

Closed renovate[bot] closed 2 months ago

renovate[bot] commented 2 months ago

This PR contains the following updates:

| Package | Type | Update | Change |
|---|---|---|---|
| text-splitter | dependencies | minor | `0.13` -> `0.14` |

Release Notes

benbrandt/text-splitter (text-splitter)

### [`v0.14.0`](https://togithub.com/benbrandt/text-splitter/blob/HEAD/CHANGELOG.md#v0140)

[Compare Source](https://togithub.com/benbrandt/text-splitter/compare/v0.13.3...v0.14.0)

##### What's New

**Performance fixes for large documents.** The worst-case performance for certain documents was abysmal, leading to documents [that ran forever](https://togithub.com/benbrandt/text-splitter/issues/184). This release makes sure that in the worst case, the splitter won't be binary searching over the entire document, which it was before. This is prohibitively expensive especially for the tokenizer implementations, and now this should always have a safe upper bound to the search space.

For the "happy path", this new approach also led to big speed gains in the `CodeSplitter` (50%+ speed increase in some cases), marginal regressions in the `MarkdownSplitter`, and not much difference in the `TextSplitter`. But overall, the performance should be more consistent across documents, since it wasn't uncommon for a document with certain formatting to hit the worst-case scenario previously.

##### Breaking Changes

- Chunk output may be slightly different because of the changes to the search optimizations. The previous optimization occasionally caused the splitter to stop too soon. For most cases, you may see no difference. It was most pronounced in the `MarkdownSplitter` at very small sizes, and any splitter using `RustTokenizers` because of its offset behavior.

##### Rust

- `ChunkSize` has been removed. This was a holdover from a previous internal optimization, which turned out to not be very accurate anyway.
  - This makes implementing a custom `ChunkSizer` much easier, as you now only need to generate the size of the chunk as a `usize`. It often required in tokenization implementations to do more work to calculate the size as well, which is no longer necessary.

##### Before

```rust
pub trait ChunkSizer {
    // Required method
    fn chunk_size(&self, chunk: &str, capacity: &ChunkCapacity) -> ChunkSize;
}
```

##### After

```rust
pub trait ChunkSizer {
    // Required method
    fn size(&self, chunk: &str) -> usize;
}
```
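To illustrate what the simplified trait means for a custom sizer, here is a minimal sketch. It assumes only the `After` signature quoted in the release notes. The trait is redeclared locally so the snippet compiles without the crate, and `WordCount` and `fits` are hypothetical names invented for this example, not part of `text-splitter`.

```rust
/// Local mirror of the v0.14 trait shape quoted in the release notes,
/// redeclared so this sketch compiles on its own. In real code you would
/// import `ChunkSizer` from the `text_splitter` crate instead.
pub trait ChunkSizer {
    // Required method
    fn size(&self, chunk: &str) -> usize;
}

/// Hypothetical sizer: measures a chunk in whitespace-separated words.
pub struct WordCount;

impl ChunkSizer for WordCount {
    fn size(&self, chunk: &str) -> usize {
        // Only a plain `usize` is returned; no `ChunkCapacity` argument
        // and no `ChunkSize` wrapper are involved any more.
        chunk.split_whitespace().count()
    }
}

/// Hypothetical helper: checks whether `chunk` fits within `max` units
/// as measured by the given sizer.
pub fn fits<S: ChunkSizer>(sizer: &S, chunk: &str, max: usize) -> bool {
    sizer.size(chunk) <= max
}
```

Under the old signature, the same sizer would also have had to accept a `&ChunkCapacity` and build a `ChunkSize` value, so migrating a custom implementation mostly means deleting that extra bookkeeping.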

Configuration

📅 Schedule: Branch creation - "after 1am every 3 weeks on Saturday" in timezone America/Los_Angeles, Automerge - At any time (no schedule defined).

🚦 Automerge: Disabled by config. Please merge this manually once you are satisfied.

Rebasing: Whenever PR becomes conflicted, or you tick the rebase/retry checkbox.

🔕 Ignore: Close this PR and you won't be reminded about this update again.



This PR has been generated by Mend Renovate.

benbrandt commented 2 months ago

@Abraxas-365 there is a new `CodeSplitter` available now, if you want to integrate it.

prabirshrestha commented 2 months ago

Merged. Thanks