Abraxas-365 / langchain-rust

πŸ¦œοΈπŸ”—LangChain for Rust, the easiest way to write LLM-based programs in Rust
MIT License

fix(deps): update rust crate text-splitter to 0.13 #152

Closed · renovate[bot] closed this 3 months ago

renovate[bot] commented 3 months ago

Mend Renovate

This PR contains the following updates:

| Package | Type | Update | Change |
|---|---|---|---|
| text-splitter | dependencies | minor | `0.11` -> `0.13` |

Release Notes

benbrandt/text-splitter (text-splitter)

### [`v0.13.1`](https://togithub.com/benbrandt/text-splitter/blob/HEAD/CHANGELOG.md#v0131)

[Compare Source](https://togithub.com/benbrandt/text-splitter/compare/v0.13.0...v0.13.1)

Fix a bug in the fallback logic to make sure we are still respecting the maximum bytes we should be searching in. Again, this only affects Markdown splitting at very small sizes.

### [`v0.13.0`](https://togithub.com/benbrandt/text-splitter/blob/HEAD/CHANGELOG.md#v0130)

[Compare Source](https://togithub.com/benbrandt/text-splitter/compare/v0.12.3...v0.13.0)

##### What's New / Breaking Changes

**Unicode segmentation is now only used as a fallback.** This prioritizes the semantic levels of each splitter, and only uses Unicode grapheme/word/sentence segmentation when none of the semantic levels can be split at the desired capacity. In most cases this won't change the behavior of the splitter, and it will likely improve speed, because the splitter can skip several semantic levels at the start, acting like a binary search, and only go back to the lower levels if a section can't fit.

However, for the `MarkdownSplitter` at very small sizes (i.e., less than 16 tokens), this may produce different output, because prior to this change the splitter may have used Unicode sentence segmentation instead of the Markdown semantic levels, due to an optimization in the level selection. Now the splitter prioritizes the parsed Markdown levels before falling back to Unicode segmentation, which preserves better structure at small sizes.

**So in most cases this is likely a non-breaking update.** However, if you were using extremely small chunk sizes for Markdown, the behavior is different, and I wanted to indicate that with a major version bump.

### [`v0.12.3`](https://togithub.com/benbrandt/text-splitter/blob/HEAD/CHANGELOG.md#v0123)

[Compare Source](https://togithub.com/benbrandt/text-splitter/compare/v0.12.2...v0.12.3)

##### Bug Fix

Remove leftover `dbg!` statements in the chunk overlap code [#164](https://togithub.com/benbrandt/text-splitter/pull/164) 🤦🏻‍♂️ Apologies if I spammed your logs!

### [`v0.12.2`](https://togithub.com/benbrandt/text-splitter/blob/HEAD/CHANGELOG.md#v0122)

[Compare Source](https://togithub.com/benbrandt/text-splitter/compare/v0.12.1...v0.12.2)

##### What's New

**Support for chunk overlapping:** Several of you have been waiting on this for a while now, and I am happy to say that chunk overlapping is now available in a way that still stays true to the spirit of finding good semantic break points. When a new chunk is emitted, if chunk overlapping is enabled, the splitter will look back at the semantic sections of the current level and pull in as many as possible that fit within the overlap window. This does mean that sometimes no overlap can be taken, which is often the case close to a higher semantic level boundary. It will almost always produce an overlap when the current semantic level couldn't be fit into a single chunk: there the splitter provides overlapping sections, since we may not have found a good break point in the middle of the section, which seems to be the main motivation for using chunk overlapping in the first place.
##### Rust Usage

```rust
let chunk_config = ChunkConfig::new(256)
    // .with_sizer(sizer) // Optional tokenizer or other chunk sizer impl
    .with_overlap(64)
    .expect("Overlap must be less than desired chunk capacity");
let splitter = TextSplitter::new(chunk_config); // Or MarkdownSplitter
```

##### Python Usage

```python
splitter = TextSplitter(256, overlap=64)  # or any of the class methods to use a tokenizer
```

### [`v0.12.1`](https://togithub.com/benbrandt/text-splitter/blob/HEAD/CHANGELOG.md#v0121)

[Compare Source](https://togithub.com/benbrandt/text-splitter/compare/v0.12.0...v0.12.1)

##### What's New

- [`rust_tokenizers`](https://crates.io/crates/rust_tokenizers) support has been added to the Rust crate.

### [`v0.12.0`](https://togithub.com/benbrandt/text-splitter/blob/HEAD/CHANGELOG.md#v0120)

[Compare Source](https://togithub.com/benbrandt/text-splitter/compare/v0.11.0...v0.12.0)

##### What's New

This release is a big API change that pulls all chunk configuration options into the same place, at initialization of the splitters. This was motivated by two things:

1. These settings are all important to deciding how to split the text for a given use case, and in practice I saw them often being set together anyway.
2. To prep the library for new features like chunk overlap, where error handling has to be introduced to make sure that invariants are kept between all of the settings. These errors should be handled as soon as possible, before chunking the text (a sketch of handling them follows the migration examples below).

Overall, I think this has aligned the library with the usage I have seen in the wild, and pulls all of the settings for the "domain" of chunking into a single unit.

##### Breaking Changes

##### Rust

- **Trimming is now enabled by default.** This brings the Rust crate into alignment with the Python package. For every use case I saw, this was already being set to `true`, and it logically makes sense as the default behavior.
- `TextSplitter` and `MarkdownSplitter` now take a `ChunkConfig` in their `::new` method.
  - This brings the `ChunkSizer`, `ChunkCapacity`, and `trim` settings into a single struct that can be instantiated with a builder-lite pattern.
  - The `with_trim_chunks` method has been removed from `TextSplitter` and `MarkdownSplitter`. You can now set `trim` in the `ChunkConfig` struct.
- `ChunkCapacity` is now a struct instead of a trait. If you were using a custom `ChunkCapacity`, you can change your `impl` to a `From<T> for ChunkCapacity` conversion instead, and you should still be able to pass it in to all of the same methods.
  - This also means `ChunkSizer`s take a concrete type in their method instead of an `impl`.

##### Migration Examples

**Default settings:**

```rust
/// Before
let splitter = TextSplitter::default().with_trim_chunks(true);
let chunks = splitter.chunks("your document text", 500);

/// After
let splitter = TextSplitter::new(500);
let chunks = splitter.chunks("your document text");
```

**Hugging Face Tokenizers:**

```rust
/// Before
let tokenizer = Tokenizer::from_pretrained("bert-base-cased", None).unwrap();
let splitter = TextSplitter::new(tokenizer).with_trim_chunks(true);
let chunks = splitter.chunks("your document text", 500);

/// After
let tokenizer = Tokenizer::from_pretrained("bert-base-cased", None).unwrap();
let splitter = TextSplitter::new(ChunkConfig::new(500).with_sizer(tokenizer));
let chunks = splitter.chunks("your document text");
```

**Tiktoken:**

```rust
/// Before
let tokenizer = cl100k_base().unwrap();
let splitter = TextSplitter::new(tokenizer).with_trim_chunks(true);
let chunks = splitter.chunks("your document text", 500);

/// After
let tokenizer = cl100k_base().unwrap();
let splitter = TextSplitter::new(ChunkConfig::new(500).with_sizer(tokenizer));
let chunks = splitter.chunks("your document text");
```

**Ranges:**

```rust
/// Before
let splitter = TextSplitter::default().with_trim_chunks(true);
let chunks = splitter.chunks("your document text", 500..2000);

/// After
let splitter = TextSplitter::new(500..2000);
let chunks = splitter.chunks("your document text");
```

**Markdown:**

```rust
/// Before
let splitter = MarkdownSplitter::default().with_trim_chunks(true);
let chunks = splitter.chunks("your document text", 500);

/// After
let splitter = MarkdownSplitter::new(500);
let chunks = splitter.chunks("your document text");
```

**ChunkSizer impls:**

```rust
pub trait ChunkSizer {
    /// Before
    fn chunk_size(&self, chunk: &str, capacity: &impl ChunkCapacity) -> ChunkSize;
    /// After
    fn chunk_size(&self, chunk: &str, capacity: &ChunkCapacity) -> ChunkSize;
}
```

**ChunkCapacity impls:**

```rust
/// Before
impl ChunkCapacity for Range<usize> {
    fn start(&self) -> Option<usize> {
        Some(self.start)
    }

    fn end(&self) -> usize {
        self.end.saturating_sub(1).max(self.start)
    }
}

/// After
impl From<Range<usize>> for ChunkCapacity {
    fn from(range: Range<usize>) -> Self {
        ChunkCapacity::new(range.start)
            .with_max(range.end.saturating_sub(1).max(range.start))
            .expect("invalid range")
    }
}
```

##### Python

- Chunk `capacity` is now a required argument in the `__init__` and classmethods of `TextSplitter` and `MarkdownSplitter`.
- The `trim_chunks` parameter is now just `trim` in the `__init__` and classmethods of `TextSplitter` and `MarkdownSplitter`.

##### Migration Examples

**Default settings:**

```python
# Before
splitter = TextSplitter(trim_chunks=False)
chunks = splitter.chunks("your document text", 500)

# After
splitter = TextSplitter(500, trim=False)
chunks = splitter.chunks("your document text")
```
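Since invalid combinations of these settings (for example, an overlap that is not smaller than the capacity) are now reported as errors at construction time, here is a minimal sketch of surfacing them with `?` instead of panicking via `.expect(...)`. This is an illustration against the 0.13 API described above; the error is boxed because its concrete type name is not shown in these notes:

```rust
use text_splitter::{ChunkConfig, TextSplitter};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // with_overlap returns a Result; an invalid combination (overlap >= capacity)
    // propagates as an error instead of aborting the program.
    let config = ChunkConfig::new(256).with_overlap(64)?;
    let splitter = TextSplitter::new(config);
    let chunks: Vec<&str> = splitter.chunks("your document text").collect();
    println!("produced {} chunks", chunks.len());
    Ok(())
}
```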

Configuration

📅 Schedule: Branch creation - "after 1am every 3 weeks on Saturday" in timezone America/Los_Angeles, Automerge - At any time (no schedule defined).

🚦 Automerge: Disabled by config. Please merge this manually once you are satisfied.

♻ Rebasing: Whenever PR becomes conflicted, or you tick the rebase/retry checkbox.

🔕 Ignore: Close this PR and you won't be reminded about this update again.



This PR has been generated by Mend Renovate. View repository job log here.

benbrandt commented 3 months ago

@Abraxas-365 would you like me to tackle this one since I introduced breaking changes?

It does come with the benefit of chunk overlap being available, though I'm not sure how you want to propagate that into the other APIs.
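For concreteness, a hypothetical sketch of one way overlap could be threaded through a splitter wrapper in this crate; `CharacterSplitter` and its fields are invented for illustration and are not langchain-rust's actual API:

```rust
use text_splitter::{ChunkConfig, TextSplitter};

// Hypothetical wrapper; the name and fields are illustrative only.
pub struct CharacterSplitter {
    capacity: usize,
    overlap: usize,
}

impl CharacterSplitter {
    pub fn split_text(&self, text: &str) -> Result<Vec<String>, Box<dyn std::error::Error>> {
        // Build the fallible ChunkConfig and propagate any invalid-setting error.
        let config = ChunkConfig::new(self.capacity).with_overlap(self.overlap)?;
        let splitter = TextSplitter::new(config);
        Ok(splitter.chunks(text).map(|c| c.to_string()).collect())
    }
}
```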

Abraxas-365 commented 3 months ago

@benbrandt Yes, I'd appreciate your help with this, especially with the new changes and chunk overlap. Thanks for stepping in! 🙌🏽🙌🏽🙌🏽

If there is an error with the chunking, or any other error, just add it to `src/text_splitter/error.rs` and propagate it, so that whoever calls the function can handle it however they want.
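A rough sketch of what that could look like, assuming a `thiserror`-style error enum; the variant names and function signature here are illustrative, not the final implementation:

```rust
// src/text_splitter/error.rs (illustrative sketch; actual variants may differ)
use thiserror::Error;

#[derive(Error, Debug)]
pub enum TextSplitterError {
    #[error("invalid chunk configuration: {0}")]
    InvalidChunkConfig(String),
}

// Callers get a Result and decide how to handle failures themselves.
pub fn split_text(
    text: &str,
    capacity: usize,
    overlap: usize,
) -> Result<Vec<String>, TextSplitterError> {
    let config = text_splitter::ChunkConfig::new(capacity)
        .with_overlap(overlap)
        .map_err(|e| TextSplitterError::InvalidChunkConfig(e.to_string()))?;
    let splitter = text_splitter::TextSplitter::new(config);
    Ok(splitter.chunks(text).map(|c| c.to_string()).collect())
}
```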

renovate[bot] commented 3 months ago

Edited/Blocked Notification

Renovate will not automatically rebase this PR, because it does not recognize the last commit author and assumes somebody else may have edited the PR.

You can manually request rebase by checking the rebase/retry box above.

⚠️ Warning: custom changes will be lost.