Open andrewgazelka opened 4 weeks ago
Can we have more context in this PR from a user-perspective? I see that this has to do with the minhash kernel.
I don't believe it's a regression in functionality; the behavior was originally like this. It doesn't affect correctness per se, but it could affect expected behavior, since people might expect multiple spaces to be treated as one. I don't think it's urgent unless someone has an issue with it.
Description
Currently, the WindowedWords iterator doesn't properly handle text with multiple consecutive spaces between words. The functionality is intentionally disabled (commented out) in the test suite for performance reasons.
Current Behavior
The iterator uses memchr::memchr_iter(b' ', text.as_bytes()) to find space characters.
Desired Behavior
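To make the difference concrete, here is a minimal std-only sketch, not the project's code. `space_positions` is a stand-in for the `memchr::memchr_iter(b' ', text.as_bytes())` scan mentioned above. `windows_per_space` and `windows_collapsed` are hypothetical helpers invented for illustration; the real WindowedWords iterator may behave differently. The sketch assumes that treating every single space as a word boundary is what lets empty "words" appear when spaces repeat.

```rust
/// Byte offsets of every b' ' in `text`. A std-only stand-in for
/// `memchr::memchr_iter(b' ', text.as_bytes())`.
fn space_positions(text: &str) -> Vec<usize> {
    text.bytes()
        .enumerate()
        .filter(|&(_, b)| b == b' ')
        .map(|(i, _)| i)
        .collect()
}

/// Hypothetical: windows of `size` words where every single space is a
/// boundary, so a run of spaces produces empty words. `size` must be > 0.
fn windows_per_space(text: &str, size: usize) -> Vec<String> {
    let words: Vec<&str> = text.split(' ').collect();
    words.windows(size).map(|w| w.join(" ")).collect()
}

/// Hypothetical: windows of `size` words where runs of whitespace count as
/// one boundary. Note that `split_whitespace` also splits on tabs and
/// newlines, not only on b' '. `size` must be > 0.
fn windows_collapsed(text: &str, size: usize) -> Vec<String> {
    let words: Vec<&str> = text.split_whitespace().collect();
    words.windows(size).map(|w| w.join(" ")).collect()
}

fn main() {
    let text = "a  b c"; // two spaces between "a" and "b"
    println!("{:?}", space_positions(text)); // [1, 2, 4]
    println!("{:?}", windows_per_space(text, 2)); // ["a ", " b", "b c"]
    println!("{:?}", windows_collapsed(text, 2)); // ["a b", "b c"]
}
```

The adjacent offsets 1 and 2 are the consecutive spaces. A scan that reports each one separately has to skip or merge such neighbours if multiple spaces are to be treated as one.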
As shown in the commented test case:
The iterator should: