Eventual-Inc / Daft

Distributed data engine for Python/SQL designed for the cloud, powered by Rust
https://getdaft.io
Apache License 2.0
2.35k stars 167 forks source link

Add support for handling multiple spaces between words in WindowedWords iterator #3133

Open andrewgazelka opened 4 weeks ago

andrewgazelka commented 4 weeks ago

Description

Currently, the WindowedWords iterator doesn't properly handle text with multiple consecutive spaces between words. The functionality is intentionally disabled (commented out) in the test suite for performance reasons.

Current Behavior

Desired Behavior

As shown in the commented test case:

#[test]
fn test_multiple_spaces_between_words() {
    let s = "Hello    world  from  Rust";
    let result: Vec<&str> = s.windowed_words(2).collect();
    assert_eq!(result, vec!["Hello    world", "world  from", "from  Rust"]);
}

The iterator should:

  1. Preserve all spaces between words in the output slices
  2. Treat consecutive spaces as a single word boundary
  3. Maintain the exact spacing in the output windows
jaychia commented 2 weeks ago

Can we have more context in this PR from a user-perspective? I see that this has to do with the minhash kernel.

andrewgazelka commented 2 weeks ago

I don't believe it's a regression in functionality. Originally, the behavior was like this. It doesn't affect correctness per se, as it's still correct, but it could affect expected behavior since people might expect multiple spaces to be treated as one. I don't think this is necessarily urgent unless someone has an issue with it.