BurntSushi / fst

Represent large sets and maps compactly with finite state transducers.
The Unlicense
1.76k stars 123 forks source link

Arabic support #160

Open ghost opened 1 year ago

ghost commented 1 year ago

Hey, I'm trying to use fst with arabic, with levenshtein, and it seems to not work at all... The code I'm working with:

use fst::{automaton::Levenshtein, IntoStreamer, Map};

fn main() {
    let mut data = vec![("مصر", 1)];
    data.sort();

    // A convenient way to create sets in memory.
    let map = Map::from_iter(data)
        .unwrap();
    // Build our fuzzy query.
    let lev = Levenshtein::new("مصر", 2).unwrap();

    // Apply our fuzzy query to the set we built.
    let stream = map.search(lev).into_stream();
    let keys = stream.into_stream().into_str_keys().unwrap();
    println!("keys: {keys:#?}");
}

I've enabled the utf8-ranges and levenshtein features on the crate.

ghost commented 1 year ago

Looking at #38 it seems the library has issues with utf8 searching as a whole ☹️

BurntSushi commented 1 year ago

38 is a tough bug to follow, and I'm not sure there is even a solid reproduction there yet? In any case, generalizing from "the levenshtein implementation has some bugs in places" to the entire library seems pretty egregious to me. The vast majority of this library doesn't care about UTF-8 at all. It's just bytes. Levenshtein tries to do the somewhat-correct thing of measuring edit distance in terms of the number of codepoints. There may indeed be bugs there. I personally do not have any time in the short term to look into those bugs. The Levenshtein code was something I whipped together several years ago, and it has not seen much love since then.

But for this one, it looks like it's a nice small reproduction. Thank you for that. I can indeed reproduce it.

Folks are welcome to take a dive into the Levenshtein automata code and see if they can fix it. PRs are welcome. #38 appears to have some useful suggestions.