ellenhp / airmail

Lightweight geocoder in pure Rust
https://airmail.rs/
Apache License 2.0

Make better use of spoken language data in WhosOnFirst #12

Open ellenhp opened 5 months ago

ellenhp commented 5 months ago

At a bare minimum, spoken language data should determine which dictionary airmail_indexer uses when generating all the abbreviation permutations.

I also want to find a way to use it to pick the correct stemmer for each language. Once focus point queries are supported (currently we only have bounding box queries), we can look up in WOF the languages spoken at the focus point and in surrounding areas, and use stemmers for those languages. Doing this will involve splitting out the fields we use by language. Currently there's only one field, "content", but eventually we'll need more for handling matches that need to get boosted. It's outside the scope of this issue, but those boosted fields may also need a version for each language. I'm thinking we can use lingua-rs to pick the top 5 possible languages for every query, and then search against the corresponding per-language fields in a disjunction, using stemmers as appropriate?
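A rough sketch of the "top 5 languages, then a disjunction over per-language fields" idea, using only the standard library. Everything here is hypothetical: `LanguageGuess`, `FieldClause`, and the `content_<code>` field naming are made up for illustration, the confidence values would in practice come from a language detector such as lingua-rs, and the resulting clauses would still need to be turned into real index queries.

```rust
/// A candidate language for a query, with a detector confidence in [0, 1].
/// Hypothetical type: the real values would come from a language detector.
#[derive(Debug, Clone, PartialEq)]
pub struct LanguageGuess {
    pub code: &'static str, // ISO 639-1 code, e.g. "en"
    pub confidence: f64,
}

/// One arm of the disjunction: search `text` in `field`.
#[derive(Debug, Clone, PartialEq)]
pub struct FieldClause {
    pub field: String,
    pub text: String,
}

/// Keep the `k` most confident guesses, highest confidence first.
pub fn top_k_languages(mut guesses: Vec<LanguageGuess>, k: usize) -> Vec<LanguageGuess> {
    guesses.sort_by(|a, b| {
        b.confidence
            .partial_cmp(&a.confidence)
            .unwrap_or(std::cmp::Ordering::Equal)
    });
    guesses.truncate(k);
    guesses
}

/// Assumed naming scheme: one indexed field per language, e.g. "content_en".
pub fn field_for(code: &str) -> String {
    format!("content_{}", code)
}

/// Build one clause per candidate language. The caller would OR these
/// together and apply each language's stemmer to its own clause.
pub fn disjunction(query: &str, langs: &[LanguageGuess]) -> Vec<FieldClause> {
    langs
        .iter()
        .map(|lang| FieldClause {
            field: field_for(lang.code),
            text: query.to_string(),
        })
        .collect()
}
```

Calling `top_k_languages(guesses, 5)` and passing the result to `disjunction` yields at most five clauses, which keeps the per-query cost bounded no matter how many languages the detector reports.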

There will be a performance cost to this, of course, but the lack of stemmers is really disappointing: with lenient mode off (no prefix queries allowed), I can't search for "mighty-o donut" if the POI is called "mighty-o donuts". When I briefly had stemming working on a feature branch, it was so cool to watch things like "tow truck" match "XYZ towing company". That's the kind of thing that I think airmail needs to really stand out, even if it has to be disabled for remote indexes.
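To make the two examples above concrete, here is a toy illustration of why stemming changes which terms match. `toy_stem` is a deliberately naive English-only suffix stripper invented for this sketch. It is not a real stemmer (a Snowball-style one handles far more cases) and it is not what the feature branch used. `shared_terms` and `lowercase_only` are hypothetical helpers too.

```rust
use std::collections::HashSet;

/// Baseline normalization: lowercase only, no stemming.
pub fn lowercase_only(token: &str) -> String {
    token.to_lowercase()
}

/// Naive English-only stemmer for illustration: strips a trailing "ing"
/// or a single trailing "s" when at least three characters remain.
pub fn toy_stem(token: &str) -> String {
    let lower = token.to_lowercase();
    if let Some(base) = lower.strip_suffix("ing") {
        if base.len() >= 3 {
            return base.to_string();
        }
    }
    if let Some(base) = lower.strip_suffix('s') {
        // Leave words like "glass" alone.
        if base.len() >= 3 && !base.ends_with('s') {
            return base.to_string();
        }
    }
    lower
}

/// Count distinct query terms that also occur in the document after
/// both sides are passed through `normalize`.
pub fn shared_terms(query: &str, doc: &str, normalize: fn(&str) -> String) -> usize {
    let doc_terms: HashSet<String> = doc.split_whitespace().map(normalize).collect();
    let query_terms: HashSet<String> = query.split_whitespace().map(normalize).collect();
    query_terms.intersection(&doc_terms).count()
}
```

With `lowercase_only`, "mighty-o donut" shares only one term with "mighty-o donuts" and "tow truck" shares none with "XYZ towing company". With `toy_stem`, the first pair shares both terms and the second shares "tow".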