OSMLatvija / Osmalyzer

Parsing OSM data in Latvia against various data sources
https://osmlatvija.github.io/Osmalyzer/
GNU General Public License v3.0

Common misspellings #43

Open HellMapGoesCoding opened 4 weeks ago

HellMapGoesCoding commented 4 weeks ago

Latvian language-specific known misspellings and unlikely spellings/words/terms. The biggest problem is coming up with a list: we would need to manually go through all the names or slowly build it up over time. We also need to consider false positives and exceptions.

richlv commented 4 weeks ago

The list is likely not the biggest problem, as it can be built up over time by adding anything that gets noticed. I actually had such a list somewhere locally and used to check it manually on a periodic basis... But finding it would likely take more time than creating a new one.

markalex2209 commented 4 weeks ago

List of typos: https://lv.wikipedia.org/wiki/Vikiprojekts:Vikip%C4%93dijas_uzlabo%C5%A1ana/Raksti/Typo

Spellchecker dictionary used by Firefox is based on this: https://dict.dv.lv/documentation.php?prj=lv

I think we can try and use one or both of them.

HellMapGoesCoding commented 4 weeks ago

So are we/you proposing to run against known incorrect values (a blacklist) or against all known correct values (a spellcheck whitelist)? I don't want to speculate before even trying it out, but I suspect the whitelist will have way too many false positives and the blacklist way too few actionable true positives. This is why I was pessimistic about this task and called producing an actual "list" the "biggest problem". I also imagine names on a map have a somewhat different word distribution than prose text on Wikipedia. But hey, I'd like to be proven wrong. Thanks for looking into it, at least; I might download those lists, try them out, and see what the results are.

markalex2209 commented 3 weeks ago

I think both approaches would be (or at least might be) useful.

Checking against a dictionary gets us a quick reference for general spelling. It will probably result in a lot of false positives, but I hope it will be manageable. One possible caveat with this approach: words with an uppercase first letter might be ignored or checked less strictly, at least judging by the browser's spellcheck behaviour.

Checking against known incorrect words would give us a narrower list of words that are not expected. This would also make it possible to flag words that exist and are spelled correctly but are not expected on the map, like obscenities or commonly confused words.
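
To make the comparison concrete, here is a minimal standalone sketch of both checks, in plain C# using only the standard library; the word lists, tokenization, and names are illustrative, not Osmalyzer's actual analyzer code:

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class SpellingSketch
{
    // Whitelist approach: flag every word the dictionary has never seen
    // (e.g. a word list derived from dict.dv.lv).
    public static IEnumerable<string> UnknownWords(string name, HashSet<string> dictionary) =>
        Tokenize(name).Where(word => !dictionary.Contains(word.ToLowerInvariant()));

    // Blacklist approach: flag only words that appear in a known-typo list
    // (e.g. entries collected from the Wikipedia typo page, mapping typo -> correction).
    public static IEnumerable<string> KnownTypos(string name, IReadOnlyDictionary<string, string> typoToFix) =>
        Tokenize(name).Where(word => typoToFix.ContainsKey(word.ToLowerInvariant()));

    // Naive tokenization: take runs of letters; real names also need handling of digits,
    // abbreviations, punctuation and capitalised proper nouns.
    private static IEnumerable<string> Tokenize(string name) =>
        Regex.Matches(name, @"\p{L}+").Cast<Match>().Select(m => m.Value);
}
```

The whitelist flags anything the dictionary has never seen (brand names, abbreviations, non-Latvian words), while the blacklist only fires on entries someone has explicitly added, which is exactly the trade-off described above.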

richlv commented 3 weeks ago

I suspect Wikipedia's typo list might not be a good fit for OSM, with entries like "paronoja -> paranoja" :) Personally, I'd go with a simple blacklist, building it up over time. Although if somebody is willing to do the whitelist and clean it up, that would be much more powerful.

A spellcheck dictionary wouldn't account for shop names, brand names, etc., but extending it with those could be a great long-term goal.

Which tags were meant to be covered here? I guess name, brand, operator, inscription...

HellMapGoesCoding commented 3 weeks ago

As an experiment, I put all names through a spellchecker (i.e. the whitelist approach), with the predictable result: "There are 13035 unknown-spelling values from 27861 (out of 56202) elements" (https://osmlatvija.github.io/Osmalyzer/Spelling%20report.html). So it's currently about half of all the names. I have some ideas, but make your own conclusions for now ;)

markalex2209 commented 3 weeks ago

Took a peek at the results. One thing that immediately stood out is transborder objects that have names in multiple languages, like Vadakstis / Vadakste. I think those should be excluded from the analysis. Would it be reasonable to simply filter them out by the presence of / (space + slash + space)?

HellMapGoesCoding commented 3 weeks ago

> Took a peek at the results. One thing that immediately stood out is transborder objects that have names in multiple languages, like Vadakstis / Vadakste. I think those should be excluded from the analysis. Would it be reasonable to simply filter them out by the presence of / (space + slash + space)?

Yes, this was something I also noticed. I was working on splitting on slashes, but hadn't committed because I kept finding new ways for things to break. Notably, there are valid reasons to have a / in a name and not split it. And I cannot rely on a space-delimited / either, because not all names use that and I'm not about to go editing every such object (it seems accepted, judging from examples I checked around the world).

I have committed my changes now; the check splits the name into parts and checks each individually, although it doesn't know that they aren't all Latvian.

Ideally, the OSM object would have name=Dingus / Дингус and also name:lv=Dingus and name:ru=Дингус, so that I can match those against the parts and then ignore the non-checkable ones.
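
As a rough illustration of this splitting and part-matching (a standalone C# sketch, not the committed code; the tag keys follow the example above):

```csharp
using System.Collections.Generic;
using System.Linq;

public static class NameSplitSketch
{
    // Split a potentially multilingual name into parts. " / " is the common delimiter,
    // but a bare "/" also occurs, and some slashes legitimately belong to a single name.
    public static string[] SplitParts(string name) =>
        name.Split('/').Select(part => part.Trim()).Where(part => part.Length > 0).ToArray();

    // Skip parts that exactly match a non-Latvian name:* tag value; for
    // name=Dingus / Дингус with name:lv=Dingus and name:ru=Дингус only "Dingus" remains.
    public static IEnumerable<string> CheckableParts(string name, IReadOnlyDictionary<string, string> tags)
    {
        foreach (string part in SplitParts(name))
        {
            bool matchesOtherLanguage = tags
                .Where(tag => tag.Key.StartsWith("name:") && tag.Key != "name:lv")
                .Any(tag => tag.Value == part);

            if (!matchesOtherLanguage)
                yield return part; // may still be non-Latvian if the language tags are missing
        }
    }
}
```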

markalex2209 commented 3 weeks ago

I meant ignoring those altogether, because the part in a different language will (always?) cause a false positive. But we might even have a better way out: if the object has name:lv, only check that, and if not, fall back to name. And let's leave the correlation between name and name:lv for another day.*


* This might already be partially implemented in the transliteration check, but only for roads.
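
The proposed fallback could look roughly like this (again only a sketch, assuming the element's tags are available as a key-value lookup):

```csharp
using System.Collections.Generic;

public static class NameSelectionSketch
{
    // Prefer the explicitly Latvian name when it is tagged; otherwise fall back to name
    // and accept that a multilingual value may still produce false positives.
    public static string? NameToSpellCheck(IReadOnlyDictionary<string, string> tags)
    {
        if (tags.TryGetValue("name:lv", out string? latvianName))
            return latvianName;

        return tags.TryGetValue("name", out string? genericName) ? genericName : null;
    }
}
```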