crate-ci / typos

Source code spell checker
Apache License 2.0

typos-dict: Allow `muc` #1149

Closed cbachhuber closed 2 weeks ago

cbachhuber commented 3 weeks ago

Thank you for this awesome tool!

Related to https://github.com/crate-ci/typos/pull/1148: In my opinion, `muc` should not be auto-corrected to `much`. MUC is the IATA code for Munich Airport and a widely used shorthand for Munich.

What's your opinion?

epage commented 3 weeks ago

In the past, we accidentally let words be corrected that shouldn't have been, which led to allowed.csv. After seeing enough English words go through that, we added english.csv.

What I'm wondering is what our bar should be for including IATA codes and whether we should mass import them. Or whether these are specialized enough that people should handle them in their own config, so that everyone else keeps the corrections that mass-importing IATA codes would remove.

cbachhuber commented 3 weeks ago

That's a great question. How did you historically decide such questions? Did you rely on word count databases such as Google Books Ngram? Maybe we should also check how much overlap there actually is with the typos dictionary? There are 11,300 IATA codes currently assigned.

Intuitively, I lean towards not blanketly importing all IATA codes, but only select, widely used ones such as JFK, HKG, or FRA (and of course MUC, where I'm from 😉).
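
The overlap check suggested above could be sketched roughly as follows. This is only an illustration under stated assumptions: it assumes the dictionary is available as CSV rows of the form `typo,correction[,correction...]`, the helper names `load_typos` and `overlap` are hypothetical, and the sample data is a tiny stand-in rather than the real dictionary or the real IATA list.

```python
import csv
import io


def load_typos(csv_text):
    """Parse rows assumed to look like typo,correction[,correction...]."""
    table = {}
    for row in csv.reader(io.StringIO(csv_text)):
        if row:  # skip blank lines
            table[row[0].lower()] = row[1:]
    return table


def overlap(codes, typos):
    """Return the codes (lowercased) that the dictionary would try to correct."""
    return {c.lower(): typos[c.lower()] for c in codes if c.lower() in typos}


# Illustrative sample only; a real check would load the full dictionary
# and the full list of assigned IATA codes.
SAMPLE_DICT = "muc,much\nteh,the\n"
SAMPLE_CODES = ["MUC", "JFK", "HKG", "FRA"]

hits = overlap(SAMPLE_CODES, load_typos(SAMPLE_DICT))
print(hits)
```

Counting the entries in `hits` against the total number of codes would give a rough sense of how many corrections a mass import would disable.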

epage commented 3 weeks ago

I want to step back and double check the actual underlying ask.

How much of this is about the IATA code vs a shorthand for Munich? Those are different use cases with different needs.

Also, why is this coming up in the code? In my own experience in programming, I rarely mention locations and never need to mention airports or cities enough to use shorthands like that.

> How did you historically decide such questions?

For cases like this, unfortunately, gut feel. There are the problems I've had or seen that give me a feel for some concepts being used in code (base64, uuids, programming terms, etc). When it comes to more specialized domains, the question is how specialized, and that's where I need input from you all to better understand. If it's too specialized, then I figure we'd be doing more harm than good. If we included every potentially valid collection of letters (company names, IATA codes, given names and surnames), then there wouldn't be much left to correct (e.g. we use the surname teh as an example in our docs).

cbachhuber commented 3 weeks ago

> How much of this is about the IATA code vs a shorthand for Munich?

Good point! I only used IATA to give a proper reference for MUC, so IATA codes are really secondary here. In our proprietary project, we use `muc` in configs to point to files recorded in Munich that have `muc` in the filename.

I'll follow your gut judgement with this. I'm absolutely ok if you say that the use case I present is too niche 👍

epage commented 3 weeks ago

Would help to have more context on why the location is used. Is it inherent to the problem or more of a logging aid? Are many other locations used and what is the practice around notating them?

cbachhuber commented 3 weeks ago

It is inherent to our company setup, but not essential: we're developing lidar perception and arbitration software. As part of our development process, we create sensor recordings of sample drives to be used in reprocessing or as input to KPI pipelines. Naturally, we record these close to our offices, which are, among others, located in Munich, Orlando, and San Francisco. The recording filenames typically contain high-level recording context such as sensor setup, date, and location (MUC, MCO, SFO).

So this is just for anyone working with the files to quickly get a rough understanding of what that file contains without needing to dive into the file metadata.

epage commented 2 weeks ago

Thanks for clarifying! That seems specialized enough that without much more interest, we'll pass for now, preferring people handle this in their config files.
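
For readers landing here with the same need, a per-project override might look like the following minimal sketch. It assumes a `_typos.toml` at the repository root; mapping a word to itself in `extend-words` marks it as accepted.

```toml
# _typos.toml (assumed to live at the repository root)

[default.extend-words]
# Mapping a word to itself tells typos to accept it as spelled.
muc = "muc"
```

Other location shorthands only need an entry of the same shape if typos actually flags them in your project.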