codespell-project / codespell

check code for common misspellings
GNU General Public License v2.0

Eventually codespell will need to break up dictionary.txt into separate files #1275

Open luzpaz opened 5 years ago

luzpaz commented 5 years ago

I noticed that when trying to view gelma's dictionary (#1244), GH won't let me view the file via the UI and requests that I use git to view it locally. Also, at around 2.5MB, a dictionary.txt file starts to slow down Atom and GitKraken (a GUI git client) when viewing or diffing.

We aren't at that stage yet but we should prepare our thinking for it.

My proposal is to use the repology-rules model where each letter is its own separate file.

This obviously makes contributing a PITA, so thinking further (just spit-balling here), maybe we can program some sort of:

  1. local script that gets bundled into codespell and helps users add new words to their dictionary files, say in preparation for a PR
  2. and/or a git bot that listens to dictionary.txt PRs, parses out the additions/removals alphabetically, and then sorts/adds them into the appropriate files
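
To make the first idea a bit more concrete, here is a rough sketch of what such a helper might look like. It assumes each dictionary line has the `misspelling->correction` form and that the split files are named after the first letter (`a.txt`, `b.txt`, ...) with an `other.txt` catch-all. The function names, the file layout, and the sample entries are all hypothetical, not anything codespell ships today.

```python
from collections import defaultdict
from pathlib import Path
import tempfile


def bucket_for(entry: str) -> str:
    """Return the per-letter bucket name for a 'misspelling->correction' line."""
    first = entry[:1].lower()
    # Assumption: anything not starting with an ASCII letter goes to 'other'.
    return first if first.isascii() and first.isalpha() else "other"


def split_dictionary(lines):
    """Group non-empty dictionary lines by first letter, sorted within each bucket."""
    buckets = defaultdict(list)
    for line in lines:
        line = line.strip()
        if line:
            buckets[bucket_for(line)].append(line)
    return {name: sorted(entries) for name, entries in buckets.items()}


def add_entry(directory: Path, entry: str) -> Path:
    """Insert one entry into the matching per-letter file, keeping it sorted and unique."""
    path = directory / f"{bucket_for(entry)}.txt"
    existing = path.read_text(encoding="utf-8").splitlines() if path.exists() else []
    merged = sorted(set(existing) | {entry})
    path.write_text("\n".join(merged) + "\n", encoding="utf-8")
    return path


if __name__ == "__main__":
    # Illustrative entries only.
    sample = ["teh->the", "abandonned->abandoned", "adress->address"]
    with tempfile.TemporaryDirectory() as tmp:
        out = Path(tmp)
        for name, entries in split_dictionary(sample).items():
            (out / f"{name}.txt").write_text("\n".join(entries) + "\n", encoding="utf-8")
        add_entry(out, "acheive->achieve")
        print((out / "a.txt").read_text(encoding="utf-8"))
```

A bot along the lines of the second idea could presumably reuse the same `add_entry` logic on the lines added in a PR diff, but that part is left out here.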

peternewman commented 4 years ago

We've done some of this due to the multi-dictionary stuff, although not significantly for the main one.

lurch commented 4 years ago

Sounds like this also touches on #1361

lurch commented 4 years ago

The main dictionary.txt is already too big for https://github.com/codespell-project/codespell/blame/master/codespell_lib/data/dictionary.txt to work properly ;-) (although splitting it up wouldn't actually help with the 'blame', because the history would then only go as far back as the file-splitting.)

sebweb3r commented 4 years ago

I'm in favor of sorting/splitting the dictionaries too.

An additional problem is enUS vs. enGB (#1468). At the moment, one has to run codespell twice to get a properly fixed file: first to fix misspelled words (which may be corrected to enGB spellings), and then again to convert those fixes to enUS.
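
One way the two runs could in principle be collapsed into one is to fold the enGB-to-enUS mapping into the misspelling mapping up front. The snippet below is only a toy sketch of that idea: the dictionaries are made-up examples, `compose` and `fix_text` are hypothetical names, and it does not reflect how codespell actually loads or applies its dictionaries.

```python
from __future__ import annotations


def compose(misspellings: dict[str, str], variants: dict[str, str]) -> dict[str, str]:
    """Fold a variant mapping (e.g. enGB -> enUS) into a misspelling mapping."""
    combined = dict(variants)
    for wrong, fixed in misspellings.items():
        # If the fix is itself a variant spelling, point straight at the final form.
        combined[wrong] = variants.get(fixed, fixed)
    return combined


def fix_text(text: str, mapping: dict[str, str]) -> str:
    """Naive whitespace-based replacement, for illustration only."""
    return " ".join(mapping.get(word, word) for word in text.split())


if __name__ == "__main__":
    # Made-up entries: a misspelling whose fix is an enGB spelling.
    misspellings = {"colur": "colour"}
    gb_to_us = {"colour": "color"}
    combined = compose(misspellings, gb_to_us)
    print(fix_text("the colur and colour", combined))
```

With the combined mapping, both the misspelling and the enGB spelling end up as the enUS form in a single pass.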

sebweb3r commented 4 years ago

Maybe it would be good to split codespell into repositories for code and dictionaries.

lurch commented 4 years ago

> Maybe it would be good to split codespell into repositories for code and dictionaries.

Yeah, I guess the commit-history for the code would be much easier to read if it wasn't also peppered with dictionary updates :slightly_smiling_face: