Open luzpaz opened 7 years ago
Basically it would mean to use a different dictionary, or to annotate the current one with markers for British/American corrections. Then add a flag to the tool to tell what "dialect" you want. This doesn't exist right now in codespell but could be doable.
We can start a British word list in this thread and then, when the time comes, create the logic mentioned in https://github.com/lucasdemarchi/codespell/issues/103#issuecomment-290755564 What would be the best way to go about this? Creating a gist? Or should we just start a new file in the repo?
~artefact->artifact~
~artefacts->artifacts~
~behaviour (behavior)~
~cancellation (cancelation)~
~cancelling (canceling)~
~cancelled (canceled)~
~capitalise (capitalize)~
~catalogue (catalog)~
~centimetre (centimeter)~
~centralise (centralize)~
~centre (center)~
~colour (color)~
~colours (colors)~
~digitise (digitize)~
~digitising (digitizing)~
~flavour (flavor)~
~flavours (flavors)~
~initialisation (initialization)~
~initialise (initialize)~
~initialised (initialized)~
~initialises (initializes)~
~initialising (initializing)~
~labelled (labeled)~
~labelling (labeling)~
~licence (license)~
~licenced (licensed)~
~minimise (minimize)~
~minimising (minimizing)~
~parametrise (parametrize)~
~prioritise (prioritize)~
~prioritising (prioritizing)~
~rasterise (rasterize)~
~realise (realize)~
~resizeable (resizable)~
~specialise (specialize)~
~uninitialised (uninitialized)~
~utilise (utilize)~
~writeable (writable)~
Under development
All crossed out words above have been added to the following dictionaries:
Would something slightly more complex be more flexible, and allow for more language support?
It strikes me there are a few different cases (I'm ignoring the obvious color/colour, as that has the challenge with e.g. HTML, where you can't pick, although perhaps that has to be solved using other codespell features):
- In British mode, you want standardized->standardised.
- In American mode, you want standardised->standardized.
- standarddi[sz]ed and any other genuine misspellings want to map to the appropriate localised version.
In terms of the functionality already in codespell (from my brief understanding), it feels like a lot of this could be done if it could handle multiple dictionary files. Indeed, looking at https://github.com/lucasdemarchi/codespell/blob/master/codespell_lib/_codespell.py#L304 it's using a dict, so just loading multiple files in order should allow this.
The theory is as follows: remove standarddi[sz]ed from the main dict, store standarddised->standardised, standarddized->standardised and standardized->standardised in BrE, and the zed versions in AmE. Use some magic to generate a generic English one too, which merges both versions (but, I guess, leaves out standardized->standardised and standardised->standardized, i.e. entries where there's one fix and that fix is in the other dict as a reverse map).
People who aren't bothered load the base file (without standarddi[sz]ed) and the generated common one, so standarddised and standarddized would both suggest the s and z options (for bonus points, prioritise the most likely one, based on whether the misspelling had an s or a z). People who want a specific flavour load the base (optionally the common), but crucially their flavour last, which overrides, so standarddised and standarddized both correct to standardised.
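The "generic English" merge described above could be sketched roughly as follows. This is only an illustration: `build_common` and the two small dicts are made-up names and data, not anything in codespell, and ranking by `difflib` similarity is just one assumed way to "prioritise the most likely one".

```python
from difflib import SequenceMatcher

# Hypothetical per-dialect maps (misspelling -> fix), as described in the comment above.
bre = {
    "standarddised": "standardised",
    "standarddized": "standardised",
    "standardized": "standardised",
}
ame = {
    "standarddised": "standardized",
    "standarddized": "standardized",
    "standardised": "standardized",
}


def build_common(first, second):
    """Merge two dialect maps into one map of misspelling -> list of fixes."""
    # Words that are a correct spelling in either dialect must not be flagged.
    valid = set(first.values()) | set(second.values())
    common = {}
    for wrong in sorted(set(first) | set(second)):
        if wrong in valid:
            continue  # skip reverse maps such as standardized->standardised
        fixes = []
        for mapping in (first, second):
            fix = mapping.get(wrong)
            if fix and fix not in fixes:
                fixes.append(fix)
        # Assumed heuristic: put the fix closest to the misspelling first.
        fixes.sort(key=lambda fix: -SequenceMatcher(None, wrong, fix).ratio())
        common[wrong] = fixes
    return common


common = build_common(bre, ame)
for wrong, fixes in common.items():
    print(f"{wrong}->{', '.join(fixes)}")
```

With this sketch, the plain s/z variants never appear as keys in the common map, so neither dialect is flagged for people who load only the base and common files.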
It might make sense to have a cleverer format for the localisation file in the case of AmE/BrE, to generate both of them, but being able to extend dictionaries would mean if you've got a mixed language codebase (e.g. first Google hit https://wiki.documentfoundation.org/Development/EasyHacks/Translation_Of_Comments ), or you have some domain specific terms it thinks are spelling errors, or are often misspelt, you could override them.
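As a rough sketch of the "load multiple files in order" idea, assuming each dictionary line has the form `misspelling->fix`: later sources simply overwrite earlier entries in one dict. The function name `load_dictionaries` and the sample entries are hypothetical and are not codespell's actual loader.

```python
def load_dictionaries(sources):
    """Merge 'misspelling->fix' lines; later sources override earlier ones."""
    merged = {}
    for lines in sources:
        for line in lines:
            line = line.strip()
            if "->" not in line:
                continue  # skip blank or malformed lines
            wrong, fix = line.split("->", 1)
            merged[wrong.strip().lower()] = fix.strip()
    return merged


# Illustrative stand-ins for the base, generated common, and BrE files.
base = ["teh->the"]
common = [
    "standarddised->standardised, standardized",
    "standarddized->standardized, standardised",
]
british = [
    "standarddised->standardised",
    "standarddized->standardised",
    "standardized->standardised",
]

generic_mode = load_dictionaries([base, common])
british_mode = load_dictionaries([base, common, british])
print(british_mode["standarddized"])
```

The same layering would let a project append its own file last to override domain-specific terms.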
@luzpaz for your list, almost anything with a z in it should be an s. Also favour/favor.
Scratchpad for ideas:
Rudimentary pre-test of a document by running some greps. For example:
grep initialise | wc -l
grep initialize | wc -l
Python regex pattern:
(re)?initialis(ed|er|es|e|ing|ation)
(re)?initializ(ed|er|es|e|ing|ation)
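A minimal way to try these two patterns from Python, counting how often each spelling family occurs in a piece of text. The `\b` word boundaries and the case-insensitive flag are additions for this sketch, and `count_spellings` and the sample sentence are made up for illustration.

```python
import re

# The two patterns from above, wrapped in word boundaries (an added assumption).
BRITISH = re.compile(r"\b(re)?initialis(ed|er|es|e|ing|ation)\b", re.IGNORECASE)
AMERICAN = re.compile(r"\b(re)?initializ(ed|er|es|e|ing|ation)\b", re.IGNORECASE)


def count_spellings(text):
    """Return (british_count, american_count) for the given text."""
    british = sum(1 for _ in BRITISH.finditer(text))
    american = sum(1 for _ in AMERICAN.finditer(text))
    return british, american


sample = "Initialise the cache, then reinitialize it; initialisation is lazy."
british_count, american_count = count_spellings(sample)
print(f"British: {british_count}, American: {american_count}")
```

Whichever count dominates gives a quick hint about which dialect a document already leans towards.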
This might be useful: https://github.com/vlajos/misspell-fixer/blob/master/dict/misspell-fixer-gb-to-us.dict
As a workaround, you can use the -L minimise or --ignore-words=FILE options. This is especially useful in conjunction with pre-commit:
- repo: https://github.com/codespell-project/codespell
  rev: v1.16.0
  hooks:
    - id: codespell
      args: [-L, "minimise"]
proveable->provable
These can all go into https://github.com/codespell-project/codespell/blob/master/codespell_lib/data/dictionary_en-GB_to_en-US.txt if anyone is feeling keen. Be sure to move them out of dictionary.txt if they're currently in there.
Core bits done in #1480, but some more complicated stuff left to be done in a separate PR.
Some projects decide to use British spelling over American. Can there be a flag to choose one of them exclusively?