codespell-project / codespell

check code for common misspellings
GNU General Public License v2.0

allow codespell to differentiate between British and American spelling #103

Open luzpaz opened 7 years ago

luzpaz commented 7 years ago

Some projects decide to use British spelling over American. Can there be a flag to choose one of them exclusively?

lucasdemarchi commented 7 years ago

Basically, it would mean using a different dictionary, or annotating the current one with markers for British/American corrections, then adding a flag to the tool to select the "dialect" you want. This doesn't exist in codespell right now, but it should be doable.

luzpaz commented 7 years ago

We can start a British word list in this thread and then, when the time comes, implement the logic mentioned in https://github.com/lucasdemarchi/codespell/issues/103#issuecomment-290755564. What would be the best way to go about this? Creating a gist? Or should we just start a new file in the repo?

~artefact->artifact~
~artefacts->artifacts~
~behaviour (behavior)~
~cancellation (cancelation)~
~cancelling (canceling)~
~cancelled (canceled)~
~capitalise (capitalize)~
~catalogue (catalog)~
~centimetre (centimeter)~
~centralise (centralize)~
~centre (center)~
~colour (color)~
~colours (colors)~
~digitise (digitize)~
~digitising (digitizing)~
~flavour (flavor)~
~flavours (flavors)~
~initialisation (initialization)~
~initialise (initialize)~
~initialised (initialized)~
~initialises (initializes)~
~initialising (initializing)~
~labelled (labeled)~
~labelling (labeling)~
~licence (license)~
~licenced (licensed)~
~minimise (minimize)~
~minimising (minimizing)~
~parametrise (parametrize)~
~prioritise (prioritize)~
~prioritising (prioritizing)~
~rasterise (rasterize)~
~realise (realize)~
~resizeable (resizable)~
~specialise (specialize)~
~unintialised (uninitialized)~
~utilise (utilize)~
~writeable (writable)~

Under development
All crossed out words above have been added to the following dictionaries:

peternewman commented 6 years ago

Would something slightly more complex be more flexible, and allow for more language support?

It strikes me there are a few different cases (I'm ignoring the obvious color/colour, as that has the challenge with e.g. HTML, where you can't pick, although perhaps that has to be solved using other codespell features):

In British mode, you want standardized->standardised.
In American mode, you want standardised->standardized.
standarddi[sz]ed and any other genuine misspellings want to map to the appropriate localised version.

In terms of the functionality already in codespell (from my brief understanding), it feels like a lot of this could be done if it could handle multiple dictionary files. Indeed, looking at https://github.com/lucasdemarchi/codespell/blob/master/codespell_lib/_codespell.py#L304, it's using a dict, so just loading multiple files in order should allow this.

Theory as follows: remove standarddi[sz]ed from the main dict; store standarddised->standardised, standarddized->standardised and standardized->standardised in BrE, and the zed version in AmE. Use some magic to generate a generic English one too, which merges both versions (but doesn't include standardized->standardised and standardised->standardized, I guess, when there's one fix and the fix is in the other dict as a reverse map).

People who aren't bothered load the base file (without standarddi[sz]ed) and the generated common one, so standarddised and standarddized would both suggest the s and z options (for bonus points, prioritise the most likely one, based on whether the misspelling had an s or a z). People who want a specific flavour load the base (optionally the common), but crucially their flavour last, which overrides, so standarddised and standarddized both correct to standardised.
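To make the load-order idea concrete, here is a minimal Python sketch. It is only an illustration under stated assumptions, not how codespell is implemented: `parse_dictionary`, `load_layered` and the inlined `BASE`, `COMMON` and `EN_GB` contents are all hypothetical, and the `misspelling->correction` line format is assumed from the entries discussed above.

```python
def parse_dictionary(text):
    """Parse 'misspelling->correction' lines into a dict (other lines are skipped)."""
    entries = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or "->" not in line:
            continue
        wrong, right = line.split("->", 1)
        entries[wrong.strip().lower()] = right.strip()
    return entries


def load_layered(*texts):
    """Merge dictionaries in order; an entry in a later one overrides earlier ones."""
    merged = {}
    for text in texts:
        merged.update(parse_dictionary(text))
    return merged


# Hypothetical file contents, inlined as strings to keep the sketch self-contained.
BASE = "recieve->receive\n"
COMMON = (
    "standarddised->standardised, standardized\n"
    "standarddized->standardized, standardised\n"
)
EN_GB = (
    "standarddised->standardised\n"
    "standarddized->standardised\n"
    "standardized->standardised\n"
)

generic = load_layered(BASE, COMMON)          # people who aren't bothered
british = load_layered(BASE, COMMON, EN_GB)   # chosen flavour loaded last

print(generic["standarddized"])  # both options, most likely one first
print(british["standarddized"])  # overridden by the en-GB layer
```

Because the flavour file is merged last, its entries win for any key it shares with the common file, while keys it does not mention (such as the base misspellings) pass through unchanged.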

It might make sense to have a cleverer format for the localisation file in the case of AmE/BrE, to generate both of them. But being able to extend dictionaries would also mean that if you've got a mixed-language codebase (e.g. first Google hit https://wiki.documentfoundation.org/Development/EasyHacks/Translation_Of_Comments ), or you have some domain-specific terms that it thinks are spelling errors or that are often misspelt, you could override them.

@luzpaz for your list, almost anything with a z in it should be an s. Also favour/favor.

luzpaz commented 6 years ago

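Scratchpad for ideas: do a rudimentary pre-test of a document by running some greps to see which spelling it already uses more. For example (FILE is a placeholder for the document being checked):

grep initialise FILE | wc -l
grep initialize FILE | wc -l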

Python regex pattern

(re)?initialis(ed|er|es|e|ing|ation)
(re)?initializ(ed|er|es|e|ing|ation)
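
As a rough sketch of how those two patterns could be used for the pre-test, the snippet below counts matches of each variant in a piece of text. The `count_variants` helper and the sample sentence are made up for illustration, and the added `\b` word boundaries and case-insensitive flag are assumptions on top of the patterns above.

```python
import re

# The two patterns from the scratchpad, with word boundaries added (an assumption).
BRITISH = re.compile(r"\b(re)?initialis(ed|er|es|e|ing|ation)\b", re.IGNORECASE)
AMERICAN = re.compile(r"\b(re)?initializ(ed|er|es|e|ing|ation)\b", re.IGNORECASE)


def count_variants(text):
    """Return (british_count, american_count) for the initialise/initialize family."""
    british = sum(1 for _ in BRITISH.finditer(text))
    american = sum(1 for _ in AMERICAN.finditer(text))
    return british, american


sample = "Initialise the colour map, then reinitialize it; initialisation is cheap."
print(count_variants(sample))  # (2, 1)
```

Whichever count is larger hints at the dialect the document already favours, which could then inform the choice of flag.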

luzpaz commented 6 years ago

https://en.wikipedia.org/wiki/American_and_British_English_spelling_differences#Doubled_consonants

EdwardBetts commented 6 years ago

This might be useful: https://github.com/vlajos/misspell-fixer/blob/master/dict/misspell-fixer-gb-to-us.dict

arm-in commented 5 years ago

https://github.com/codespell-project/codespell/pull/1110

luzpaz commented 5 years ago

More variants: http://www.future-perfect.co.uk/grammar-tip/is-it-targetted-or-targeted/

kierun commented 4 years ago

As a workaround, you can use the -L minimise or --ignore-words=FILE options. This is especially useful in conjunction with pre-commit:

-   repo: https://github.com/codespell-project/codespell
    rev: v1.16.0
    hooks:
    -   id: codespell
        args: [-L, "minimise"]

luzpaz commented 4 years ago

proveable->provable

peternewman commented 4 years ago

These can all go into https://github.com/codespell-project/codespell/blob/master/codespell_lib/data/dictionary_en-GB_to_en-US.txt if anyone is feeling keen. Be sure to move them out of dictionary.txt if they're currently in there.

peternewman commented 4 years ago

Core bits done in #1480, but some more complicated stuff left to be done in a separate PR.