codespell-project / codespell

check code for common misspellings
GNU General Public License v2.0

Validate typos in the dictionary against a dictionary of valid words #1140

Open peternewman opened 5 years ago

peternewman commented 5 years ago

We'll need to find a list of valid words from somewhere, but this keeps happening to varying degrees of detectability, e.g. https://github.com/codespell-project/codespell/pull/1014#discussion_r288251043

peternewman commented 5 years ago

In my scripts, before inserting an "a->b" pair into my list, I check it against the usual historical dictionaries you can find on Unix (like /usr/share/dict/american-english). But of course, there are a lot of specific (and not so specific...) words that are not included.

_Originally posted by @Gelma in https://github.com/codespell-project/codespell/pull/1014#discussion_r288406064_
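
A check of this kind could be sketched roughly as follows. This is only an illustration, not the script referred to above: the helper names `load_words` and `is_suspicious_pair` are hypothetical, and the word-list path is the one mentioned in the comment, which may not exist on every system.

```python
import os


def load_words(lines):
    """Build a lowercase set from an iterable of one-word-per-line strings."""
    return {line.strip().lower() for line in lines if line.strip()}


def is_suspicious_pair(pair, valid_words):
    """Return True if the 'typo' side of an "a->b" pair is itself a valid word."""
    typo, _sep, _correction = pair.partition("->")
    return typo.strip().lower() in valid_words


if __name__ == "__main__":
    # Assumption: a system word list lives at this path (taken from the comment).
    words_path = "/usr/share/dict/american-english"
    if os.path.exists(words_path):
        with open(words_path, encoding="utf-8", errors="replace") as handle:
            valid = load_words(handle)
        for pair in ["calender->calendar", "seledted->selected"]:
            status = "suspicious" if is_suspicious_pair(pair, valid) else "ok"
            print(pair, status)
```

A pair flagged as suspicious is not necessarily wrong; it only means the "typo" is a real word in that particular list and deserves a second look.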

sebweb3r commented 4 years ago

If you have aspell installed, you can dump the aspell word list:

aspell -d en dump master | aspell -l en expand > words.en.txt

I checked it (the probably outdated version from Debian stable) with codespell. The result:

words.en.txt:6357: Aline ==> Align
words.en.txt:12212: Ines ==> Lines
words.en.txt:18891: OD ==> OF
words.en.txt:18895: Oder ==> Order, odor
words.en.txt:18899: OT ==> TO, OF, OR
words.en.txt:21359: Thur ==> Their
words.en.txt:21631: Thant ==> Than
words.en.txt:22099: BA ==> BY, BE
words.en.txt:22100: Ba ==> By, be
words.en.txt:26781: Bridget ==> Bridged
words.en.txt:37392: Handel ==> Handle
words.en.txt:43699: Claus ==> Clause
words.en.txt:49978: Capetown ==> Cape town
words.en.txt:58942: Leary ==> Leery
words.en.txt:59428: LSAT ==> LAST
words.en.txt:60627: Muhammadan ==> Muslim
words.en.txt:60629: Mohammedans ==> Muslims
words.en.txt:67337: Noe ==> Not, no, node, know, now
words.en.txt:69667: ND ==> AND, 2ND
words.en.txt:69668: Nd ==> And, 2nd
words.en.txt:69671: Ned ==> Need
words.en.txt:90954: Somme ==> Some
words.en.txt:95817: Sade ==> Sad
words.en.txt:99166: Te ==> The, be
words.en.txt:103710: Donn ==> Done, don
words.en.txt:108316: Tuscon ==> Tucson
words.en.txt:112621: Weill ==> Will
words.en.txt:114694: Waring ==> Warning
words.en.txt:117705: Chanel ==> Channel
words.en.txt:124114: Parana ==> Piranha

sebweb3r commented 4 years ago

With hunspell and en_GB, dumping the word list with

unmunch /usr/share/hunspell/en_GB.dic /usr/share/hunspell/en_GB.aff > words.en.hunspell.GB.txt

and checking it with codespell results in:

algebraical ==> algebraic
alls ==> all, falls
Alway ==> Always
amened ==> amended, amend
anonyms ==> anonymous
Appling ==> Applying, appalling
arbitral ==> arbitrary
Aske ==> Ask
aspected ==> expected
Asser ==> Assert
ba ==> by, be
BA ==> BY, BE
Ba ==> By, be
Bacup ==> Backup
BEng ==> being
Berkley ==> Berkeley
bion ==> bio
BrE ==> be, brie
Bridget ==> Bridged
brose ==> browse, rose
cacheing ==> caching
caesarian ==> caesarean
calender ==> calendar
calenders ==> calendars
Cann ==> Can
cannister ==> canister
cannisters ==> canisters
canonicalizations ==> canonicalization
Chanel ==> Channel
charas ==> chars
Claus ==> Clause
co-ordinate ==> coordinate
co-ordinates ==> coordinates
Commerical ==> Commercial
complier ==> compiler
compliers ==> compilers
connexion ==> connection
contiguities ==> continuities
convertor ==> converter
convertors ==> converters
Corse ==> Course
Cound ==> Could, count
decompresser ==> decompressor
delink ==> unlink
Delting ==> Deleting
delusionally ==> delusively
demographical ==> demographic
Depden ==> Depend
despatch ==> dispatch
dessicate ==> desiccate
dessication ==> desiccation
dessicated ==> desiccated
digitalise ==> digitize
digitalising ==> digitizing
digitalize ==> digitize
digitalizing ==> digitizing
discernable ==> discernible
drats ==> drafts
earlies ==> earliest
easer ==> easier, eraser
Ede ==> Edge
Effient ==> Efficient
equipments ==> equipment
extraversion ==> extroversion
extravert ==> extrovert
extraverts ==> extroverts
fightings ==> fighting
Flagg ==> Flag
floatation ==> flotation
focussed ==> focused
refocussed ==> refocused
focussed ==> focused
focusses ==> focuses
informations ==> information
formate ==> format
formates ==> formats
informations ==> information
Frome ==> From
funguses ==> fungi
Gardai ==> Gardaí
geometrician ==> geometer
Guatamala ==> Guatemala
Hald ==> Held
hander ==> handler
Handel ==> Handle
happing ==> happening, happen
Harth ==> Hearth
heathy ==> healthy
heigh ==> height, high
homogenous ==> homogeneous
Humber ==> Number
incidently ==> incidentally
incudes ==> includes
infectuous ==> infectious
intension ==> intention
internation ==> international
interpolar ==> interpolator
interpretor ==> interpreter
invokable ==> invocable
keypair ==> key pair
keypairs ==> key pairs
keyserver ==> key server
keyservers ==> key servers
Leary ==> Leery
leat ==> lead, leak, least, leaf
leats ==> least
Mabe ==> Maybe
Mata ==> Meta, mater
meeds ==> needs
miniscule ==> minuscule
Monserrat ==> Montserrat
Muhammadan ==> Muslim
commutating ==> commuting
commutated ==> commuted
Nd ==> And, 2nd
Ned ==> Need
ned ==> need
Noth ==> North
OD ==> OF
Oder ==> Order, odor
ons ==> owns
OT ==> TO, OF, OR
overrideable ==> overridable
patten ==> pattern, patent
pattens ==> patterns, patents
Pattens ==> Patterns, patents
penality ==> penalty
Pennal ==> Panel
compliers ==> compilers
Poer ==> Power
Pont ==> Point
Ponting ==> Pointing
pre-empt ==> preempt
precent ==> percent, prescient
quitted ==> quit
raison ==> reason, raisin
readapted ==> re-adapted
recommand ==> recommend
recommanded ==> recommended
recommands ==> recommends
reoccurrence ==> recurrence
revaluated ==> reevaluated
scaleability ==> scalability
scaleable ==> scalable
setted ==> set
Skelton ==> Skeleton
Smoot ==> Smooth
Somme ==> Some
Sowe ==> Sow, so we
squirl ==> squirrel
Stoer ==> Store
targetting ==> targeting
targetted ==> targeted
Te ==> The, be
Tey ==> They
Thant ==> Than
this'd ==> this would
Thur ==> Their
Toi ==> To, toy
trigged ==> triggered
Tring ==> Trying, string, ring
Troup ==> Troupe
unmistakeably ==> unmistakably
Varian ==> Variant
Vermillion ==> Vermilion
Wass ==> Was
Wil ==> Will, well
Winn ==> Win
Worser ==> Worse
worthing ==> worth, meriting
Worthing ==> Worth, meriting

peternewman commented 4 years ago

Thanks for this, @sebweb3r. As you may have seen, we got some core stuff in via #1142. I'm not quite sure what my "other examples need checking" comment meant with regard to not closing this issue then.

I'm a bit unclear which way round your checks were done. Is this codespell run against aspell's and hunspell's dictionaries?

Also

-seledted->sekected
+seledted->selected

Originally posted by @peternewman in https://github.com/codespell-project/codespell/pull/1619

sebweb3r commented 4 years ago

Sorry for not being precise. I dumped the aspell and hunspell dictionaries, then checked the dumps with codespell.

So all of these lines are words that are "wrong" according to codespell but exist in aspell or hunspell.

But I'm not sure whether one wants to delete all of these corrections.
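
To review such a report systematically, the flagged words can be pulled out of the output lines. The sketch below assumes the line shapes quoted earlier in this thread (`file:line: Word ==> Suggestion, suggestion` or just `Word ==> Suggestion`); codespell's actual output format may differ between versions, and the function names are hypothetical.

```python
def parse_report_line(line):
    """Split one report line into (flagged_word, [suggestions]); None if no arrow."""
    left, sep, right = line.partition(" ==> ")
    if not sep:
        return None
    # Drop an optional "file:lineno: " prefix in front of the flagged word.
    word = left.rsplit(": ", 1)[-1].strip()
    suggestions = [item.strip() for item in right.split(",") if item.strip()]
    return word, suggestions


def flagged_words(lines):
    """Collect the unique flagged words (lowercased) from many report lines."""
    found = set()
    for line in lines:
        parsed = parse_report_line(line)
        if parsed is not None:
            found.add(parsed[0].lower())
    return sorted(found)


if __name__ == "__main__":
    sample = [
        "words.en.txt:18895: Oder ==> Order, odor",
        "algebraical ==> algebraic",
        "Ba ==> By, be",
        "ba ==> by, be",
    ]
    for word in flagged_words(sample):
        print(word)
```

Lowercasing merges case variants such as "Ba" and "ba", which keeps the list of candidates to review short.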

sebweb3r commented 4 years ago

I haven't seen #1142 yet, but I will have a closer look.

lurch commented 4 years ago

Going the other way around, and running aspell against codespell's correct-words (generated with cat dictionary.txt | cut -d'>' -f2 | sort | uniq > codespell_corrections.txt) also suggests:

(And possibly many more? I got bored of checking... :wink: (as there are many entries in codespell that aspell doesn't recognise, but a Google search suggests are still spelled correctly) )

If codespell is going to suggest corrections, those corrections ought to be spelled correctly :grinning:
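
The shell pipeline above keeps each right-hand side as a single string, so entries with several suggestions stay joined together. A rough Python equivalent that splits them into individual corrections might look like this. It assumes entries shaped like `typo->correction` or `typo->first, second,`; any free-text note after the last comma would be treated as just another correction here. The function name is hypothetical.

```python
def extract_corrections(lines):
    """Return the sorted, de-duplicated corrections from "typo->a, b," lines."""
    corrections = set()
    for line in lines:
        _typo, sep, rhs = line.strip().partition("->")
        if not sep:
            continue  # not a dictionary entry
        for candidate in rhs.split(","):
            candidate = candidate.strip()
            if candidate:
                corrections.add(candidate)
    return sorted(corrections)


if __name__ == "__main__":
    sample = ["seledted->selected", "Oder->Order, odor,", "", "not an entry"]
    for correction in extract_corrections(sample):
        print(correction)
```

The resulting list can then be written one word per line and fed to a spell checker, as with codespell_corrections.txt above.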

sebweb3r commented 4 years ago

@lurch that's why I never let spellcheckers automatically fix the errors. I basically checked the correct spellings in #1624 already (against aspell-enUS dict) ;-) I added some of your suggestions.

acknowledgment depends on enUS or enGB #1623 (One of the physics journals insists on using the variant without e. But they have both spellings on their introductions webpage :-) )

lurch commented 4 years ago

> acknowledgment depends on enUS or enGB

Oops, I didn't realise that it had multiple spellings (like color / colour), sorry!

> I added some of your suggestions.

Cool :+1:

peternewman commented 4 years ago

So I added some checking, but to go further we either need #1485, to have a larger dictionary and fewer false positives, or we need to split the main and rare dictionaries into corrections that are in the word list and those that aren't, so we can prioritise checking the non-dictionary words more carefully. Currently the corrections themselves aren't checked, as lots of valid technical terms aren't in the aspell word list.

DimitriPapadopoulos commented 1 year ago

@peternewman Words not in the aspell dictionary can be added after https://github.com/codespell-project/codespell/pull/2933. Such words need to be whitelisted because some specialised words will always be missing from aspell or other dictionaries, no matter how large the dictionary is. Can we close this issue?
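
The split discussed earlier in this thread, combined with an allowlist for words a spell-check dictionary lacks, could be sketched as below. This is not how codespell or the linked pull request implements it; the function name, the `allowlist` parameter, and the simple `typo->a, b,` entry parsing are all assumptions made for illustration.

```python
def split_by_known_corrections(entries, known_words, allowlist=frozenset()):
    """Partition entries by whether every correction is a known or allowed word."""
    accepted = {w.lower() for w in known_words} | {w.lower() for w in allowlist}
    known, unknown = [], []
    for entry in entries:
        _typo, sep, rhs = entry.partition("->")
        if not sep:
            continue  # skip lines that are not dictionary entries
        corrections = [c.strip().lower() for c in rhs.split(",") if c.strip()]
        if corrections and all(c in accepted for c in corrections):
            known.append(entry)
        else:
            unknown.append(entry)  # needs manual review
    return known, unknown


if __name__ == "__main__":
    entries = ["calender->calendar", "keypair->key pair", "interpretor->interpreter"]
    words = {"calendar", "interpreter"}
    known, unknown = split_by_known_corrections(entries, words)
    print("known:", known)
    print("needs review:", unknown)
```

Entries that land in the second list are the ones worth checking by hand first; adding a term to the allowlist moves its entries into the first list on the next run.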