en-wl / wordlist

SCOWL (and friends).
http://wordlist.aspell.net

"unit" stems to "un" + "it" #359

Open klardotsh opened 1 year ago

klardotsh commented 1 year ago

Hi there! At Zulip we recently chased down a search-related issue (https://github.com/zulip/zulip/pull/23903, also discussed in our public Zulip instance) and traced it to how Hunspell stems the word "unit": as un + it, whereas "redo", for example, stems to the full word. The expected behavior is that "unit" is recognized as its own word. This matters in our case because "it" is a text search stopword in PostgreSQL, so once "unit" is stemmed to "it", the term is treated as a stopword. More details on how this surfaces are in that Zulip topic.
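To make the failure mode concrete, here is a minimal sketch of a stem-then-filter indexing step. It is a toy model only: the `STEMS` table and `STOPWORDS` set are illustrative stand-ins, not real Hunspell or PostgreSQL data, and `index_terms` is a hypothetical helper, not anything from Zulip or PostgreSQL.

```python
# Toy model of a search indexer: stem each token, then drop stopwords.
# STEMS and STOPWORDS are illustrative assumptions, not real dictionary data.
STEMS = {"unit": "it", "redo": "redo"}
STOPWORDS = {"it", "the", "a"}


def index_terms(text: str) -> list:
    """Return the terms that survive stemming and stopword removal."""
    terms = []
    for token in text.lower().split():
        stem = STEMS.get(token, token)  # unknown words pass through unchanged
        if stem not in STOPWORDS:
            terms.append(stem)
    return terms


print(index_terms("redo the unit test"))  # ['redo', 'test']
```

In this model "unit" never reaches the index, because its stem is filtered out as a stopword, while "redo" is indexed normally.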

The issue seems to originate in this wordlist: I pulled it down, built the Hunspell dictionaries as described in the README, and could reproduce the "unit" stemming issue:

(woods) speller  » master » hunspell -d en_US -D
SEARCH PATH:
.::/usr/share/hunspell:/usr/share/myspell:/usr/share/myspell/dicts:/usr/share/mozilla-dicts:/home/j/.config/libreoffice/4/user/wordbook:/usr/lib/libreoffice/share/wordbook
AVAILABLE DICTIONARIES (path is not mandatory for -d option):
./en_GB-ise
./en_GB-large
./en_US-large
./en_AU
./en_GB-ize
./en_US
./en_AU-large
./en_CA-large
./en_CA
LOADED DICTIONARY:
./en_US.aff
./en_US.dic

(woods) speller  » master » echo unit | hunspell -s -d en_US   
unit it

(woods) speller  » master » echo redo | hunspell -s -d en_US
redo redo
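
To check for other words affected in the same way, the `hunspell -s` output could be scanned for stems that land on a stopword. The following is a rough sketch, assuming the output looks like the transcript above (one `word stem` pair per line, possibly with blank lines in between). `parse_stems`, `suspicious`, and the one-word stopword set are hypothetical names and data introduced here for illustration.

```python
def parse_stems(output: str) -> dict:
    """Map each word to the set of stems reported for it."""
    stems = {}
    for line in output.splitlines():
        parts = line.split()
        if not parts:
            continue  # skip blank separator lines
        word = parts[0]
        # A line holding only the word (no stem column) adds no stems.
        stems.setdefault(word, set()).update(parts[1:])
    return stems


def suspicious(stems: dict, stopwords: set) -> list:
    """Return the words that have at least one stem in the stopword set."""
    return sorted(word for word, found in stems.items() if found & stopwords)


# Sample text copied from the two runs above; the stopword set is an assumption.
sample = "unit it\n\nredo redo\n"
print(suspicious(parse_stems(sample), {"it"}))  # ['unit']
```

The sample is hard-coded so the sketch runs without Hunspell installed. In practice the text would come from piping a word list through `hunspell -s -d en_US`.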