biolab / orange3-text

🍊 📄 Text Mining add-on for Orange3

Bag of Words stores occurrences of 'hci' in corpus as 'hcus' in sparse data (or is lemmatizer to blame?) #1041

Closed by wvdvegte 8 months ago

wvdvegte commented 8 months ago

Describe the bug: For documents in the corpus that contain the string 'hci' (short for human-computer interaction), Bag of Words stores 'hci' as 'hcus' in its sparse-matrix representation.

To Reproduce: See the attached workflow. The dataset is shared through a Google Drive link.

Expected behavior: 'hci' should be kept as 'hci' in the sparse data. Could the lemmatizer be automatically converting Latin plurals ending in '-i' to singulars ending in '-us' (such as nuclei -> nucleus)?

Orange version: 3.36.2.

Text add-on version: 1.15.0

Operating system: macOS 14.3.1

Example workflow: hcus bug.ows.zip

ajdapretnar commented 8 months ago

Almost 100% certain it is the fault of the lemmatizer.
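
For illustration only, here is a minimal sketch of how a suffix rule of the kind suspected above could turn 'hci' into 'hcus'. This is a toy, not the lemmatizer that Orange or the Text add-on actually uses: `toy_lemmatize`, `SUFFIX_RULES`, and the `protected` parameter are hypothetical names made up for this example, and the real lemmatizer may work quite differently.

```python
# Toy suffix-rule lemmatizer. NOT the lemmatizer used by Orange or orange3-text;
# every name here is hypothetical and exists only to illustrate the suspicion.

# Assumed rule: Latin plural '-i' becomes singular '-us' (nuclei -> nucleus).
SUFFIX_RULES = [("i", "us")]


def toy_lemmatize(token, protected=frozenset()):
    """Apply the first matching suffix rule unless the token is protected."""
    if token in protected:
        return token
    for suffix, replacement in SUFFIX_RULES:
        # The rule fires on any token ending in the suffix, with no
        # dictionary check, which is why an acronym can be mangled.
        if token.endswith(suffix) and len(token) > len(suffix):
            return token[: -len(suffix)] + replacement
    return token


print(toy_lemmatize("nuclei"))                  # nucleus
print(toy_lemmatize("hci"))                     # hcus
print(toy_lemmatize("hci", protected={"hci"}))  # hci
```

If the real cause is similar, 'hci' should survive when lemmatization is switched off in preprocessing, which would point to the lemmatizer rather than to Bag of Words.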