case_sensitive=True gives unexpected results

nschloe commented 2 years ago

In a case-sensitive dictionary, I would expect 'FBI' to be known and 'fbi' to be unknown. However, both cases give me 'fbi' as known:

from spellchecker import SpellChecker

spell = SpellChecker(case_sensitive=True)

print(spell.known(["FBI"]))
print(spell.known(["fbi"]))

{'fbi'}
{'fbi'}

akhmerov commented 2 years ago

That's because the language is set (en by default), and case_sensitive is ignored if language is set (as per the docstring).

nschloe commented 2 years ago

Thanks for the reply! Is there a way to get

{'FBI'}
{}

from the above code at all? (If language=None, both seem to be ignored.)

barrust commented 2 years ago

That is likely because you didn't add a dictionary. What dictionary did you add?

Can you try something like this? It should work, but I am not somewhere I can run it myself, so I can't verify that it is free of typos!

from spellchecker import SpellChecker

spell = SpellChecker(language=None, case_sensitive=True) 
spell.word_frequency.add("FBI") 

print("FBI" in spell)
print("fbi" in spell)

nschloe commented 2 years ago

Ah wait, when using language=None, it actually only spellchecks words that I put in manually? That's not good enough for me. Is there no way to use a case-sensitive English dictionary?

barrust commented 2 years ago

To use a case-sensitive dictionary, you will need to build it yourself, as the default dictionaries are case-insensitive. There are lots of ways to build dictionaries, and they do not have to be manual. I only added the word by hand to make sure there wasn't a bug. You can find the different ways to build a custom dictionary in the documentation on building a new dictionary or in GitHub Discussion #90.

Either way, there are reasons why the default dictionaries are not capitalized:

It reduces the number of characters that have to be tried when generating all the candidate edits of a word.
It reduces the number of words to check (The vs. the), since the same word may be capitalized simply because it is the first word in a sentence.
The library does not take into account the type of word (entity, verb, adverb, etc.) and thus cannot determine whether the word being checked should be capitalized or not.
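
To make the first two points concrete, here is a rough sketch of how the alphabet size affects candidate generation. It is a generic Norvig-style "one edit away" function written for illustration, not the library's actual code; the `edits1` name and the `alphabet` parameter are assumptions of this sketch.

```python
import string


def edits1(word, alphabet):
    """Return every string one edit away from `word`, using letters from `alphabet`."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [left + right[1:] for left, right in splits if right]
    transposes = [
        left + right[1] + right[0] + right[2:]
        for left, right in splits
        if len(right) > 1
    ]
    replaces = [left + c + right[1:] for left, right in splits if right for c in alphabet]
    inserts = [left + c + right for left, right in splits for c in alphabet]
    return set(deletes + transposes + replaces + inserts)


# Same word, two alphabets: lowercase only vs. lowercase plus uppercase.
lower_only = edits1("the", string.ascii_lowercase)
mixed_case = edits1("the", string.ascii_letters)
print(len(lower_only), len(mixed_case))
```

Doubling the alphabet roughly doubles the replace and insert candidates at distance one, and the growth compounds at distance two. The mixed-case set also contains "The", so a case-sensitive dictionary would need both spellings stored to treat them as the same word.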

Just some thoughts on it; good luck!

nschloe commented 2 years ago

Thanks for the info!

barrust / pyspellchecker

case_sensitive=True gives unexpected results #122