filyp / autocorrect

Spelling corrector in python
GNU Lesser General Public License v3.0

Detection of an omitted space #26

Open fingoldo opened 3 years ago

fingoldo commented 3 years ago

Thanks for this wonderful lib!

Could you add functionality to detect accidentally merged words, that is, cases where the whitespace separating two words was omitted?

from autocorrect import Speller

# Build an English speller and list the scored candidates for each word.
spellEn = Speller('en')
[spellEn.get_candidates(lemma) for lemma in ['test', 'project', 'testproject']]

# Output:
# [[(495684, 'test')], [(1628175, 'project')], [(0, 'testproject')]]

It would be cool if 'testproject' could produce the correct candidates, 'test' and 'project'. How hard would it be to add such a feature?

filyp commented 3 years ago

Hi! It would complicate the logic a bit, but it's possible. It would require adding a function that generates these splits in https://github.com/fsondej/autocorrect/blob/master/autocorrect/typos.py, and assigning scores to those splits in https://github.com/fsondej/autocorrect/blob/master/autocorrect/__init__.py, for example as min(score_word1, score_word2).
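A minimal sketch of the idea, independent of the library's internals: it enumerates every two-way split of a word and scores each split as the minimum of the two parts' frequencies. The names generate_splits, score_split and split_candidates are hypothetical, and WORD_FREQ is a toy table (its two counts are the ones shown in the output earlier in this thread), not autocorrect's actual data structure.

```python
from typing import Dict, List, Tuple

# Toy frequency table (assumption): stands in for the speller's word counts.
WORD_FREQ: Dict[str, int] = {"test": 495684, "project": 1628175}


def generate_splits(word: str) -> List[Tuple[str, str]]:
    """Return every way to cut `word` into two non-empty parts."""
    return [(word[:i], word[i:]) for i in range(1, len(word))]


def score_split(left: str, right: str, freq: Dict[str, int]) -> int:
    """Score a split as min(score_word1, score_word2); unknown words score 0."""
    return min(freq.get(left, 0), freq.get(right, 0))


def split_candidates(word: str, freq: Dict[str, int]) -> List[Tuple[int, str]]:
    """Return (score, 'left right') pairs with a non-zero score, best first."""
    scored = [
        (score_split(left, right, freq), f"{left} {right}")
        for left, right in generate_splits(word)
    ]
    return sorted((c for c in scored if c[0] > 0), reverse=True)


print(split_candidates("testproject", WORD_FREQ))
# [(495684, 'test project')]
```

Taking the minimum means a split is only as plausible as its rarer half, so a split containing an unknown fragment scores 0 and is dropped.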

Also, I fear that this splitting would happen too often, for example ashe -> as he instead of ashes, or anso -> an so instead of also. This would require some calibration, for example downscoring short words, which further complicates things. Switching off double-typo correction might also be necessary when using these splits. I don't have time to add this feature, but I would happily merge a PR with it if the score in the tests increases.
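One way the calibration could look, as a rough sketch only: multiply a split's score by a penalty for each part shorter than some length. The function split_score, the threshold short_len, the penalty value and the frequencies in FREQ are all hypothetical, chosen just to show the ashe example flipping; real values would have to be tuned against the test score.

```python
from typing import Dict

# Hypothetical frequencies, for illustration only (not autocorrect's data).
FREQ: Dict[str, int] = {"as": 5_000_000, "he": 4_000_000, "ashes": 20_000}


def split_score(
    left: str,
    right: str,
    freq: Dict[str, int],
    short_len: int = 3,      # parts shorter than this are penalised (assumption)
    penalty: float = 0.001,  # multiplier applied per short part (assumption)
) -> float:
    """min() of the two word scores, downscored once per short part."""
    score = float(min(freq.get(left, 0), freq.get(right, 0)))
    for part in (left, right):
        if len(part) < short_len:
            score *= penalty
    return score


raw = min(FREQ["as"], FREQ["he"])
calibrated = split_score("as", "he", FREQ)

# Without calibration the split "as he" outscores the single word "ashes".
print(raw > FREQ["ashes"])         # True
# With the short-word penalty, "ashes" wins again.
print(calibrated > FREQ["ashes"])  # False
```

Splits made of longer words, such as test + project, have no short parts and keep their full min() score under this scheme.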