code-kern-ai / bricks

Open-source natural language enrichments at your fingertips.
Apache License 2.0

[MODULE] - OCR autocorrection #270

Status: Open. jhoetter opened this issue 1 year ago

jhoetter commented 1 year ago

Please describe the module you would like to add to bricks
Turn OCR output such as "t3st docunnent" into "test document".

Do you already have an implementation?
No, but statistical distributions could be relevant for this. Another option is to check whether each word occurs in a vocabulary (e.g. via nltk) and, if it does not, to replace it with the vocabulary word that has the smallest Levenshtein distance.

Additional context
I tried this many years ago already, and there are simple but effective approaches for this.
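
The vocabulary lookup with a Levenshtein fallback described above could look roughly like the sketch below. It is not an existing bricks module. The names `levenshtein`, `correct_word`, `correct_text` and the toy `VOCABULARY` are hypothetical, and a real vocabulary would come from a proper word list (for example an nltk corpus) rather than a hard-coded set.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance counting insertions, deletions and substitutions."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(
                min(
                    previous[j] + 1,         # delete char_a
                    current[j - 1] + 1,      # insert char_b
                    previous[j - 1] + cost,  # substitute (or keep if equal)
                )
            )
        previous = current
    return previous[-1]


# Toy vocabulary, assumed for illustration only.
VOCABULARY = {"test", "document", "text", "rest", "tent"}


def correct_word(word: str, vocabulary: set) -> str:
    """Keep known words; otherwise return the closest vocabulary word."""
    lowered = word.lower()
    if lowered in vocabulary:
        return word
    # sorted() makes the choice deterministic when distances tie.
    return min(sorted(vocabulary), key=lambda cand: levenshtein(lowered, cand))


def correct_text(text: str, vocabulary: set) -> str:
    """Correct a whitespace-separated text word by word."""
    return " ".join(correct_word(word, vocabulary) for word in text.split())


print(correct_text("t3st docunnent", VOCABULARY))  # test document
```

Comparing every unknown word against the whole vocabulary is slow for large word lists, so this only illustrates the idea; it does not handle punctuation or casing of corrected words.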

LeonardPuettmannKern commented 1 year ago

Interesting idea. Do you already have some resources on how to tackle this?

jhoetter commented 1 year ago

Only what I've written above :D

rasdani commented 1 year ago

pyspellchecker implements this idea.

Just skimmed these; it seems to use only algorithms, no models.

EDIT: It does in fact use Levenshtein distance.
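
For reference, a minimal sketch of how pyspellchecker might be wired up for this. It assumes the package is installed (`pip install pyspellchecker`) and uses its `SpellChecker` class with the `unknown` and `correction` methods. The function name `autocorrect_english` and the whitespace tokenization are assumptions made for illustration, and how well this handles OCR-specific errors has not been checked here.

```python
def autocorrect_english(text: str) -> str:
    """Replace words unknown to pyspellchecker with its best suggestion."""
    # Imported lazily so this sketch can be loaded without the optional
    # dependency; calling the function requires pyspellchecker.
    from spellchecker import SpellChecker

    spell = SpellChecker()  # uses the English word list by default
    corrected = []
    for token in text.split():
        if spell.unknown([token]):
            # correction() may return None if no candidate is found,
            # in which case the original token is kept.
            corrected.append(spell.correction(token) or token)
        else:
            corrected.append(token)
    return " ".join(corrected)
```

A call would look like `autocorrect_english("t3st docunnent")`; which suggestion comes back depends on the library's word frequencies.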

jhoetter commented 1 year ago

Nice, that is great. I don't know yet whether pyspellchecker supports other languages such as German, but I guess we can at least use it for English. For German, Swedish, French, ... I think an approach using the language's vocabulary would be helpful.
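
A language-specific vocabulary approach could be sketched as follows, combining the edit-distance idea with word frequencies as a tie-breaker. Everything here is an assumption for illustration: the tiny inline corpora, the `VOCABULARIES` mapping and the function names are hypothetical, and real vocabularies would be built from large per-language corpora.

```python
import re
from collections import Counter


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via row-by-row dynamic programming."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(
                min(previous[j] + 1, current[j - 1] + 1, previous[j - 1] + cost)
            )
        previous = current
    return previous[-1]


def build_vocabulary(corpus: str) -> Counter:
    """Count lowercased words; the pattern matches Unicode letters only."""
    return Counter(re.findall(r"[^\W\d_]+", corpus.lower()))


# Toy corpora standing in for real per-language text collections.
VOCABULARIES = {
    "de": build_vocabulary("das dokument ist ein test und das dokument ist kurz"),
    "sv": build_vocabulary("detta dokument är ett test"),
}


def correct_word(word: str, vocabulary: Counter) -> str:
    """Return the closest known word, preferring frequent words on ties."""
    lowered = word.lower()
    if lowered in vocabulary:
        return lowered
    return min(
        vocabulary,
        key=lambda cand: (edit_distance(lowered, cand), -vocabulary[cand]),
    )


def correct_text(text: str, language: str) -> str:
    """Correct each whitespace-separated word using the language's vocabulary."""
    vocabulary = VOCABULARIES[language]
    return " ".join(correct_word(word, vocabulary) for word in text.split())


print(correct_text("t3st dokunnent", "de"))  # test dokument
```

Sorting candidates by distance first and frequency second is one simple way to bring the "statistical distribution" idea from the issue description into the lookup; other weightings are possible.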