charlesw / tesseract

A .Net wrapper for tesseract-ocr
Apache License 2.0
2.26k stars 742 forks

Validation of words #312

Open jasmina94 opened 7 years ago

jasmina94 commented 7 years ago

Hi everyone! I'm working on a word recognition project based on the Tesseract engine. So far, everything works fine. The idea is a kind of snipping tool that lets the user capture an image containing text. After the image is processed, the text is recognized, and now I want to validate it somehow. Is there a way to validate the text that the engine returns? I know there is the Levenshtein distance algorithm, but I don't know how to use it. I would compare the words Tesseract returns one by one, but I don't know what to compare them with. Please help if you have any ideas for solving my problem.

Thank you :)

Sicos1977 commented 7 years ago

What do you mean by validating the words? Whether they are spelled correctly, or whether they are real words?

jasmina94 commented 7 years ago

Whether the spelling is correct. But if there is something for checking whether it is a real word, that would also be helpful. @Sicos1977

Sicos1977 commented 7 years ago

You can use NHunspell for that: https://www.nuget.org/packages/NHunspell/

jasmina94 commented 7 years ago

Thank you

ermGit commented 7 years ago

Check out Lucene.Net (the .NET implementation of Lucene) for fuzzy searching. You could then compare the result against a dictionary file that has been preloaded into Lucene.Net.

There are other fuzzy search services you can use too, such as Elasticsearch, but you may not want to incur the time penalty of API calls. If your program is in Java or .NET, you can include the Lucene library directly in your program.
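
For a small word list, the same idea (fuzzy-matching each OCR word against a preloaded dictionary) can be tried in memory without Lucene, using the Levenshtein distance mentioned in the original question. The following is only a minimal sketch under stated assumptions: `WordValidator`, `ClosestWord`, and the `maxDistance` threshold are hypothetical names chosen for illustration and are not part of Tesseract, Lucene.Net, or NHunspell.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper, for illustration only.
public static class WordValidator
{
    // Levenshtein distance: the minimum number of single-character
    // insertions, deletions, and substitutions needed to turn a into b.
    public static int Levenshtein(string a, string b)
    {
        var prev = new int[b.Length + 1];
        var curr = new int[b.Length + 1];
        for (int j = 0; j <= b.Length; j++) prev[j] = j;

        for (int i = 1; i <= a.Length; i++)
        {
            curr[0] = i;
            for (int j = 1; j <= b.Length; j++)
            {
                int cost = a[i - 1] == b[j - 1] ? 0 : 1;
                curr[j] = Math.Min(
                    Math.Min(curr[j - 1] + 1, prev[j] + 1),
                    prev[j - 1] + cost);
            }
            // Reuse the two rows instead of allocating a full matrix.
            var tmp = prev; prev = curr; curr = tmp;
        }
        return prev[b.Length];
    }

    // Returns the dictionary word closest to ocrWord (case-insensitive),
    // or null if no word is within maxDistance edits.
    public static string ClosestWord(
        string ocrWord, IEnumerable<string> dictionary, int maxDistance = 2)
    {
        string best = null;
        int bestDistance = maxDistance + 1;
        string needle = ocrWord.ToLowerInvariant();
        foreach (string candidate in dictionary)
        {
            int d = Levenshtein(needle, candidate.ToLowerInvariant());
            if (d < bestDistance) { best = candidate; bestDistance = d; }
        }
        return best;
    }
}
```

Each word returned by Tesseract would be passed to `ClosestWord` together with the word list. A result equal to the input means the word is in the list, a different result is a likely correction, and `null` means nothing close was found. This scans the whole list for every word, so for large dictionaries an index such as Lucene.Net is probably the better fit.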

