Open zuphilip opened 4 years ago
Keep it simple. I think it would be sufficient to have a user option (similar to the tesseract path option) for the language / script which is preset to eng (the default language which is always installed). The user would be responsible for installing and selecting the right models, otherwise Tesseract would simply fail with an error.
Latin (or script/Latin, depending on your installation) is a good choice for all texts based on Latin script. Some users might need Cyrillic, Greek, Arabic or other scripts. The user option would also allow setting Latin+Greek+Arabic, for example, so I see no need to ask each time.
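To make the idea of a single free-form option concrete, here is a minimal sketch of how such a value could be turned into a Tesseract command line. The helper name `build_tesseract_args`, its parameters, and the character check are assumptions made for illustration; they are not taken from the extension. The only Tesseract feature relied on is the `-l` option, which accepts several models joined with `+`.

```python
import re

# Hypothetical helper, not the extension's actual code.
# Allowed characters are an assumption that covers names such as
# "eng", "chi_sim" and "script/Latin".
_MODEL_NAME = re.compile(r"^[A-Za-z0-9_/-]+$")


def build_tesseract_args(image, output_base, lang="eng", tesseract_path="tesseract"):
    """Return the argument list for one Tesseract run with the given language option."""
    parts = [p.strip() for p in lang.split("+")]
    # Reject empty segments ("deu+") and unexpected characters early.
    if not all(parts) or not all(_MODEL_NAME.match(p) for p in parts):
        raise ValueError(f"invalid language option: {lang!r}")
    return [tesseract_path, image, output_base, "-l", "+".join(parts)]
```

With this sketch, the option value `Latin+Greek+Arabic` is passed through unchanged after `-l`, and the default stays at `eng`. Whether the named models are actually installed is not checked here; Tesseract itself would report that.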
Regarding 1.
For Unix systems it would probably be enough to just run the command
tesseract --list-langs > /path/to/file.txt
to print all the available languages to a file.
If this works fine, one could implement a dropdown menu to select the language. I think that would be enough.
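As a rough sketch of how the file written by that command could feed a dropdown menu, the following reads the saved output and returns the model names. It assumes the output consists of one header line starting with "List of available languages" followed by one model name per line; the exact header wording differs between Tesseract versions, so that check is an assumption. The function names are hypothetical.

```python
def parse_list_langs(output):
    """Extract model names from text printed by `tesseract --list-langs`."""
    langs = []
    for line in output.splitlines():
        name = line.strip()
        # Skip blank lines and the header line (assumed prefix, see above).
        if not name or name.lower().startswith("list of available"):
            continue
        langs.append(name)
    return sorted(langs)


def read_langs_file(path):
    """Read a file created with `tesseract --list-langs > path` and parse it."""
    with open(path, encoding="utf-8") as handle:
        return parse_list_langs(handle.read())
```

The returned list could be used directly as the entries of the dropdown. Some Tesseract versions may print the list to stderr rather than stdout, in which case the redirect in the command above would need to capture that stream as well.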
A simple solution, a free text box in the new preferences as @stweil suggested, is now implemented.
I am aware of the Tesseract command to show all available languages, but I don't see a way to call it from Zotero and save its output somewhere. But yes, we could create a file with something like this.
Let us wait a little longer and see how well the simple solution already works in practice.
I had a related problem: I am used to typing "de" rather than "deu" in similar cases (which I should have verified by running "tesseract --list-langs", of course), so it took me quite a long time to find the solution, also because the system sadly doesn't show any error message in that case. A dropdown box (or simply more examples) would have helped a lot!
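A mistake like "de" instead of "deu" could be caught before OCR starts by comparing the option against the installed models. The following is only a sketch under the assumption that a list of installed model names is available (for example from the output of tesseract --list-langs); the function name `check_languages` and the message wording are made up for illustration. Suggestions come from Python's standard `difflib.get_close_matches`.

```python
import difflib


def check_languages(option, installed):
    """Return a list of problems with the language option; empty means it is usable."""
    problems = []
    for name in option.split("+"):
        if name in installed:
            continue
        # Offer similar installed names, e.g. "deu" for a mistyped "de".
        hints = difflib.get_close_matches(name, installed, n=3, cutoff=0.6)
        message = f"model {name!r} is not installed"
        if hints:
            message += f"; did you mean {', '.join(hints)}?"
        problems.append(message)
    return problems
```

Showing these messages in the preferences would give the user feedback that is currently missing, without requiring a full dropdown.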
Currently, we use a fixed language such as deu or eng for OCR with Tesseract. But in a lot of cases it is even better to choose script/Latin, or for old texts script/Fraktur. Other languages or scripts should also be available to choose from.
There are several things to consider here:
1. We can call tesseract --list-langs from the extension, but we cannot access the output or pipe it somewhere from Zotero. Should we just ship a one-liner script (a shell script for Linux/Mac and a bat file for Windows) that calls the command above and pipes its output to a file, which we can then analyze? Other ideas?
2. The simple choice would be the deu model for German texts and the eng model for English texts. However, this might not always be that simple. For example, for older German texts one should maybe use the script/Fraktur model instead, and even the script/Latin model is quite often better for texts that include names in foreign languages etc.
CC @stweil @luerhard