UB-Mannheim / zotero-ocr

Zotero Plugin for OCR
GNU Affero General Public License v3.0
Determine available languages and provide a choice for them #8

Open zuphilip opened 4 years ago

zuphilip commented 4 years ago

Currently, we use a fixed language such as deu or eng for OCR with Tesseract. In many cases, though, script/Latin is the better choice, or script/Fraktur for old texts. Other languages and scripts should also be available to choose from.

There are several things to consider here:

  1. How can we find out which languages are available for the currently installed Tesseract? It is possible to run commands like tesseract --list-langs from the extension, but we cannot access the output or pipe it anywhere from within Zotero. Should we just ship a one-liner script (a shell script for Linux/macOS and a bat file for Windows) that calls the command above and pipes its output to a file, which we can then analyze? Other ideas?
  2. We could have some general options and define a standard model there. In the settings pane you could then change this model depending on the languages you have installed (see 1.).
  3. We could analyze the language field of each Zotero entry to choose a different model. This would allow, for example, using the deu model for German texts and the eng model for English texts. However, it might not always be that simple. For older German texts one should perhaps use the script/Fraktur model instead, and even the script/Latin model is quite often better for texts that include names in foreign languages etc.
  4. Maybe it is better to ask before each call which language to choose. Then you can manually select all the entries that can be recognized with the same language. Moreover, one could possibly offer some more Tesseract options to toggle on/off. What do you think?
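
To make idea 3 more concrete, here is a minimal Python sketch of choosing a model from an entry's language field. The mapping table, the function name `choose_model`, and the fallback to `script/Latin` are assumptions made for illustration and are not part of the plugin. The Zotero language field is free text, so real entries may need more normalization than shown here.

```python
# Hypothetical mapping from two-letter codes often found in Zotero's
# language field to Tesseract model names (illustrative, not exhaustive).
ISO_TO_TESSERACT = {"de": "deu", "en": "eng", "fr": "fra"}


def choose_model(language_field, available, fallback="script/Latin"):
    """Pick a Tesseract model for one Zotero entry.

    language_field: raw value of the entry's language field, e.g. "de-DE".
    available: set of model names installed locally.
    fallback: model used when no language-specific model fits (assumption).
    """
    # Reduce values like "de-DE" or "EN_us" to a bare two-letter code.
    code = language_field.strip().lower().replace("_", "-").split("-")[0]
    model = ISO_TO_TESSERACT.get(code)
    if model in available:
        return model
    return fallback


print(choose_model("de-DE", {"deu", "eng", "script/Latin"}))
print(choose_model("", {"deu", "eng", "script/Latin"}))
```

A lookup like this cannot tell that an older German text needs script/Fraktur, which is the limitation raised in item 3, so a manual override would probably still be needed.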

CC @stweil @luerhard

stweil commented 4 years ago

Keep it simple. I think it would be sufficient to have a user option (similar to the tesseract path option) for the language / script which is preset to eng (the default language which is always installed). The user would be responsible for installing and selecting the right models, otherwise Tesseract would simply fail with an error.

Latin (or script/Latin, depending on your installation) is a good choice for all texts based on Latin script. Some users might need Cyrillic, Greek, Arabic or other scripts. The user option would also allow setting Latin+Greek+Arabic, for example, so I see no need to ask each time.
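
As a rough illustration of how such a combined setting reaches Tesseract, the sketch below builds the argument list for one run. The helper name `build_tesseract_command` and the file names are hypothetical. The only Tesseract behaviour relied on is that the `-l` flag accepts several models joined by `+`.

```python
def build_tesseract_command(image_path, output_base, languages, tesseract="tesseract"):
    """Return the argument list for one OCR run.

    languages: list of model names, e.g. ["Latin", "Greek", "Arabic"].
    Several models are joined with "+" and passed after the -l flag.
    """
    if not languages:
        languages = ["eng"]  # the preset suggested in this comment
    return [tesseract, image_path, output_base, "-l", "+".join(languages)]


cmd = build_tesseract_command("page.png", "page", ["Latin", "Greek", "Arabic"])
print(cmd)
# To actually run it (requires Tesseract and the models to be installed):
#   import subprocess
#   subprocess.run(cmd, check=True)
```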

luerhard commented 4 years ago

Regarding 1.: on Unix systems it would probably be enough to run the command tesseract --list-langs > /path/to/file.txt to write all the available languages to a file.

If this works fine, one could implement a dropdown menu to select the language. I think that would be enough.
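
A file written this way could be turned into dropdown entries with something like the following Python sketch. It assumes the output consists of a header line ending in a colon followed by one model name per line. The function name `parse_list_langs` and the sample text are made up for illustration.

```python
def parse_list_langs(text):
    """Extract model names from `tesseract --list-langs` output.

    Assumes a header line ending in ":" followed by one name per line.
    """
    names = []
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and the header ("List of available languages ...:").
        if not line or line.endswith(":"):
            continue
        names.append(line)
    return sorted(names)


sample = "List of available languages (3):\ndeu\neng\nscript/Latin\n"
print(parse_list_langs(sample))
```

For a real file, the same function would be applied to its contents, for example `parse_list_langs(open("/path/to/file.txt").read())`.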

zuphilip commented 4 years ago

A simple solution, a free-text box in the new preferences as @stweil suggested, is now implemented.

I am aware of the Tesseract command that lists all available languages, but I don't see a way to call it from Zotero and save its output somewhere. But yes, we could create a file with something like this.

Let us wait a little longer and see how well the simple solution works in practice.

zettelberg commented 3 months ago

I have had a related problem: being accustomed to typing "de" rather than "deu" in similar cases (which I should have verified by trying "tesseract --list-langs", of course), it took me quite a long time to find the solution, also because the system sadly doesn't show any error message in that case. A dropdown box (or simply more examples!) would have helped a lot!
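
One possible way to catch a mistake like "de" instead of "deu" is to check the setting against the installed models before running OCR. The Python sketch below is only an illustration of that idea. The function name `check_language_setting` and the message wording are hypothetical, and the set of installed models is assumed to be known already.

```python
import difflib


def check_language_setting(setting, available):
    """Return a list of problems for a setting such as "deu+eng".

    An empty list means every part names an installed model.
    """
    problems = []
    for part in setting.split("+"):
        part = part.strip()
        if part in available:
            continue
        # Suggest the closest installed model name, if any is similar enough.
        hint = difflib.get_close_matches(part, sorted(available), n=1)
        if hint:
            problems.append(f'"{part}" is not installed; did you mean "{hint[0]}"?')
        else:
            problems.append(f'"{part}" is not installed')
    return problems


print(check_language_setting("de", {"deu", "eng", "osd"}))
```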