cisocrgroup / ocrd_cis

OCR-D python tools
MIT License

no correction with ocrd-cis-postcorrect #51

Open EEngl52 opened 4 years ago

EEngl52 commented 4 years ago

I'm running ocrd-cis-postcorrect on the aligned OCR output of Calamari and Tesserocr. So far, the output seems to be completely identical to the input, even though there are quite a few differences between the results of the two OCR engines. See e.g. the attached example: postcorrect.zip

How can I achieve some correction results?

finkf commented 4 years ago

Thanks for reporting. I am having a look.

finkf commented 4 years ago

It appears that both files are line-segmented. The post-correction needs word-segmented input. You could try setting the OCR to output word segments (as well as line segments).

EEngl52 commented 4 years ago

thanks for your quick reply! I'll try it again with word segments and report back

EEngl52 commented 4 years ago

I finally tried ocrd-cis-postcorrect again, this time with two OCR results from Tesseract and Calamari, both segmented on word level (and aligned beforehand). Unfortunately I now run into an error (see attachment), and no output files are produced at all.

stderr.txt

finkf commented 4 years ago

From a quick glance I suspect problems with the profiling. Can you rerun the same command with --log-level DEBUG? I'll take a closer look later today.

EEngl52 commented 4 years ago

thx a lot for your quick reply! there's the log file

stderr.txt

finkf commented 4 years ago

In order to run our post-correction, both our profiler and a matching language backend have to be installed on the system. The configuration variable profilerPath (which would more appropriately be named profilerCommand) must point to the profiler executable, and the profilerConfig variable must point to the corresponding language configuration file. There is a manual for the profiler and the language backend in our repositories.

The other way is to use Docker: the profiler is installed by this project's Dockerfile. You can execute the following steps to build and test the Docker image:

$ cd path/to/ocrd_cis                            # Change into ocrd_cis directory.
$ sudo docker build -t ocrd_cis .                # Build the ocrd_cis docker image (this will take some time).
$ sudo docker run ocrd_cis /apps/profiler --help # Check the profiler command in the image.
$ echo 'Theyle' | sudo docker run -i ocrd_cis /apps/profiler \
  --config /etc/profiler/languages/german.ini \
  --sourceFormat TXT --sourceFile /dev/stdin --simpleOutput

Then you can write a shell script that executes sudo docker run -i ocrd_cis /apps/profiler "$@", set profilerPath to this script, and set profilerConfig to e.g. /etc/profiler/languages/german.ini (a language configuration file within the Docker image).
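A minimal sketch of such a wrapper script, assuming the image was built and tagged ocrd_cis as in the steps above. The file name profiler.sh is only an illustration and not something ocrd_cis prescribes. Note that -i is an option of docker run, so it goes after run, and that "$@" is quoted so arguments containing spaces are forwarded intact.

```shell
# Create a wrapper script in the current directory.
# The name profiler.sh is hypothetical; pick any name you like.
cat > profiler.sh <<'EOF'
#!/bin/sh
# Forward all arguments to the profiler inside the ocrd_cis image.
# -i keeps stdin open so that input piped to this script reaches the profiler.
exec sudo docker run -i ocrd_cis /apps/profiler "$@"
EOF

# Make the wrapper executable.
chmod +x profiler.sh
```

profilerPath would then be set to the path of profiler.sh (presumably an absolute path is safest). profilerConfig stays a path inside the image, such as /etc/profiler/languages/german.ini, because the profiler reads it from within the container.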

The third option is to run the post-correction directly from the built Docker image. I see that these points are not very clear in the documentation for the post-correction. I will improve the documentation to make the configuration of the profiler clearer.

finkf commented 4 years ago

And I forgot to mention that the error you are getting is due to a bad profiler configuration.

EEngl52 commented 4 years ago

thanks for your help! I'm using a native installation of ocrd_all and assumed that it included everything I need to run ocrd-cis-postcorrect (except the model). But then I guess I still need to install profiler and language backend

finkf commented 4 years ago

If you use a native installation, you need to install the profiler as well. I have little experience with Python's installation setup, but maybe it is possible to install the profiler alongside ocrd_cis. Maybe @kba can help here.