KBNLresearch / ochre

Toolbox for OCR post-correction
Apache License 2.0

Using ochre to evaluate synthetic OCR post-processing dataset generation #4

Open omrishsu opened 6 years ago

omrishsu commented 6 years ago

Hi, I'm working on a method for synthetically generating OCR post-processing datasets. I think ochre could be a great project for benchmarking different datasets and evaluating which is better. The evaluation method I had in mind: create one evaluation dataset and several synthetic datasets, train ochre's model on each synthetic dataset, use each trained model to correct the (very different) evaluation dataset, and see which one corrects it best, based on CER and WER metrics.
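To make the comparison concrete, here is a minimal Python sketch of the metrics I mean; the function names and the plain Levenshtein distance are just illustration, not ochre's actual implementation:

# Character/word error rate via edit distance (illustrative sketch).

def levenshtein(a, b):
    """Edit distance between two sequences via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def cer(gold, corrected):
    """Character error rate: edit distance over gold-standard length."""
    return levenshtein(gold, corrected) / len(gold)

def wer(gold, corrected):
    """Word error rate: same idea over token sequences."""
    g, c = gold.split(), corrected.split()
    return levenshtein(g, c) / len(g)

print(cer('the quick fox', 'th3 quick fox'))  # 0.0769...
print(wer('the quick fox', 'th3 quick fox'))  # 0.3333...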

Here is one of my datasets (random errors based on some texts from Project Gutenberg): https://drive.google.com/open?id=1TUd3M7StziFibGGLbpSth_wb1ZfE2DmI And here is the evaluation dataset (a clean version of the ICDAR 2017 dataset): https://drive.google.com/open?id=1zyIKlErr_Aho5UQgTXzJukRZCcZX2MiY (13 files)
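Something like the sketch below illustrates the idea behind "random errors"; the error types and the 3% rate are illustrative assumptions, not the exact procedure behind the linked dataset:

# Inject random character-level noise into clean text (illustrative sketch).
import random
import string

def corrupt(text, error_rate=0.03, seed=42):
    """Apply random substitutions, deletions and insertions at a fixed
    per-character rate, returning an OCR-like noisy version of text."""
    rng = random.Random(seed)
    out = []
    for ch in text:
        if rng.random() >= error_rate:
            out.append(ch)
            continue
        op = rng.choice(['substitute', 'delete', 'insert'])
        if op == 'substitute':
            out.append(rng.choice(string.ascii_lowercase))
        elif op == 'insert':
            out.append(ch)
            out.append(rng.choice(string.ascii_lowercase))
        # 'delete' appends nothing
    return ''.join(out)

clean = 'It was the best of times, it was the worst of times.'
print(corrupt(clean))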

My problem is that I'm not much of a Python developer (more of a Java developer) and I'm not familiar with CWL. I was wondering if you plan to provide more documentation and a how-to for this project? And could you add this scenario to your examples?

Thanks! Omri

jvdzwaan commented 6 years ago

Sorry for my late reply; I have been busy with other projects. I think ochre could be useful for you. It uses an existing tool to calculate WER and CER: https://github.com/impactcentre/ocrevalUAtion, with one small change: the default limit of 10,000 characters is removed (by default, ocrevalUAtion only calculates WER and CER for files containing at most 10,000 characters).

If you just want to try the ocrevalUAtion tool, have a look at https://hub.docker.com/r/nlppln/ocrevaluation-docker/

After installing docker, you can run it with:

docker run -i --rm -v=/path/to/data/:/data/ nlppln/ocrevaluation-docker java -cp /ocrevalUAtion/target/ocrevaluation.jar eu.digitisation.Main -o /data/out.html -gt /data/gs/gs-file.txt -ocr /data/ocr/ocr-file.txt

This assumes that /path/to/data contains two folders: one called gs with the gold standard files, and one called ocr with the OCR files. The result is a new file in /path/to/data called out.html containing (among other things) the WER and CER.
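If you want to run that over a whole directory from Python rather than CWL, a loop like the following works; it assumes matching filenames in the gs/ and ocr/ folders and simply reuses the docker invocation above:

# Run ocrevalUAtion for every gs/ocr file pair (illustrative sketch;
# /path/to/data is a placeholder for your own data directory).
import subprocess
from pathlib import Path

data = Path('/path/to/data')
for gs_file in sorted((data / 'gs').glob('*.txt')):
    ocr_file = data / 'ocr' / gs_file.name
    report = data / ('out-' + gs_file.stem + '.html')
    subprocess.run([
        'docker', 'run', '-i', '--rm', '-v', f'{data}:/data',
        'nlppln/ocrevaluation-docker',
        'java', '-cp', '/ocrevalUAtion/target/ocrevaluation.jar',
        'eu.digitisation.Main',
        '-o', f'/data/{report.name}',
        '-gt', f'/data/gs/{gs_file.name}',
        '-ocr', f'/data/ocr/{ocr_file.name}',
    ], check=True)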

The workflow lets you run the tool for a directory of files, extracts the WER and CER, and puts them in a CSV file. (I see that I haven't committed the ocrevaluation workflow just yet.)
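In the meantime, a rough stand-in for that last step could look like the sketch below; the regex over the HTML report is a guess at its layout, not a documented format:

# Collect CER/WER from the HTML reports into one CSV (illustrative
# sketch; assumes each report mentions CER and WER next to a number).
import csv
import re
from pathlib import Path

rows = []
for report in sorted(Path('/path/to/data').glob('out-*.html')):
    html = report.read_text(encoding='utf-8')
    cer = re.search(r'CER[^0-9]*([\d.]+)', html)
    wer = re.search(r'WER[^0-9]*([\d.]+)', html)
    if cer and wer:
        rows.append([report.name, cer.group(1), wer.group(1)])

with open('performance.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['report', 'cer', 'wer'])
    writer.writerows(rows)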

I am planning to add the workflow and more documentation, but don't know exactly when I'll have time.

Your project sounds cool and useful for the work I am doing. I'll try to update the documentation soon!

jvdzwaan commented 6 years ago

So, I have updated the documentation and added the workflows for calculating performance. You don't need a lot of Python knowledge, just follow the installation instructions and adjust the paths in the cwltool commands.

Let me know if you run into problems!

omrishsu commented 6 years ago

Thanks a lot! Now I understand the folder structure, and I've updated my code to work with ochre's structure. I'll try the workflows as soon as possible and report back on my progress.

Thanks again, Omri