KBNLresearch / ochre

Toolbox for OCR post-correction
Apache License 2.0

print error - ICDAR2017_shared_task_workflows.ipynb #16

Open thiagopx opened 4 years ago

thiagopx commented 4 years ago

Hi guys,

I suggest changing `print wf.list_steps()` to `print(wf.list_steps())` in the notebook ICDAR2017_shared_task_workflows.ipynb.
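
For reference, the corrected cell would look like this (a minimal sketch; `wf` is assumed to be the workflow object created earlier in the notebook):

```python
# Python 3: print is a function, not a statement.
# wf is assumed to be the workflow object created earlier in the notebook.
print(wf.list_steps())
```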

Also, I was not able to run `cwltool ochre/cwl/ICDAR2017_shared_task_workflows`. This is what I got: `ochre/cwl/vudnc-preprocess-pack.cwl: error: argument --archive is required`

jvdzwaan commented 4 years ago

Thanks! The signature of `wf.list_steps()` changed, so, yes, you should do `print(wf.list_steps())`.

Please note that the workflow is about preprocessing the vudnc data; it has nothing to do with the ICDAR 2017 shared task. Also, I do not recommend using the vudnc data, because it is very noisy. But if you do want to preprocess it anyway, you should run

cwltool ochre/cwl/vudnc-preprocess-pack.cwl --archive path/to/vudnc/archive
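
For context: a CWL workflow input with no default value becomes a required command-line argument when the workflow is run with cwltool, which is where the "argument --archive is required" error comes from. A hypothetical sketch of such a declaration (not the actual contents of vudnc-preprocess-pack.cwl):

```yaml
# Hypothetical sketch, not the actual vudnc-preprocess-pack.cwl.
cwlVersion: v1.0
class: Workflow
inputs:
  archive:
    type: File   # no default value, so cwltool requires --archive
outputs: []
steps: []
```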
thiagopx commented 4 years ago

> Thanks! The signature of `wf.list_steps()` changed, so, yes, you should do `print(wf.list_steps())`.
>
> Please note that the workflow is about preprocessing the vudnc data; it has nothing to do with the ICDAR 2017 shared task. Also, I do not recommend using the vudnc data, because it is very noisy. But if you do want to preprocess it anyway, you should run
>
> cwltool ochre/cwl/vudnc-preprocess-pack.cwl --archive path/to/vudnc/archive

You are correct. I meant that I was not able to run vudnc-preprocess-pack.cwl.

For good results in English, do you recommend using the English monograph partition of ICDAR? I trained on the monograph and periodical partitions separately, but the validation accuracy and loss were not good (and neither were the tests I ran).

I would like to help with some additional documentation to improve reproducibility, but I need a roadmap for how to get significant results (mainly for English documents).

jvdzwaan commented 4 years ago

Unfortunately, ochre is not (yet) fit for training good OCR post-correction models. I plan to work on it in the future, but only as a hobby project, so no promises there!

Generally speaking, OCR post-correction datasets are small. That's why I'm compiling a list of them, so they can be combined for better generalization. I don't think training on the English monograph data will give you a model that works on other data, because OCR errors tend to depend on the time period, the font, the OCR software used, etc.