KBNLresearch / ochre

Toolbox for OCR post-correction
Apache License 2.0

Permanent failure with VU recipe #12

Open hadiasheri opened 6 years ago

hadiasheri commented 6 years ago

Hi, I'm trying to run the code with the VU DNC dataset. The link you provided didn't work, so I downloaded it from here. Now, when I run vudnc-preprocess.cwl as follows:

in_dir="/home/dataset/VU/FoLiACMDI"
ocr_dir_name="/home/dataset/VU/Preprocess/ocr"
gs_dir_name="/home/dataset/VU/Preprocess/gs"
aligned_dir_name="/home/dataset/VU/Preprocess/aligned"
tmp_dir="/home/ochre/vu-tmp/"
tmp_dir_out="/home/ochre/vu-tmp-out/"
cachedir="/home/ochre/cachedir/"
align_m="align_m.csv"
align_c="align_c.csv"
ocr_n="ocr_n.csv"
gs_n="gs_n.csv"

cwltool |cwl-runner ochre/cwl/vudnc-preprocess.cwl --in_dir $in_dir --ocr_dir_name $ocr_dir_name --gs_dir_name $gs_dir_name --aligned_dir_name $aligned_dir_name --ocr_n $ocr_n --gs_n $gs_n --align_m $align_m --align_c $align_c

However, it fails permanently with the following message:

[step merge-json] Cannot make job: Value for file:///home/ochre/ochre/cwl/align-texts-wf.cwl#merge-json/in_files not specified

[workflow align-texts-wf] completed permanentFail

I'd be grateful if you could help me figure out the problem. Thanks, H

jvdzwaan commented 6 years ago

I think the workflow fails because of changes to nlppln. I'll try to see if I can fix that later.

Also, I really recommend using a different dataset than the vudnc corpus. It is just too noisy. Here is a poster that shows that the most common error is a hyphenation error ('- ' that should be replaced with ''), which is just too easy to correct: https://doi.org/10.5281/zenodo.1189245
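
To illustrate how trivial that correction is, here is a minimal sketch of the '- ' to '' replacement. The function name and the sample string are made up for illustration; this is not code from ochre.

```python
def remove_hyphenation(text):
    """Naively undo line-break hyphenation by deleting every '- ' sequence.

    Hypothetical helper for illustration only; it also removes legitimate
    occurrences of '- ' (for example in 'a - b').
    """
    return text.replace('- ', '')


# Made-up OCR output containing two hyphenation errors.
sample = 'een voor- beeld van een hyphen- ation fout'
print(remove_hyphenation(sample))
```

A single string replacement already handles this error type, so a corpus dominated by it says little about how well a post-correction model handles harder OCR errors.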

jvdzwaan commented 6 years ago

Okay, it should work again. Be careful to read the updated documentation in the README. Also, don't forget to update nlppln.

For future reference, this is the relevant commit: 9ee6d7cca72bb9bcd074e1843b12ceea122662ce