LanguageMachines / PICCL

A set of workflows for corpus building through OCR, post-correction and normalisation

ocr.nf cannot find FoLiA-hocr output files #30

Closed peterdekker closed 6 years ago

peterdekker commented 6 years ago

When running ocr.nf, the workflow expects FoLiA-hocr to name each output file after the basename of the original file, followed by *.folia.xml: https://github.com/LanguageMachines/PICCL/blob/master/ocr.nf#L229

However, since the fix for issue https://github.com/LanguageMachines/foliautils/issues/21, FoLiA-hocr prepends "id-" to filenames that start with a number: https://github.com/LanguageMachines/foliautils/commit/6af7fa48c3fbe40540145dda6229f5f8878efaec

As a result, ocr.nf can no longer find the files that FoLiA-hocr outputs.

Not all files get the "id-" prefix from FoLiA-hocr, only the ones whose names start with a number. So one solution could be to make ocr.nf look for a broader output pattern. Alternatively, FoLiA-hocr could add the "id-" prefix to all files, and ocr.nf could always look for this prefix.
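To illustrate the "broader output pattern" idea, here is a minimal Python sketch rather than actual ocr.nf code. The function names `expected_output_name` and `find_output` are hypothetical, and the prefix rule (only names starting with an ASCII digit get "id-") is an assumption based on the behaviour described in this thread, not on the FoLiA-hocr source.

```python
from pathlib import Path
from typing import Optional

ASCII_DIGITS = set("0123456789")


def expected_output_name(basename: str) -> str:
    """Predict the output name for a given input basename.

    Assumption (from this thread): "id-" is prepended only when the
    basename starts with a digit.
    """
    prefix = "id-" if basename[:1] in ASCII_DIGITS else ""
    return f"{prefix}{basename}.folia.xml"


def find_output(outdir: Path, basename: str) -> Optional[Path]:
    """Look for the output file under both possible names.

    Trying both names avoids depending on the exact prefix rule.
    """
    for name in (f"{basename}.folia.xml", f"id-{basename}.folia.xml"):
        candidate = outdir / name
        if candidate.is_file():
            return candidate
    return None
```

Checking both names keeps the lookup working whether or not the prefix was applied, at the cost of hard-coding the "id-" string in two places.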

@kosloot @proycon

proycon commented 6 years ago

Ouch, this is indeed a regression introduced by that fix.

@kosloot We do need consistent, predictable naming from FoLiA-hocr, otherwise PICCL doesn't know what to look for.

kosloot commented 6 years ago

I am not sure that it is up to FoLiA-hocr to fix all the PICCL/Nextflow issues :p, but OK. More on this here: https://github.com/LanguageMachines/foliautils/issues/21