Closed peterdekker closed 5 years ago
The current solution is indeed rather patchy and not sufficient: sometimes 'empty' hOCR files get fed in that won't produce a FoLiA file. It looks as if Nextflow produces an empty output file in that case, which is obviously not valid FoLiA, and FoLiA-correct stumbles on it. I'll add an extra check that weeds out those zero-byte files before FoLiA-correct runs (still not very elegant, though).
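For illustration only, a zero-byte check of the kind described above could look roughly like the following. This is a minimal Python sketch, not the actual PICCL/Nextflow code; the function name `drop_empty_files` is hypothetical.

```python
from pathlib import Path
from typing import Iterable, List


def drop_empty_files(paths: Iterable[Path]) -> List[Path]:
    """Keep only paths that are regular files with at least one byte.

    Hypothetical helper: zero-byte files (and paths that do not exist)
    are filtered out so they never reach the next pipeline step.
    """
    kept = []
    for path in paths:
        if path.is_file() and path.stat().st_size > 0:
            kept.append(path)
    return kept
```

Note that this only filters on size; a non-empty file that is still invalid FoLiA would pass through.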
Should be solved in v0.8.0
I ran the PICCL workflow on a number of page images from a book (356411-356419.tif from CirculaireBriefFranseNatie). For pages without text, no FoLiA file is created in the OCR step. That file is then missing in subsequent steps, ultimately leading to a segmentation fault.
See here for the error.log: https://pastebin.ubuntu.com/p/kK2Q9YFPJy/

I think that ideally, OCR output should be generated for empty pages as well. Alternatively, subsequent steps should be able to work with missing files.
@JessedeDoes
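
To make the "missing files" case concrete, here is a rough Python sketch that reports which page images have no usable OCR output before the later steps run. It is not part of PICCL; the function name `pages_without_output`, the directory layout, and the `.folia.xml` output suffix are all assumptions made for the example.

```python
from pathlib import Path
from typing import List


def pages_without_output(
    image_dir: Path,
    folia_dir: Path,
    image_suffix: str = ".tif",          # assumed input extension
    folia_suffix: str = ".folia.xml",    # assumed output naming
) -> List[str]:
    """Return the base names of images with no non-empty output file.

    An output counts as missing if the expected file does not exist
    or has a size of zero bytes.
    """
    missing = []
    for image in sorted(image_dir.glob(f"*{image_suffix}")):
        expected = folia_dir / (image.stem + folia_suffix)
        if not expected.is_file() or expected.stat().st_size == 0:
            missing.append(image.stem)
    return missing
```

A workflow could log the returned names and skip those pages, rather than passing a missing or empty file to the next tool.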