LanguageMachines / PICCL

A set of workflows for corpus building through OCR, post-correction and normalisation

OCR does not generate output for empty pages, crashes #49

Closed peterdekker closed 5 years ago

peterdekker commented 5 years ago

I ran the PICCL workflow on a number of page images from a book (356411-356419.tif from CirculaireBriefFranseNatie). For pages without text, no FoLiA file is created in the OCR step. That file is then missing in subsequent steps, which ultimately leads to a segmentation fault.

See here for the error.log: https://pastebin.ubuntu.com/p/kK2Q9YFPJy/

I think that ideally, OCR output should be generated for empty pages as well. Alternatively, subsequent steps should be able to work with missing files.

@JessedeDoes

proycon commented 5 years ago

The current solution is indeed rather patchy and insufficient: sometimes 'empty' hOCR files get fed in that won't produce a FoLiA file. It looks as if Nextflow produces an empty output file in that case, which is obviously not valid FoLiA, and FoLiA-correct stumbles on it. I'll add an extra check that weeds out those zero-byte files before FoLiA-correct (still not very elegant, though).
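
For illustration only, a minimal Python sketch of what such a zero-byte check could look like. This is not the actual PICCL change (PICCL is built on Nextflow, so the real fix presumably lives in the workflow itself); the function name `drop_zero_byte_files` and the example file names are hypothetical.

```python
from pathlib import Path


def drop_zero_byte_files(paths):
    """Split paths into (usable, skipped).

    A path is 'usable' only if it is an existing regular file with a
    size greater than zero; missing and zero-byte files are 'skipped'.
    Hypothetical helper, not part of PICCL.
    """
    usable, skipped = [], []
    for p in map(Path, paths):
        if p.is_file() and p.stat().st_size > 0:
            usable.append(p)
        else:
            skipped.append(p)
    return usable, skipped


if __name__ == "__main__":
    import sys

    # Usage: python check.py page1.folia.xml page2.folia.xml ...
    usable, skipped = drop_zero_byte_files(sys.argv[1:])
    for p in skipped:
        print(f"skipping empty or missing file: {p}", file=sys.stderr)
    for p in usable:
        print(p)
```

Only the paths printed on standard output would then be passed on to the next step. Note that a size check alone does not guarantee valid FoLiA; it only filters out the empty files described above.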

proycon commented 5 years ago

Should be solved in v0.8.0