LanguageMachines / PICCL

A set of workflows for corpus building through OCR, post-correction and normalisation

Forwarders do not show up for some file formats #53

Closed peterdekker closed 4 years ago

peterdekker commented 5 years ago

For some input file formats in the PICCL web interface (folia.xml and tif), the file forwarders do not show up. Also, the template and format columns are empty (screenshot: https://imgur.com/2B1yADI). For plaintext input, the forwarders do show up.

proycon commented 5 years ago

Can you send me your input and error.log? Then I can see whether I can reproduce and fix it.

peterdekker commented 5 years ago

I uploaded the tif input file and the output archive in this folder: https://ivdnt.box.com/s/kowkr9fbzngy5azs23c7drdoql12hf1f

proycon commented 5 years ago

Thanks. For the tif file (img_356416.tif), it seems some part of the process unexpectedly strips the numeric suffix, so the filenames end up differing from what CLAM expected. I'll look for the culprit...

proycon commented 5 years ago

It seems to happen right at the beginning, in the OCR part.

proycon commented 5 years ago

Correction: this is an explicit thing in the pipeline (which I completely forgot about). So technically it's a feature instead of a bug ;) Whenever a user uploads images in the format $documentname_$sequencenr.$extension, they are automatically recombined, which is not a bad idea in itself.
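
To make the recombination concrete, here is a minimal sketch of how files following the $documentname_$sequencenr.$extension convention could be grouped by document. It is not PICCL's actual code: the regular expression, the list of extensions and the function name `group_by_document` are assumptions made for illustration.

```python
import re
from collections import defaultdict

# Hypothetical pattern for $documentname_$sequencenr.$extension.
# PICCL's real matching rules may differ.
SEQUENCE_PATTERN = re.compile(
    r"^(?P<doc>.+)_(?P<seq>\d+)\.(?P<ext>tif|jpg|png|gif)$"
)


def group_by_document(filenames):
    """Group image files by document name, ordering pages by sequence number."""
    groups = defaultdict(list)
    for name in filenames:
        match = SEQUENCE_PATTERN.match(name)
        if match:
            # Numeric suffix found: treat the file as one page of a document.
            groups[match.group("doc")].append((int(match.group("seq")), name))
        else:
            # No numeric suffix: the file is its own single-page document.
            stem = name.rsplit(".", 1)[0]
            groups[stem].append((0, name))
    return {doc: [n for _, n in sorted(pages)] for doc, pages in groups.items()}
```

Under these assumptions, `group_by_document(["img_356416.tif"])` files the single upload under the document name `img`, which is how a lone file with a numeric suffix can end up with a different name than the one it was uploaded with.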

The problem, however, is that recombination that happens inside the pipeline is something CLAM cannot predict. CLAM needs to be able to compute the names of the expected output files a priori to really be sure how input relates to output (and what kinds of viewers/forwarders/types to associate with the output). CLAM has explicitly been designed this way.

There is no quick fix for this, but on the bright side, if the tif/jpg/png/gif input is a single file without a numeric suffix, things should work as expected. But if the sequence merging feature is triggered then CLAM can't determine what the output is with certainty, and it will just dump the output without any further associated metadata. A similar thing will happen with the PICCL feature that reassembles a PDF.
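
The mismatch can be sketched as follows. This is only an illustration of the reasoning above, not CLAM's API or PICCL's wrapper: the function names and the `.folia.xml` output extension are hypothetical.

```python
import re


def predicted_output(input_name, out_ext=".folia.xml"):
    """Name a wrapper could compute a priori: same stem as the input."""
    stem = input_name.rsplit(".", 1)[0]
    return stem + out_ext


def actual_output(input_name, out_ext=".folia.xml"):
    """Name produced when the pipeline merges sequences and drops the suffix."""
    stem = input_name.rsplit(".", 1)[0]
    stem = re.sub(r"_\d+$", "", stem)  # assumed suffix-stripping step
    return stem + out_ext


def is_predictable(input_name):
    """True when the a priori name matches what the pipeline would write."""
    return predicted_output(input_name) == actual_output(input_name)
```

With these assumptions, `is_predictable("page.tif")` holds, while `is_predictable("img_356416.tif")` does not: the expected file never appears, so no viewers or forwarders can be attached to the output that is actually written.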

This issue also resonates with a feeling that has been growing on me for a while: the level of complexity of a pipeline like PICCL, with multiple entry and exit points and many paths, does not lend itself all too well to being crammed into a single CLAM service. I'm currently not too happy with the form the wrapper script has taken, as it proves fairly error-prone and hard to test. (Don't get me wrong, things are still many orders of magnitude better than before this entire refactor effort.)

peterdekker commented 5 years ago

Thanks for investigating, I understand! Then this is just how it works for now.