UB-Mannheim / ocrd_pagetopdf

OCR-D wrapper for prima-pagetopdf
Apache License 2.0

image input file group requirement #2

Closed bertsky closed 4 years ago

bertsky commented 4 years ago

Thanks @JKamlah for making this great tool!

Would it be much effort to remove the requirement to have an explicit second input file group for the image? This should be just dereferenced from the /Page/@imageFilename in the PAGE file (relative to METS file path).
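To illustrate the dereferencing idea, here is a minimal Python sketch (not the tool's actual code). It assumes the PAGE file is available as a string, that @imageFilename holds a local relative path rather than a URL, and that the path is relative to the directory of the METS file. The names `resolve_page_image` and `local_name` are made up for this example.

```python
import os
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip an XML namespace, e.g. '{http://...}Page' becomes 'Page'."""
    return tag.rsplit("}", 1)[-1]


def resolve_page_image(page_xml, mets_path):
    """Return the path of the original image referenced by a PAGE document.

    page_xml  -- PAGE-XML content as a string
    mets_path -- path of the METS file; a relative image path is resolved
                 against its directory (assumption of this sketch)
    """
    root = ET.fromstring(page_xml)
    # Compare local names so that any PAGE schema version (namespace) works.
    for child in root:
        if local_name(child.tag) == "Page":
            filename = child.get("imageFilename")
            if not filename:
                raise ValueError("Page element has no imageFilename attribute")
            base_dir = os.path.dirname(mets_path)
            return os.path.normpath(os.path.join(base_dir, filename))
    raise ValueError("no Page element found in PAGE document")
```

With something like this, the image path comes straight from the PAGE annotation and no separate image file group has to be passed.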

Also, line 35: in_grps[1]: unbound variable is not a good error message IMO.

JKamlah commented 4 years ago

Sure it is. I will think about how to implement it, because I don't want to lose the option to add processed images, e.g. a binarized version instead of the original images.

bertsky commented 4 years ago

In the OCR-D functional model, all PAGE annotations will always refer to the original image. Derived images are under AlternativeImage only.

You could look at /PcGts/Page/AlternativeImage/@filename for binarized, dewarped, deskewed, etc. images. But then you have to make sure to re-calculate all coordinates: any segment's @points always refer to the original image under /Page/@imageFilename in PAGE, but an AlternativeImage can be cropped (consistent with Border), deskewed (consistent with @orientation) or even dewarped (without any information on the transformation).
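As a rough sketch of that lookup, the following Python snippet lists the page-level AlternativeImage entries of a PAGE document. It only reads the @filename and @comments attributes and does not transform any coordinates, so the caveat above still applies. The function names are hypothetical, and the example values in the usage note are invented.

```python
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip an XML namespace, e.g. '{http://...}Page' becomes 'Page'."""
    return tag.rsplit("}", 1)[-1]


def page_level_alternative_images(page_xml):
    """Return (filename, comments) pairs for /PcGts/Page/AlternativeImage.

    Only direct children of Page are considered; AlternativeImage elements
    on regions or lines are ignored in this sketch.
    """
    root = ET.fromstring(page_xml)
    found = []
    for page in root:
        if local_name(page.tag) != "Page":
            continue
        for child in page:
            if local_name(child.tag) == "AlternativeImage":
                found.append((child.get("filename"), child.get("comments", "")))
    return found
```

For a page that carries one binarized derivative, this would return something like `[("OCR-D-IMG-BIN/page_0001.png", "binarized")]`. Whether such an image can be used directly depends on whether it was also cropped, deskewed or dewarped.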

bertsky commented 4 years ago

So maybe you can at least make the second input file group for images optional (and default to @imageFilename), also avoiding the above strange error message when missing?
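A sketch of what the optional second group could look like, written in Python for illustration only (the wrapper itself may handle this differently). It assumes the input file groups arrive as one comma-separated string, with the PAGE group first and the image group second. `split_input_groups` and the group names in the usage note are hypothetical.

```python
def split_input_groups(input_file_grp):
    """Split a comma-separated file group string into (page_grp, image_grp).

    image_grp is None when no second group was given, which signals the
    caller to fall back to @imageFilename from the PAGE file.
    """
    groups = [g.strip() for g in input_file_grp.split(",") if g.strip()]
    if not groups:
        raise ValueError("expected at least one input file group (PAGE-XML)")
    if len(groups) > 2:
        raise ValueError(
            "expected at most two input file groups (PAGE-XML, images), "
            "got %d" % len(groups)
        )
    page_grp = groups[0]
    image_grp = groups[1] if len(groups) == 2 else None
    return page_grp, image_grp
```

Called with `"OCR-D-OCR"` this yields `("OCR-D-OCR", None)`, and with `"OCR-D-OCR,OCR-D-IMG-BIN"` it yields both names. A missing group leads to a readable error instead of an unbound-variable failure.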

JKamlah commented 4 years ago

The error message is already fixed. I will implement the optional image parameter soon.