OCR-D / spec

Specification of the @OCR-D technical architecture, interface definitions and data exchange format(s)
https://ocr-d.de/en/spec/

support dewarping #180

Open bertsky opened 3 years ago

bertsky commented 3 years ago

This is already partly covered by #116, but I would like to discuss the specific problem that dewarping poses for the coordinate reproducibility principle.

Now that we have promising tools that we could wrap for page-level dewarping, like blitzDrt for perspective correction and Origami's dewarper for parametric grid morphing, we should provide a way to integrate them into OCR-D.

To represent the coordinate system after dewarping the page, we could rely on PAGE-XML's dewarping schema (DwGts for short). It references the original image under /DwGts/DocumentImage/@filename and describes the morphing grid under /DwGts/Grid (with Row[*]/@points against Row[*]/@index with Row[*]/@refLinePos and Column[*]/@index with Column[*]/@refLinePos). (Unfortunately, it comes with very little documentation and no examples.)

But this is a separate XML file that is not referenced by the PAGE-XML content schema (PcGts). So for dewarping steps, the output fileGrp would need to comprise three files per page:

  1. the output (dewarped) image
  2. the output PcGts annotation, referencing 1. under /PcGts/Page/@imageFilename instead of the original/input image, and transforming all existing coordinates of the input PcGts
  3. the output DwGts annotation, referencing the original/input image under /DwGts/DocumentImage/@filename

So any later processing step will only "see" the dewarped image and use its coordinate system. Whenever we want to transform back, we'll have to take the current PcGts, look up the earlier DwGts, and create a new PcGts by replacing /PcGts/Page/@imageFilename with /DwGts/DocumentImage/@filename and inverse-transforming all coordinates according to /DwGts/Grid. This could happen at final ingest or at some intermediate step.
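A minimal sketch of that back-transformation on the PcGts side, in Python. It assumes the inverse of the /DwGts/Grid mapping is already available as a plain function from dewarped to original pixel coordinates. How to derive that function from the grid is not shown, and `transform_back` and `toy_inverse` are hypothetical names used only for illustration.

```python
import xml.etree.ElementTree as ET

PAGE_NS = "http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15"

def parse_points(points):
    """Parse a PAGE @points string ("x1,y1 x2,y2 ...") into (x, y) tuples."""
    return [tuple(int(v) for v in pair.split(",")) for pair in points.split()]

def format_points(coords):
    """Serialize (x, y) tuples back into a PAGE @points string."""
    return " ".join(f"{round(x)},{round(y)}" for x, y in coords)

def transform_back(pcgts_root, original_filename, inverse):
    """Point the page at the original image and map every @points attribute
    (Coords, Baseline, ...) through `inverse`.

    `inverse` is an assumed callable (x, y) -> (x, y) standing in for the
    inverse of the DwGts grid. @imageWidth/@imageHeight are not updated here.
    """
    page = pcgts_root.find(f"{{{PAGE_NS}}}Page")
    page.set("imageFilename", original_filename)
    for element in page.iter():
        if "points" in element.attrib:
            pts = parse_points(element.get("points"))
            element.set("points", format_points(inverse(x, y) for x, y in pts))
    return pcgts_root

# Toy input: a PcGts that refers to the dewarped image.
SAMPLE = f"""<PcGts xmlns="{PAGE_NS}">
  <Page imageFilename="dewarped.png" imageWidth="100" imageHeight="100">
    <TextRegion id="r1"><Coords points="0,0 10,0 10,10 0,10"/></TextRegion>
  </Page>
</PcGts>"""

def toy_inverse(x, y):
    # Placeholder mapping, NOT a real dewarping inverse.
    return x + 5, y * 2

root = transform_back(ET.fromstring(SAMPLE), "original.png", toy_inverse)
page = root.find(f"{{{PAGE_NS}}}Page")
region_points = page.find(f".//{{{PAGE_NS}}}Coords").get("points")
print(page.get("imageFilename"), region_points)
```

Passing the inverse as a callable keeps the PcGts rewriting independent of how the grid is interpolated, which is the part the DwGts schema leaves underdocumented.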

Potential problems:

kba commented 3 years ago

Thanks for summarizing the problem and opening this discussion.

I will have to think more about this and ideally, we should also discuss this with @chris1010010. But as to the potential problems you raise:

bertsky commented 3 years ago

  • mets:transformFile is probably the most METS-compliant mechanism

I'm not so sure about that. It comes with an obligatory @TRANSFORMTYPE restricted to either decompression or decryption. We could ignore the usual semantics of that, but it's probably not so great for compliance.

On the other hand, using mets:GROUPID for an arbitrary identifier shared by the original and derived page-level image would meet the intent of the METS spec and allow us to easily find any associated images. (We could even use that for AlternativeImage dependency tracking across fileGrps in general. But it has only set semantics, whereas map semantics would be better for our directed dependency graph.)
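To illustrate the lookup, here is a rough Python sketch that collects all files sharing a mets:file/@GROUPID across fileGrps. The sample METS, the fileGrp names, the file IDs and the `images_by_group` helper are made up for this example.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

NS = {
    "mets": "http://www.loc.gov/METS/",
    "xlink": "http://www.w3.org/1999/xlink",
}

# Hypothetical workspace METS: original and dewarped image share one GROUPID.
SAMPLE_METS = """<mets:mets xmlns:mets="http://www.loc.gov/METS/"
                            xmlns:xlink="http://www.w3.org/1999/xlink">
  <mets:fileSec>
    <mets:fileGrp USE="OCR-D-IMG">
      <mets:file ID="IMG_0001" GROUPID="page0001" MIMETYPE="image/png">
        <mets:FLocat LOCTYPE="URL" xlink:href="img/0001.png"/>
      </mets:file>
    </mets:fileGrp>
    <mets:fileGrp USE="OCR-D-DEWARP">
      <mets:file ID="DEWARP_0001" GROUPID="page0001" MIMETYPE="image/png">
        <mets:FLocat LOCTYPE="URL" xlink:href="dewarp/0001.png"/>
      </mets:file>
      <mets:file ID="DEWARP_0001_PAGE" MIMETYPE="application/vnd.prima.page+xml">
        <mets:FLocat LOCTYPE="URL" xlink:href="dewarp/0001.xml"/>
      </mets:file>
    </mets:fileGrp>
  </mets:fileSec>
</mets:mets>"""

def images_by_group(mets_root):
    """Map each GROUPID to a list of (fileGrp USE, file ID, href) tuples."""
    groups = defaultdict(list)
    for file_grp in mets_root.iterfind(".//mets:fileGrp", NS):
        for file_el in file_grp.iterfind("mets:file", NS):
            group_id = file_el.get("GROUPID")
            if group_id is None:
                continue  # files without GROUPID are not part of any group
            flocat = file_el.find("mets:FLocat", NS)
            href = None
            if flocat is not None:
                href = flocat.get(f"{{{NS['xlink']}}}href")
            groups[group_id].append((file_grp.get("USE"), file_el.get("ID"), href))
    return dict(groups)

groups = images_by_group(ET.fromstring(SAMPLE_METS))
print(groups["page0001"])
```

The result shows the limitation mentioned above: the group only says that the two images belong together, not which one was derived from which.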

  • since we rely on the pc:AlternativeImage/@comments mechanism extensively already, we should focus on that. We'd need a way to distinguish the reversible/coordinate-stable dewarping to be implemented from non-reversible legacy dewarping.

So far we rely on that mechanism only to indicate which coordinate transforms described in PcGts actually apply to an AlternativeImage, so we can track its coordinate system w.r.t. /Page/@imageFilename. But we don't need to do that (procedurally) when we allow replacing the latter, because the coordinate system will already be the same (the dewarping will already be "pre-applied").
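As a rough Python illustration of that procedural tracking: the sketch below reads each AlternativeImage/@comments as a comma-separated feature list and reports which coordinate-changing features apply. The feature names in `COORD_FEATURES`, the sample file names and the `coordinate_features` helper are assumptions for this example, not a statement of what the spec prescribes.

```python
import xml.etree.ElementTree as ET

PAGE_NS = "http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15"

# Assumed set of features that alter the coordinate system of a derived image.
COORD_FEATURES = {"cropped", "deskewed"}

SAMPLE = f"""<PcGts xmlns="{PAGE_NS}">
  <Page imageFilename="original.png" imageWidth="100" imageHeight="100">
    <AlternativeImage filename="bin.png" comments="binarized"/>
    <AlternativeImage filename="crop.png" comments="binarized,cropped,deskewed"/>
  </Page>
</PcGts>"""

def coordinate_features(pcgts_root):
    """Return {filename: sorted coordinate-changing features} per AlternativeImage."""
    page = pcgts_root.find(f"{{{PAGE_NS}}}Page")
    result = {}
    for alt in page.findall(f"{{{PAGE_NS}}}AlternativeImage"):
        # Split the comma-separated list and drop empty entries.
        features = {f.strip() for f in (alt.get("comments") or "").split(",") if f.strip()}
        result[alt.get("filename")] = sorted(features & COORD_FEATURES)
    return result

tracked = coordinate_features(ET.fromstring(SAMPLE))
print(tracked)
```

With the replaced-image approach, a dewarped page image would not need an entry of this kind, because its coordinate system is already the one used by /Page/@imageFilename.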

That point was more about the workspace/METS than the processor/PAGE side: There should be a fast and reliable way of identifying any changes to the original image across the workflow chain, without the need to search through all pages and PAGEs. I'm not a METS expert, and there are so many ways to represent that. We just need something that does not break any existing use-cases, is not too contrived and can be implemented efficiently. (And we should still allow for the possibility of not being able to track the coordinate system but nevertheless marking the change as such, so implementations like anybaseocr-dewarp can at least fit in.)

There's of course an alternative to replacing the original image and using DwGts: We could also facilitate PcGts-only dewarping with some representation in @custom as a descriptive means for the coordinate transform. Here the need for a strict usage of the @comments mechanism and the issue of "stable" (i.e. with @custom) vs "legacy" (without @custom) does arise. (This would also help with line-level dewarping, which we cannot represent with DwGts at all.) But in that case I would rather extend PcGts upstream with some /PcGts/Page/Grid.