OCR-D / core

Collection of OCR-related python tools and wrappers from @OCR-D
https://ocr-d.de/core/
Apache License 2.0

Matching PAGE imageFilename to mets:file when imageFilename is not a URL #176

Closed kba closed 2 years ago

kba commented 6 years ago

Scenario:

  1. Image files and PAGE referencing those image files by relative filepath:
    <Page imageFilename="foo.tif"/>
  2. Create a METS file and run workspace add:

    <mets:file GROUPID="page0001" xlink:href="file://path/to/bla/foo.tif"/>

Now the PAGE imageFilename and xlink:href of the corresponding mets:file do not match anymore.

kba commented 6 years ago

Solution 1) On workspace add, change the imageFilename of the PAGE.

Solution 2) Have a convention in place to map from @imageFilename to @xlink:href (like "match if fileName is a suffix of a mets:file@xlink:href" or "match if fileName is the GROUPID of a page and there is a mets:file with mimetype image/* with that GROUPID"). This could be automated.

Solution 3) Track the relation externally. A mechanism like this will be necessary anyway because of the problem of local URLs irreversibly replacing remote URLs when downloading files.
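
The two heuristics named under solution 2 could be sketched roughly as follows. This is only an illustration under stated assumptions: `MetsFile` and `match_image_filename` are hypothetical names that do not exist in OCR-D core, and the fields simply mirror the attributes used in the scenario above.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class MetsFile:
    """Hypothetical stand-in for a mets:file entry (not an OCR-D core class)."""
    group_id: str   # GROUPID of the page
    href: str       # xlink:href of the file location
    mimetype: str


def match_image_filename(image_filename: str, files: List[MetsFile]) -> Optional[MetsFile]:
    """Map a PAGE imageFilename to a mets:file using the two heuristics of solution 2."""
    # Heuristic 1: imageFilename is a path suffix of exactly one xlink:href.
    suffix_hits = [f for f in files
                   if f.href == image_filename or f.href.endswith("/" + image_filename)]
    if len(suffix_hits) == 1:
        return suffix_hits[0]
    # Heuristic 2: imageFilename equals a GROUPID that has exactly one image/* file.
    group_hits = [f for f in files
                  if f.group_id == image_filename and f.mimetype.startswith("image/")]
    if len(group_hits) == 1:
        return group_hits[0]
    # Ambiguous or no match: leave the decision to the caller.
    return None
```

Returning `None` on ambiguity is a deliberate choice in this sketch: guessing between several candidates would silently bind a PAGE file to the wrong image.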

wrznr commented 6 years ago

Without overseeing the technical consequences: Only a cosmetic nastiness? I am not sure we ever touch the file refs in PAGE, do we?

kba commented 6 years ago

See https://ocr-d.github.io/page#url-for-imagefilename--filename

The imageFilename is necessary to get from page to the mets:file that represents the image.

wrznr commented 6 years ago

Ah, okay. In this case, :+1: for solution 3.

VolkerHartmann commented 6 years ago

Solution 1 only works if workspace add is used, which may be a drawback. Solution 2 sounds complex. There may be several images for the same page (orig, binarized, cropped, deskewed, ...). Solution 3 works out of the box, analyzing the whole METS and the referenced PAGEs in one step. This may be done each time an export/import is planned. My vote is for solution 3.

kba commented 6 years ago

Solution 1 only works if workspace add is used which may be a drawback.

Solution 3 works out of the box analyzing whole METS and referenced PAGEs in one step.

Solution 3 reasonably also only works on workspace add since this has to be an external file in the workspace (currently I'm using url-aliases.csv). It could be populated by hand or external mechanism but then again, so could you change the PAGE by hand (or with sed).
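
An external mapping file of this kind could be read and written along these lines. The sketch assumes a two-column layout of local path and URL; the actual format of url-aliases.csv is not described in this thread, and `load_aliases` / `save_aliases` are hypothetical helper names.

```python
import csv
import io
from typing import Dict, TextIO


def load_aliases(stream: TextIO) -> Dict[str, str]:
    """Read rows of (local_path, url) into a dict; the column layout is an assumption."""
    return {row[0]: row[1] for row in csv.reader(stream) if len(row) >= 2}


def save_aliases(aliases: Dict[str, str], stream: TextIO) -> None:
    """Write the mapping back, one (local_path, url) row per entry."""
    writer = csv.writer(stream)
    for local_path, url in sorted(aliases.items()):
        writer.writerow([local_path, url])


# Round trip through an in-memory buffer instead of a real file on disk.
buffer = io.StringIO()
save_aliases({"OCR-D-IMG/foo.tif": "https://example.org/images/foo.tif"}, buffer)
buffer.seek(0)
aliases = load_aliases(buffer)
```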

bertsky commented 6 years ago

Pardon my being slow-witted, but what was the reason for https://ocr-d.github.io/page#url-for-imagefilename--filename (always requiring a URL, even if local) in the first place? Why not use relative paths (without file:// scheme)?

I thought the workspace metaphor would work like a DVCS repository. But if we require URLs everywhere, I cannot move my workspaces around in the filesystem. Am I supposed to clone -l instead?

(BTW, pack / unpack should also beware of file URLs.)

kba commented 6 years ago

what was the reason for ocr-d.github.io/page#url-for-imagefilename--filename (always requiring a URL, even if local) in the first place? Why not use relative paths (without file:// scheme)?

The original plan was to completely forgo the filesystem and use a repository for all intermediate results, not just of workflow runs but of individual processors (hence the file resolver and cache etc.). Processors were to download the data by URL, do their thing, upload the data and set the URL. file:// URLs or relative paths should be avoided because having them manifest in the data makes it error-prone when tasks are to be distributed, parallelized etc., in a workflow.

The workspace is the place where processors "do their thing", a mere implementation-specific helper for a processor. We considered the mets.xml to be the single source of truth for all data and metadata; it should always be enough to have that mets.xml and access all files via their persistent HTTP URL.

Nowadays, full provenance and reproducibility of every single step is not our top priority anymore. This allows us to make that workspace/Git-like approach a first-class concept. We should adapt the specs to reflect this.

bertsky commented 6 years ago

Thanks, it makes sense to me now. But what still escapes me is the logic of:

file:// URLs or relative paths should be avoided because having them manifest in the data makes it error-prone when tasks are to be distributed, parallelized etc.

I completely agree as far as file:// URLs are concerned, but relative paths? Isn't that manifestation the best way to make a distributed system thrive (as DVCS success shows) and scale? Requiring all computation to do I/O via URLs incurs a huge bottleneck and hinders parallelization (due to synchronization effort). Even with a distributed file system (which is an alternative to URLs with client-server transfer protocols) I would recommend allowing intermediate I/O to be local (temporary).

Anyway, if I understand you correctly, you will move towards allowing local intermediate steps and workspaces as true DVCS. Can I conclude from this that relative file names will be your preferred solution for this issue, too? (Or am I misreading your explanation?)

kba commented 6 years ago

Isn't that manifestation the best way to make a distributed system thrive (as DVCS success shows) and scale?

In mass digitisation we cannot assume that mets.xml and referenced data are on the same FS (workspace/DVCS metaphor), so the mets.xml acts more as a manifest.

Requiring all computation to do I/O via URLs incurs a huge bottleneck and hinders parallelization

Of course you need some form of caching on a local filesystem. Hence the workspace: Create a local folder with all required files for a processor to work on. In fact that was why originally those were created in /tmp because it was mounted in RAM and hence fast.

But once the local processes are complete, ensure that all data is stored persistently and no references to local files remain. You need to do that I/O at some point, download all the files, keep track of which local file represents which file URL, and in the end store it somewhere persistently.

Can I conclude from this that relative file names will be your preferred solution for this issue, too?

No, I would still prefer URL to be used in the data. The best way to avoid having references to local-only data is not to persist it. Instead, I'd be for a mechanism to map local filenames to opaque identifiers, such as a URL or whatever string is in the imageFilename of a PAGE-XML etc.

bertsky commented 6 years ago

In mass digitisation we cannot assume that mets.xml and referenced data are on the same FS

Sorry, I somehow forgot about that (it seems strange to me now, too). Then the DVCS metaphor is perhaps misleading.

So how about this new scheme: A workspace is nothing but an identical copy of the remote mets.xml (using only public URLs) plus the files in relative paths of the local FS – by the same path name convention as ocrd-zip (or something that does not require changing the imageFilename and filename of PAGE-XML to mets:fileGrp USE directory and mets:file ID filename).

But once the local processes are complete, ensure that all data is stored persistently and no references to local files remain. You need to do that I/O at some point, download all the files, keep track of which local file represents which file URL, and in the end store it somewhere persistently.

Yes, understood. In the above scheme, there would be no local reference (to keep track of) any more. So as soon as a new file gets added, one would be required to provide a URL for it. Making persistent then simply uploads the modified mets.xml plus the added files to the new URLs.

kba commented 6 years ago

Then the DVCS metaphor is perhaps misleading.

I think it is useful for individual processes, as an abstraction for implementers and for workflow/archiving purposes. But from the perspective of a digitisation engineer/sysadmin, it's best to assume:

the same path name convention as ocrd-zip (or something that does not require changing the imageFilename and filename of PAGE-XML

This is tricky. It's what I meant by solution 2) above. If we assume that input page files use random strings as filename how would you map that back to the mets:file? We tried to require a convention and it failed - reasonably - before even being tested on real-world data (which is even messier, with NFS file paths used as xlink:href or invalid characters for IDs etc).

Making persistent then simply uploads the modified mets.xml plus the added files to the new URLs.

This is what I'd consider the repository approach: A service that accepts PUT/POST requests and GET download requests. I would have preferred that six months ago, but as you said earlier, it's much more effort for the processors to fetch&upload. It's reasonably easy to integrate into core (we experimented with that early on) but not all contributors build on it and it makes testing much harder, requires a repository server etc.

IIUC (@VolkerHartmann @wrznr?) we won't have a repository on the task level, we cannot enforce naming conventions in input data. That leaves us with the option to have external mappings between identifiers, pure file-system access to files and OCRD-ZIP with the planned BagIt+Git extensions (https://github.com/OCR-D/spec/pull/70 and https://github.com/OCR-D/spec/pull/73) as the exchange format.

bertsky commented 6 years ago

Ok, I got it now, that was really asking for your solution 2.

If we assume that input page files use random strings as filename how would you map that back to the mets:file?

Well I did not say random, I rather meant a convention that fits most existing naming schemes. But I can imagine how that quickly collapses with real-world data.

So (to be more precise) why not just use

mets:fileGrp USE directory and mets:file ID filename

exclusively, with solution 1 (and not changing URLs in the METS at all)? And invalid characters for ID would be a problem in any case, wouldn't they?

Making persistent then simply uploads the modified mets.xml plus the added files to the new URLs.

This is what I'd consider the repository approach: A service that accepts PUT/POST requests and GET download requests. I would have preferred that six months ago, but as you said earlier, it's much more effort for the processors to fetch&upload. It's reasonably easy to integrate into core (we experimented with that early on) but not all contributors build on it and it makes testing much harder, requires a repository server etc.

I am not sure I understand that, yet. Making persistent in my sense happens at the end of the workflow pipeline. Everything in between can happen locally, and processors could be allowed to create "temporary" annotations (marked by, say, bogus URLs) in between. As you said earlier, at some point that initial/final I/O needs to happen anyway.

I would also favour established standards like BagIt over a self-baked OCRD-ZIP here. But if I am not mistaken, then an external mapping would not be necessary: all the URLs stay in the mets.xml, all directory and file names in the archive (in this case, data/) or filesystem derive from its USE and ID attributes. (And that of course does not rule out OCRD-GITZIP either.)
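
To make the USE/ID naming idea concrete, here is a minimal sketch of deriving a local path from a file group's USE and a file's ID. `local_path_for` is a hypothetical helper, and deriving the extension from the MIME type is an assumption of this example, not something the proposal spells out.

```python
import mimetypes
import posixpath


def local_path_for(use: str, file_id: str, mimetype: str) -> str:
    """Derive a workspace-relative path from mets:fileGrp/@USE and mets:file/@ID."""
    # Refuse USE or ID values that would escape the directory layout.
    for part in (use, file_id):
        if not part or "/" in part or part in (".", ".."):
            raise ValueError("cannot use %r as a path component" % part)
    # Extension lookup is best effort; unknown media types get no extension.
    extension = mimetypes.guess_extension(mimetype) or ""
    return posixpath.join(use, file_id + extension)
```

For example, `local_path_for("OCR-D-IMG", "OCR-D-IMG_0001", "image/png")` yields `OCR-D-IMG/OCR-D-IMG_0001.png`. The check at the top shows why invalid characters in IDs remain a problem under this scheme too: such IDs have to be rejected or escaped before they can serve as file names.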

VolkerHartmann commented 6 years ago

Can't we reference via USE and ID? The modules should already know these values, since they have to address the file via them anyway. For example: mets://OCR-D-IMG/OCR-D-IMG_0001

This is also how we download and rename external files referenced in the METS. OK, mets may not be a valid protocol.
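
If such a reference were adopted, splitting it into USE and ID would be straightforward. The sketch below treats `mets://USE/ID` purely as the hypothetical notation suggested here; it is not a registered URL scheme, and `parse_mets_reference` is an illustrative name.

```python
from typing import Tuple
from urllib.parse import urlsplit


def parse_mets_reference(reference: str) -> Tuple[str, str]:
    """Split a hypothetical mets://USE/ID reference into (USE, ID)."""
    parts = urlsplit(reference)
    if parts.scheme != "mets":
        raise ValueError("not a mets:// reference: %r" % reference)
    # The file group sits where a host name would be, the ID in the path.
    use, file_id = parts.netloc, parts.path.lstrip("/")
    if not use or not file_id or "/" in file_id:
        raise ValueError("expected mets://USE/ID, got %r" % reference)
    return use, file_id
```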

bertsky commented 6 years ago

Can't we reference via USE and ID?

If by reference you mean store to the filesystem (at workspace add or workspace clone time) and retrieve from the filesystem (within processors), then this is exactly what I was proposing. (I still do not see the necessity of external file-URL bookkeeping.) After all, the workspace is the filesystem "cache" of a document repository (mets.xml + annotations). Why should it even bother with the filename part of its persistent URLs?

kba commented 6 years ago

@kba note to self: revisit after https://github.com/OCR-D/assets/pull/18

kba commented 5 years ago

We've switched to relative paths throughout. While I still think a mechanism for external URL bookkeeping (as @bertsky puts it) would be useful, it is not currently necessary, so I'll close this until an actual need arises. Thanks for all the feedback.

bertsky commented 5 years ago

Pardon me, I have to bring this up again. So, as of v0.15.2 we have relative paths now, which is great. But apart from #96, which shows some bugs in the general implementation remain, what about the actual original issue presented above?

When I use the workspace add way of importing GT data (and I know of no other), I still see imageFilename staying untouched, and I still end up with the error:

...ocrd_keraslm/test/test_wrapper.py:47:
...ocrd_tesserocr/ocrd_tesserocr/recognize.py:93: in process
    pil_image = self.workspace.resolve_image_as_pil(pcgts.get_Page().imageFilename)
...env3/lib/python3.6/site-packages/ocrd/workspace.py:167: in resolve_image_as_pil
    image_filename = self.download_url(image_url)
...env3/lib/python3.6/site-packages/ocrd/workspace.py:70: in download_url
    return self.resolver.download_to_directory(self.directory, url, src_dir=self.src_dir, **kwargs)
...env3/lib/python3.6/site-packages/ocrd/resolver.py:81: in download_to_directory
    copyfile(url, outfilename)
src = 'kant_aufklaerung_1784_0017.tif', dst = '/tmp/pyocrd-test-ocrd_keraslm/kant.aufklaerung.1784.0017.tif'
E           FileNotFoundError: [Errno 2] No such file or directory: 'kant_aufklaerung_1784_0017.tif'

Unless I am mistaken, this needs to be re-opened, too.

kba commented 5 years ago

Unless I am mistaken, this needs to be re-opened, too.

You're right, I will spend Monday on these issues (#176 #96 etc)

bertsky commented 5 years ago

This is still true with 1.0.0b5. I believe this also affects workspace clone and zip bag besides workspace add.

bertsky commented 5 years ago

I don't think this can wait until the dev workshop.

kba commented 5 years ago

Revisiting this with @tboenig:

So we need logic to determine the relative path from mets.xml to image by resolving imageFilename of a PAGE against the relative path to that PAGE.
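
The resolution step described here can be sketched with plain path arithmetic. This is an illustration, not the logic in core: `image_path_relative_to_mets` is a hypothetical name, and rejecting paths that leave the workspace is an assumption of this example.

```python
import posixpath


def image_path_relative_to_mets(page_path: str, image_filename: str) -> str:
    """Resolve a PAGE-relative imageFilename into a path relative to mets.xml.

    page_path is the PAGE file's own path relative to mets.xml.
    """
    page_dir = posixpath.dirname(page_path)
    resolved = posixpath.normpath(posixpath.join(page_dir, image_filename))
    # Assumption of this sketch: images must live inside the workspace.
    if resolved == ".." or resolved.startswith("../") or posixpath.isabs(resolved):
        raise ValueError("%r points outside the workspace" % image_filename)
    return resolved
```

With a PAGE file at `OCR-D-GT-PAGE/OCR-D-GT-PAGE_0001.xml` and an imageFilename of `../OCR-D-IMG/OCR-D-IMG_0001.png`, this returns `OCR-D-IMG/OCR-D-IMG_0001.png`, which is the form that can be compared against METS-relative file locations.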

mikegerber commented 5 years ago

* `imageFilename` in PAGE must always be a relative file path relative to that PAGE file, otherwise tools like Aletheia or PAGEViewer won't work
* `mets:FLocat` is ideally a relative path from the `mets.xml`

Is this the consensus now? Because a. I want/need to use the PAGE Viewer and b. it also seems correct.

bertsky commented 5 years ago

I think so. But this will have repercussions all over our implementations: until now, everything was relative to METS. And we have an additional interdependency between tools and data (GT bags) here. So it might take some time until this is available. Until then we all have to live with the hassle of pointing PageViewer to the image every time.

mikegerber commented 5 years ago

I used to automatically correct the imageFilename for easy viewing in PAGE Viewer. But with the latest ocrd 1.0.0b19, the situation is worse because ocrd workspace validate now seems to check for the (in my opinion) incorrect METS-relative filenames.

16:24:54.211 INFO ocrd.resolver.download_to_directory - directory=|/srv/data/qurator-data/OCR-D-GT-repacked/busmexpo_742567524| url=|../OCR-D-IMG-BIN/OCR-D-IMG-BIN_0001.png| basename=|OCR-D-IMG-BIN_0001.png| if_exists=|skip| subdir=|TEMP|
16:24:54.211 INFO ocrd.resolver.download_to_directory - directory=|/srv/data/qurator-data/OCR-D-GT-repacked/busmexpo_742567524| url=|/srv/data/qurator-data/OCR-D-GT-repacked/busmexpo_742567524/../OCR-D-IMG-BIN/OCR-D-IMG-BIN_0001.png| basename=|OCR-D-IMG-BIN_0001.png| if_exists=|skip| subdir=|TEMP|
Traceback (most recent call last):
  File "/home/mike/.virtualenvs/ocrd/lib/python3.7/site-packages/ocrd/workspace.py", line 100, in download_file
    f.url = self.resolver.download_to_directory(self.directory, f.url, subdir=f.fileGrp, basename=basename)
  File "/home/mike/.virtualenvs/ocrd/lib/python3.7/site-packages/ocrd/resolver.py", line 77, in download_to_directory
    raise FileNotFoundError("File path passed as 'url' to download_to_directory does not exist: %s" % url)
FileNotFoundError: File path passed as 'url' to download_to_directory does not exist: ../OCR-D-IMG-BIN/OCR-D-IMG-BIN_0001.png

It also tries to "download" the local file to TEMP and so this seems to be connected to issue #324.

bertsky commented 5 years ago

I am sure the new validation was added in preparation of fixing this within the new logic.

But there is a simple remedy: just --skip=imageFilename

mikegerber commented 5 years ago

Not remedied using the latest master which has this skip option:

% ocrd workspace validate --skip pixel_density --skip imagefilename mets.xml
Traceback (most recent call last):
  File "/home/mike/devel/OCR-D/core/ocrd/ocrd/workspace.py", line 100, in download_file
    f.url = self.resolver.download_to_directory(self.directory, f.url, subdir=f.fileGrp, basename=basename)
  File "/home/mike/devel/OCR-D/core/ocrd/ocrd/resolver.py", line 77, in download_to_directory
    raise FileNotFoundError("File path passed as 'url' to download_to_directory does not exist: %s" % url)
FileNotFoundError: File path passed as 'url' to download_to_directory does not exist: ../OCR-D-IMG-BIN/OCR-D-IMG-BIN_0001.png

kba commented 5 years ago

PAGE filenames will have to be relative to the METS. PAGE Viewer and Aletheia will have options to change the base for relative filenames. Since #333 PAGE filenames in OCRD-ZIP will be updated, but this has not yet been implemented for general workspace methods.

bertsky commented 4 years ago

So all that remains to do here is fixing workspace add, right?

It should be simple to implement something along the lines of https://github.com/OCR-D/docs/blob/master/fix-gt.sh in core Python...

cneud commented 4 years ago

I admit I am slightly puzzled what still needs fixing here...IIUC, there must not/cannot be a case where the PAGE imageFilename IS NOT relative to the mets.xml - either a PAGE file has been created by some ocrd-* process and thus should always be relative to the mets.xml or the PAGE file is ground truth in which case we also (need to) ensure this is the case. Or am I missing sth? Do you have an example @bertsky?

kba commented 4 years ago

Until then we all have to live with the hassle of pointing PageViewer to the image every time.

PAGE Viewer has --resolve-dir now https://github.com/PRImA-Research-Lab/prima-page-viewer/issues/6

Do you have an example @bertsky?

I would also find that helpful. I'm having a hard time thinking of a case where we add to a workflow PAGE-XML that does not already adhere to the imageFilename-relative-to-mets / imageFilename-must-be-in-METS patterns. In most cases, workflows will start with images from which we derive PAGE-XML with correct imageFilename, don't we?

bertsky commented 4 years ago

IIUC, there must not/cannot be a case where the PAGE imageFilename IS NOT relative to the mets.xml - either a PAGE file has been created by some ocrd-* process and thus should always be relative to the mets.xml or the PAGE file is ground truth in which case we also (need to) ensure this is the case. Or am I missing sth?

Neither of these cases is what ocrd workspace add is typically used for. You need this for GT files from other sources (or OCR-D GT releases before BagIt/METS, which even now are the only GT with text content). These have varying @imageFilename conventions, depending on their directory structure. Now when ocrd workspace add reads a PAGE-XML file, it can still resolve the original image in the filesystem, and try to rebase to the workspace.

One obvious use-case would be ocrd-import. (But in that repo, you can still work around the problem by doing ocrd-make repair afterwards, at least sometimes.)

But maybe, you'd say, this is too difficult to get right in ocrd workspace add, please use ocrd zip bag for that! But how will this work, if the old URL did not work to begin with?

kba commented 4 years ago

when ocrd workspace add reads a PAGE-XML file, it can still resolve the original image in the filesystem, and try to rebase to the workspace. [...] But maybe, you'd say, this is too difficult to get right in ocrd workspace add

It's a simple enough feature, questions:

Let's make it toggleable with a --include-page-images/--no-include-page-images or similar flag.

Let's default NOT to do this because it really only makes sense when importing data, not, e.g., every time a bashlib processor wants to add an image.

bertsky commented 4 years ago

* Moving images and PAGE to the workspace will require changing the input PAGE. Not really a question, just a statement

Yes, that's crucial. If we take this seriously, ocrd workspace add on PAGE-XML files will either take control of that file or make a copy of it (under the "right" path).

* Also do this for AlternativeImage? Does anyone beside us even use them? I suppose yes and no.

I guess we have to consider the possibility. If we solve this conceptually for Page/@imageFilename, it should work the same for AlternativeImage/@filename though.
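
Treating both attributes alike could look like the sketch below, which rewrites a copy of a PAGE document rather than the input itself. `rewrite_image_references` and the `rebase` callback are hypothetical; matching on local element names is a shortcut of this example so that it works regardless of the PAGE schema version.

```python
import xml.etree.ElementTree as ET
from typing import Callable


def rewrite_image_references(page_xml: str, rebase: Callable[[str], str]) -> str:
    """Return a copy of a PAGE document with all image references rebased.

    Covers Page/@imageFilename and AlternativeImage/@filename alike.
    """
    root = ET.fromstring(page_xml)
    for element in root.iter():
        # Compare local names so that any PAGE schema version is accepted.
        local_name = element.tag.rsplit("}", 1)[-1]
        if local_name == "Page" and "imageFilename" in element.attrib:
            element.set("imageFilename", rebase(element.get("imageFilename")))
        elif local_name == "AlternativeImage" and "filename" in element.attrib:
            element.set("filename", rebase(element.get("filename")))
    return ET.tostring(root, encoding="unicode")
```

Note that ElementTree serializes with generated namespace prefixes unless they are registered beforehand, so the output is equivalent XML but not byte-identical to the input.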

* How to determine file metadata for the `imageFilename`? Media Type can be guessed but what `mets:fileGrp` to add the images to? Maybe the filegroup used as the input plus suffix `-IMG`?

IIUC you assume here that ocrd workspace add will be responsible for adding the image file along with the PAGE-XML file passed to it. We could have other provisions (like assuming the image file must already have been added by then), but let's follow this logic for now:

Yes, the image could be placed under a fileGrp implicitly derived from the fileGrp for the PAGE-XML, or even the same fileGrp (just with a different MIME type and not appearing in the structMap).

Let's make it toggleable with a --include-page-images/--no-include-page-images or similar flag.

If we add an option, why not just the name of the image file group (or none for "ignore images")?

* Any issues that arise from necessary conventions for this are the user's responsibility, i.e. if they want to set a different name or different media type for an image, they either need to post-process the XML themselves or not use this feature and do the image adding themselves as before

Right. And let's think about the second use-case (adding PAGE-XML after image) more thoroughly: Now ocrd workspace add can go looking for the (basename of the) filename in the (image) flocat URLs of the METS, and calculate the new relative path for the PAGE-XML under its destination directory. If it does not find an image with that filename, it can still go looking for an image with the same pageId. And then it can fail loudly.
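
The lookup order described here (basename first, then pageId, then a loud failure) can be sketched as follows. `ImageEntry` and `find_image_for_page` are hypothetical names, not OCR-D core API, and requiring a unique hit at each step is an assumption of this example.

```python
import posixpath
from typing import List, NamedTuple


class ImageEntry(NamedTuple):
    """Hypothetical view of an image mets:file (not an OCR-D core class)."""
    url: str       # FLocat URL or path, relative to mets.xml
    page_id: str


def find_image_for_page(image_filename: str, page_id: str,
                        images: List[ImageEntry]) -> ImageEntry:
    """Find the METS image a PAGE file refers to, or fail loudly."""
    wanted = posixpath.basename(image_filename)
    # Step 1: compare the basename against the basenames of all image URLs.
    by_name = [img for img in images if posixpath.basename(img.url) == wanted]
    if len(by_name) == 1:
        return by_name[0]
    # Step 2: fall back to the image registered for the same pageId.
    by_page = [img for img in images if img.page_id == page_id]
    if len(by_page) == 1:
        return by_page[0]
    # Step 3: nothing unambiguous was found.
    raise LookupError("no unique image for %r on page %r" % (image_filename, page_id))
```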

Personally, I think this is the more sensible interface than add-image-via-PAGE.

Let's default NOT to do this because it really only makes sense when importing data, not, e.g., every time a bashlib processor wants to add an image.

This got me confused: I thought we were talking about adding PAGE-XML files here?