Solution 1) On `workspace add`, change the `imageFilename` of the PAGE.
Solution 2) Have a mechanism in place to map from `@imageFilename` to `@xlink:href` (like "match if fileName is a suffix of a `mets:file/@xlink:href`" or "match if fileName is the GROUPID of a page and there is a `mets:file` with MIME type `image/*` and that GROUPID"; this could be automated, see the sketch below the solutions).
Solution 3) Track the relation externally. A mechanism like this will be necessary anyway because of the problem of local URLs irreversibly replacing remote URLs when downloading files.
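A minimal sketch of the suffix-matching heuristic in solution 2, assuming lxml and a parsed METS tree (the helper `find_image_file` is illustrative, not part of ocrd core):

```python
from lxml import etree

NS = {'mets': 'http://www.loc.gov/METS/'}
XLINK_HREF = '{http://www.w3.org/1999/xlink}href'

def find_image_file(mets_tree, image_filename):
    """Return the mets:file whose FLocat @xlink:href ends with image_filename."""
    for flocat in mets_tree.iterfind('.//mets:file/mets:FLocat', NS):
        href = flocat.get(XLINK_HREF, '')
        if href.endswith(image_filename):
            return flocat.getparent()
    return None
```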
Without being able to oversee the technical consequences: is this only a cosmetic nastiness? I am not sure we ever touch the file refs in PAGE, do we?
See https://ocr-d.github.io/page#url-for-imagefilename--filename
The `imageFilename` is necessary to get from the PAGE to the `mets:file` that represents the image.
Ah, okay. In this case, :+1: for solution 3.
Solution 1 only works if `workspace add` is used, which may be a drawback. Solution 2 sounds complex: there may be several images for the same page (original, binarized, cropped, deskewed, ...). Solution 3 works out of the box, analyzing the whole METS and the referenced PAGEs in one step. This may be done each time an export/import is planned. My vote is for solution 3.
> Solution 1 only works if `workspace add` is used, which may be a drawback.

> Solution 3 works out of the box, analyzing the whole METS and the referenced PAGEs in one step.

Solution 3 reasonably also only works on `workspace add`, since this has to be an external file in the workspace (currently I'm using `url-aliases.csv`). It could be populated by hand or by an external mechanism, but then again, so could the PAGE be changed by hand (or with `sed`).
Pardon my being slow-witted, but what was the reason for https://ocr-d.github.io/page#url-for-imagefilename--filename (always requiring a URL, even if local) in the first place? Why not use relative paths (without `file://` scheme)?

I thought the workspace metaphor would work like a DVCS repository. But if we require URLs everywhere, I cannot move my workspaces around in the filesystem. Am I supposed to `clone -l` instead?

(BTW, `pack`/`unpack` should also beware of file URLs.)
> what was the reason for ocr-d.github.io/page#url-for-imagefilename--filename (always requiring a URL, even if local) in the first place? Why not use relative paths (without `file://` scheme)?

The original plan was to completely forgo the filesystem and use a repository for all intermediate results, not just of workflow runs but of individual processors (hence the file resolver and cache etc.). Processors were to download the data by URL, do their thing, upload the data and set the URL. `file://` URLs or relative paths should be avoided because having them manifest in the data makes it error-prone when tasks are to be distributed, parallelized etc. in a workflow.

The workspace is the place where processors "do their thing", a mere implementation-specific helper for a processor. We considered the mets.xml to be the single source of truth for all data and metadata; it should always be enough to have that mets.xml and access all files via their persistent HTTP URL.

Nowadays, full provenance and reproducibility of every single step is not our top priority anymore. This allows us to make that workspace/Git-like approach a first-class concept. We should adapt the specs to reflect this.
Thanks, it makes sense to me now. But what still escapes me is the logic of:
> `file://` URLs or relative paths should be avoided because having them manifest in the data makes it error-prone when tasks are to be distributed, parallelized etc.

I completely agree as far as `file://` URLs are concerned, but relative paths? Isn't that manifestation the best way to make a distributed system thrive (as DVCS success shows) and scale? Requiring all computation to do I/O via URLs incurs a huge bottleneck and hinders parallelization (due to synchronization effort). Even with a distributed file system (which is an alternative to URLs with client-server transfer protocols), I would recommend allowing intermediate I/O to be local (temporary).
Anyway, if I understand you correctly, you will move towards allowing local intermediate steps and workspaces as a true DVCS. Can I conclude from that that relative file names will be your preferred solution for this issue, too? (Or am I misreading your explanation?)
> Isn't that manifestation the best way to make a distributed system thrive (as DVCS success shows) and scale?

In mass digitisation we cannot assume that mets.xml and referenced data are on the same FS (workspace/DVCS metaphor), so the mets.xml acts more as a manifest.
> Requiring all computation to do I/O via URLs incurs a huge bottleneck and hinders parallelization

Of course you need some form of caching on a local filesystem. Hence the workspace: create a local folder with all required files for a processor to work on. In fact, that was why originally those were created in `/tmp`: because it was mounted in RAM and hence fast.
But once the local processes are complete, ensure that all data is stored persistently and no references to local files remain. You need to do that I/O at some point: download all the files, keep track of which local file represents which file URL, and in the end store it somewhere persistently.
> Can I conclude from that that relative file names will be your preferred solution for this issue, too?

No, I would still prefer URLs to be used in the data. The best way to avoid having references to local-only data is not to persist it. Instead, I'd be for a mechanism to map local filenames to opaque identifiers, such as a URL or whatever string is in the `imageFilename` of a PAGE-XML etc.
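Such a mapping could be as simple as a two-column CSV like the `url-aliases.csv` mentioned above. A minimal sketch, assuming that file format (the helper names are illustrative, not the actual implementation):

```python
import csv

def load_aliases(path='url-aliases.csv'):
    """Load the mapping from local filenames to persistent identifiers."""
    with open(path, newline='') as f:
        # assumes every row is exactly (local_name, persistent_id)
        return {local: remote for local, remote in csv.reader(f)}

def persistent_id(aliases, local_name):
    """Return the opaque identifier to write into METS/PAGE, if known."""
    return aliases.get(local_name, local_name)  # fall back to the local name
```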
> In mass digitisation we cannot assume that mets.xml and referenced data are on the same FS

Sorry, I somehow forgot about that (it seems strange to me now, too). Then the DVCS metaphor is perhaps misleading.
So how about this new scheme: a workspace is nothing but an identical copy of the remote mets.xml (using only public URLs), plus the files at relative paths on the local FS, following the same path name convention as OCRD-ZIP (or something that does not require changing the `imageFilename` and `filename` of PAGE-XML): `mets:fileGrp` `USE` as directory and `mets:file` `ID` as filename.
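As a rough sketch of that convention (the extension mapping is an assumption, not part of any spec):

```python
import os

# Hypothetical path convention: fileGrp @USE as directory, file @ID as basename.
MIME_EXT = {'image/tiff': '.tif', 'image/png': '.png',
            'application/vnd.prima.page+xml': '.xml'}

def local_path(file_grp_use, file_id, mimetype):
    """Derive the workspace-relative path for a mets:file."""
    return os.path.join(file_grp_use, file_id + MIME_EXT.get(mimetype, ''))

# e.g. local_path('OCR-D-IMG', 'OCR-D-IMG_0001', 'image/tiff')
#      -> 'OCR-D-IMG/OCR-D-IMG_0001.tif'
```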
> But once the local processes are complete, ensure that all data is stored persistently and no references to local files remain. You need to do that I/O at some point: download all the files, keep track of which local file represents which file URL, and in the end store it somewhere persistently.

Yes, understood. In the above scheme, there would be no local reference (to keep track of) any more. So as soon as a new file gets added, one would be required to provide a URL for it. Making persistent then simply uploads the modified mets.xml plus the added files to the new URLs.
> Then the DVCS metaphor is perhaps misleading.

I think it is useful for individual processes, as an abstraction for implementers and for workflow/archiving purposes. But from the perspective of a digitisation engineer/sysadmin, it's best to assume:
> the same path name convention as OCRD-ZIP (or something that does not require changing the `imageFilename` and `filename` of PAGE-XML)

This is tricky. It's what I meant by solution 2) above. If we assume that input PAGE files use random strings as `filename`, how would you map that back to the `mets:file`? We tried to require a convention and it failed, reasonably, before even being tested on real-world data (which is even messier, with NFS file paths used as `xlink:href`, or invalid characters for IDs etc.).
> Making persistent then simply uploads the modified mets.xml plus the added files to the new URLs.

This is what I'd consider the repository approach: a service that accepts PUT/POST upload requests and GET download requests. I would have preferred that six months ago, but as you said earlier, it's much more effort for the processors to fetch and upload. It's reasonably easy to integrate into core (we experimented with that early on), but not all contributors build on it, and it makes testing much harder, requires a repository server etc.
IIUC (@VolkerHartmann @wrznr?) we won't have a repository on the task level, and we cannot enforce naming conventions in input data. That leaves us with the option to have external mappings between identifiers, pure file-system access to files, and OCRD-ZIP with the planned BagIt+Git extensions (https://github.com/OCR-D/spec/pull/70 and https://github.com/OCR-D/spec/pull/73) as the exchange format.
Ok, I got it now, that was really asking for your solution 2.
> If we assume that input PAGE files use random strings as `filename`, how would you map that back to the `mets:file`?

Well, I did not say random; I rather meant a convention that fits most existing naming schemes. But I can imagine how that quickly collapses with real-world data.
So (to be more precise): why not just use `mets:fileGrp` `USE` as directory and `mets:file` `ID` as filename exclusively, with solution 1 (and not changing URLs in the METS at all)? And invalid characters for IDs would be a problem in any case, wouldn't they?
>> Making persistent then simply uploads the modified mets.xml plus the added files to the new URLs.
>
> This is what I'd consider the repository approach: a service that accepts PUT/POST upload requests and GET download requests. I would have preferred that six months ago, but as you said earlier, it's much more effort for the processors to fetch and upload. It's reasonably easy to integrate into core (we experimented with that early on), but not all contributors build on it, and it makes testing much harder, requires a repository server etc.

I am not sure I understand that yet. Making persistent in my sense happens at the end of the workflow pipeline. Everything in between can happen locally, and processors could be allowed to create "temporary" annotations (marked by, say, bogus URLs) in between. As you said earlier, at some point that initial/final I/O needs to happen anyway.
I would also favour established standards like BagIt over a self-baked OCRD-ZIP here. But if I am not mistaken, then an external mapping would not be necessary: all the URLs stay in the mets.xml, and all directory and file names in the archive (in this case, under `data/`) or filesystem derive from its `USE` and `ID` attributes. (And that of course does not rule out OCRD-GITZIP either.)
Can't we reference via USE and ID? The modules should already know these values, as they have to address the file via them, don't they? E.g. `mets://OCR-D-IMG/OCR-D-IMG_0001`. This is also the way we download and rename external files referenced in the METS. OK, `mets` may be an invalid protocol.
> Can't we reference via USE and ID?

If by reference you mean store to the filesystem (at `workspace add` or `workspace clone` time) and retrieve from the filesystem (within processors), then this is exactly what I was proposing. (I still do not see the necessity of external file-URL bookkeeping.) After all, the workspace is the filesystem "cache" of a document repository (mets.xml + annotations). Why should it even bother with the filename part of its persistent URLs?
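For illustration only, such a `mets://USE/ID` reference could be resolved to a workspace path roughly like this (the scheme is hypothetical, as noted above, and so is the helper):

```python
from urllib.parse import urlparse
import os

def resolve_mets_ref(ref, workspace_dir, ext=''):
    """Resolve a hypothetical mets://USE/ID reference to a local path."""
    parsed = urlparse(ref)  # e.g. mets://OCR-D-IMG/OCR-D-IMG_0001
    if parsed.scheme != 'mets':
        raise ValueError('not a mets:// reference: %s' % ref)
    use, file_id = parsed.netloc, parsed.path.lstrip('/')
    return os.path.join(workspace_dir, use, file_id + ext)

# resolve_mets_ref('mets://OCR-D-IMG/OCR-D-IMG_0001', '/data/ws', '.tif')
# -> '/data/ws/OCR-D-IMG/OCR-D-IMG_0001.tif'
```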
@kba note to self: revisit after https://github.com/OCR-D/assets/pull/18
We've switched to relative paths throughout. While I still think a mechanism for external URL bookkeeping (as @bertsky puts it) would be useful, it is not currently necessary, so I'll close this until an actual need arises. Thanks for all the feedback.
Pardon me, I have to bring this up again. So, as of v0.15.2 we have relative paths now, which is great. But apart from #96, which shows that some bugs in the general implementation remain, what about the actual original issue presented above?
When I use the `workspace add` way of importing GT data (and I know of no other), I still see `imageFilename` staying untouched, and I still end up with the error:
```
...ocrd_keraslm/test/test_wrapper.py:47:
...ocrd_tesserocr/ocrd_tesserocr/recognize.py:93: in process
    pil_image = self.workspace.resolve_image_as_pil(pcgts.get_Page().imageFilename)
...env3/lib/python3.6/site-packages/ocrd/workspace.py:167: in resolve_image_as_pil
    image_filename = self.download_url(image_url)
...env3/lib/python3.6/site-packages/ocrd/workspace.py:70: in download_url
    return self.resolver.download_to_directory(self.directory, url, src_dir=self.src_dir, **kwargs)
...env3/lib/python3.6/site-packages/ocrd/resolver.py:81: in download_to_directory
    copyfile(url, outfilename)
src = 'kant_aufklaerung_1784_0017.tif', dst = '/tmp/pyocrd-test-ocrd_keraslm/kant.aufklaerung.1784.0017.tif'
E   FileNotFoundError: [Errno 2] No such file or directory: 'kant_aufklaerung_1784_0017.tif'
```
Unless I am mistaken, this needs to be re-opened, too.
> Unless I am mistaken, this needs to be re-opened, too.

You're right, I will spend Monday on these issues (#176 #96 etc)
This is still true with 1.0.0b5. I believe this also affects `workspace clone` and `zip bag`, besides `workspace add`.
I don't think this can wait until the dev workshop.
Revisiting this with @tboenig:

* `imageFilename` in PAGE must always be a relative file path relative to that PAGE file, otherwise tools like Aletheia or PAGEViewer won't work
* `mets:FLocat` is ideally a relative path from the `mets.xml`

So we need logic to determine the relative path from mets.xml to the image by resolving the `imageFilename` of a PAGE against the relative path to that PAGE.
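A sketch of that resolution logic with plain `os.path` (illustrative, not the actual core implementation):

```python
import os

def image_path_relative_to_mets(page_path, image_filename):
    """Rebase a PAGE-relative imageFilename onto the METS directory.

    page_path:      path of the PAGE file, relative to mets.xml
    image_filename: Page/@imageFilename, relative to the PAGE file
    """
    return os.path.normpath(os.path.join(os.path.dirname(page_path), image_filename))

# image_path_relative_to_mets('OCR-D-GT-PAGE/PAGE_0001.xml',
#                             '../OCR-D-IMG/OCR-D-IMG_0001.tif')
# -> 'OCR-D-IMG/OCR-D-IMG_0001.tif'
```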
> * `imageFilename` in PAGE must always be a relative file path relative to that PAGE file, otherwise tools like Aletheia or PAGEViewer won't work
> * `mets:FLocat` is ideally a relative path from the `mets.xml`

Is this the consensus now? Because (a) I want/need to use the PAGE Viewer and (b) it also seems correct.
I think so. But this will have repercussions all over our implementations: until now, everything was relative to the METS. And we have an additional interdependency between tools and data (GT bags) here. So it might take some time until this is available. Until then we all have to live with the hassle of pointing PAGE Viewer to the image every time.
I used to automatically correct the `imageFilename` for easy viewing in PAGE Viewer. But with the latest ocrd 1.0.0b19, the situation is worse, because `ocrd workspace validate` now seems to check for the (in my opinion) incorrect METS-relative filenames:
```
16:24:54.211 INFO ocrd.resolver.download_to_directory - directory=|/srv/data/qurator-data/OCR-D-GT-repacked/busmexpo_742567524| url=|../OCR-D-IMG-BIN/OCR-D-IMG-BIN_0001.png| basename=|OCR-D-IMG-BIN_0001.png| if_exists=|skip| subdir=|TEMP|
16:24:54.211 INFO ocrd.resolver.download_to_directory - directory=|/srv/data/qurator-data/OCR-D-GT-repacked/busmexpo_742567524| url=|/srv/data/qurator-data/OCR-D-GT-repacked/busmexpo_742567524/../OCR-D-IMG-BIN/OCR-D-IMG-BIN_0001.png| basename=|OCR-D-IMG-BIN_0001.png| if_exists=|skip| subdir=|TEMP|
Traceback (most recent call last):
  File "/home/mike/.virtualenvs/ocrd/lib/python3.7/site-packages/ocrd/workspace.py", line 100, in download_file
    f.url = self.resolver.download_to_directory(self.directory, f.url, subdir=f.fileGrp, basename=basename)
  File "/home/mike/.virtualenvs/ocrd/lib/python3.7/site-packages/ocrd/resolver.py", line 77, in download_to_directory
    raise FileNotFoundError("File path passed as 'url' to download_to_directory does not exist: %s" % url)
FileNotFoundError: File path passed as 'url' to download_to_directory does not exist: ../OCR-D-IMG-BIN/OCR-D-IMG-BIN_0001.png
```
It also tries to "download" the local file to `TEMP`, so this seems to be connected to issue #324.
I am sure the new validation was added in preparation for fixing this within the new logic.
But there is a simple remedy: just `--skip=imageFilename`.
Not remedied using the latest master, which has this skip option:
```
% ocrd workspace validate --skip pixel_density --skip imagefilename mets.xml
Traceback (most recent call last):
  File "/home/mike/devel/OCR-D/core/ocrd/ocrd/workspace.py", line 100, in download_file
    f.url = self.resolver.download_to_directory(self.directory, f.url, subdir=f.fileGrp, basename=basename)
  File "/home/mike/devel/OCR-D/core/ocrd/ocrd/resolver.py", line 77, in download_to_directory
    raise FileNotFoundError("File path passed as 'url' to download_to_directory does not exist: %s" % url)
FileNotFoundError: File path passed as 'url' to download_to_directory does not exist: ../OCR-D-IMG-BIN/OCR-D-IMG-BIN_0001.png
```
PAGE filenames will have to be relative to the METS. PAGE Viewer and Aletheia will have options to change the base for relative filenames. Since #333, PAGE filenames in OCRD-ZIP will be updated, but this has not yet been implemented for general workspace methods.
So all that remains to do here is fixing `workspace add`, right?
It should be simple to implement something along the lines of https://github.com/OCR-D/docs/blob/master/fix-gt.sh in core Python...
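A minimal sketch of what such a fix could do in Python, rewriting `Page/@imageFilename` in place (the PAGE namespace version and the helper name are assumptions, and this is not the actual fix-gt.sh logic):

```python
from lxml import etree

# assumed PAGE schema version; GT files may use other namespace versions
PAGE_NS = {'pc': 'http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15'}

def fix_image_filename(page_path, new_image_path):
    """Rewrite Page/@imageFilename in a PAGE-XML file in place."""
    tree = etree.parse(page_path)
    page = tree.find('.//pc:Page', PAGE_NS)
    page.set('imageFilename', new_image_path)
    tree.write(page_path, xml_declaration=True, encoding='UTF-8')
```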
I admit I am slightly puzzled what still needs fixing here... IIUC, there must not/cannot be a case where the PAGE `imageFilename` is NOT relative to the mets.xml: either a PAGE file has been created by some `ocrd-*` process and thus should always be relative to the mets.xml, or the PAGE file is ground truth, in which case we also (need to) ensure this is the case. Or am I missing something? Do you have an example @bertsky?
> Until then we all have to live with the hassle of pointing PAGE Viewer to the image every time.

PAGE Viewer has `--resolve-dir` now: https://github.com/PRImA-Research-Lab/prima-page-viewer/issues/6
> Do you have an example @bertsky?

I would also find that helpful. I'm having a hard time thinking of a case where we add to a workflow PAGE-XML that does not already adhere to the `imageFilename`-relative-to-METS / `imageFilename`-must-be-in-METS patterns. In most cases, workflows will start with images from which we derive PAGE-XML with a correct `imageFilename`, won't they?
> IIUC, there must not/cannot be a case where the PAGE `imageFilename` is NOT relative to the mets.xml: either a PAGE file has been created by some `ocrd-*` process and thus should always be relative to the mets.xml, or the PAGE file is ground truth, in which case we also (need to) ensure this is the case. Or am I missing something?

Neither of these cases is what `ocrd workspace add` is typically used for. You need this for GT files from other sources (or OCR-D GT releases before BagIt/METS, which even now are the only GT with text content). These have varying `@imageFilename` conventions, depending on their directory structure. Now when `ocrd workspace add` reads a PAGE-XML file, it can still resolve the original image in the filesystem, and try to rebase to the workspace.
One obvious use-case would be ocrd-import. (But in that repo, you can still work around the problem by doing `ocrd-make repair` afterwards, at least sometimes.)
But maybe, you'd say, this is too difficult to get right in `ocrd workspace add`, please use `ocrd zip bag` for that! But how will this work if the old URL did not work to begin with?
> when `ocrd workspace add` reads a PAGE-XML file, it can still resolve the original image in the filesystem, and try to rebase to the workspace. [...] But maybe, you'd say, this is too difficult to get right in `ocrd workspace add`

It's a simple enough feature. Questions:

* How to determine file metadata for the `imageFilename`? Media Type can be guessed, but what `mets:fileGrp` to add the images to? Maybe the filegroup used as the input plus suffix `-IMG`? Let's make it toggleable with a `--include-page-images/--no-include-page-images` or similar flag. Let's default NOT to do this because it really only makes sense when importing data, not e.g. every time a bashlib processor wants to add an image.
* Moving images and PAGE to the workspace will require changing the input PAGE. Not really a question, just a statement.
* Also do this for AlternativeImage? Does anyone beside us even use them? I suppose yes and no.
* Any issues that arise from necessary conventions for this are the user's responsibility, i.e. if they want to set a different name or different media type for an image, they either need to post-process the XML themselves or not use this feature and do the image adding themselves as before.
> * Moving images and PAGE to the workspace will require changing the input PAGE. Not really a question, just a statement.

Yes, that's crucial. If we take this seriously, `ocrd workspace add` on PAGE-XML files will either take control of that file or make a copy of it (under the "right" path).
> * Also do this for AlternativeImage? Does anyone beside us even use them? I suppose yes and no.

I guess we have to consider the possibility. If we solve this conceptually for `Page/@imageFilename`, it should work the same for `AlternativeImage/@filename`, though.
> * How to determine file metadata for the `imageFilename`? Media Type can be guessed, but what `mets:fileGrp` to add the images to? Maybe the filegroup used as the input plus suffix `-IMG`?

IIUC you assume here that `ocrd workspace add` will be responsible for adding the image file along with the PAGE-XML file passed to it. We could have other provisions (like assuming the image file must already have been added by then), but let's follow this logic for now: yes, the image could be placed under a fileGrp implicitly derived from the fileGrp for the PAGE-XML, or even the same fileGrp (just with a different MIME type and not appearing in the structMap).
> Let's make it toggleable with a `--include-page-images/--no-include-page-images` or similar flag.

If we add an option, why not just the name of the image file group (or none for "ignore images")?
> * Any issues that arise from necessary conventions for this are the user's responsibility, i.e. if they want to set a different name or different media type for an image, they either need to post-process the XML themselves or not use this feature and do the image adding themselves as before.

Right. And let's think about the second use-case (adding PAGE-XML after the image) more thoroughly: now `ocrd workspace add` can go looking for the (basename of the) filename in the (image) FLocat URLs of the METS, and calculate the new relative path for the PAGE-XML under its destination directory. If it does not find an image with that filename, it can still go looking for an image with the same pageId. And then it can fail loudly.

Personally, I think this is a more sensible interface than add-image-via-PAGE.
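Sketched in Python, with hypothetical helper names and a simplified view of the METS (the real `ocrd workspace add` logic may differ):

```python
import os

def find_matching_image(image_files, image_filename, page_id=None):
    """Find the mets:file matching a PAGE imageFilename.

    image_files: iterable of (file_id, page_id, href) tuples taken from
                 the image fileGrps of the METS
    """
    basename = os.path.basename(image_filename)
    # First try: same basename as some image FLocat href.
    for file_id, pid, href in image_files:
        if os.path.basename(href) == basename:
            return file_id
    # Fallback: any image on the same page.
    if page_id is not None:
        for file_id, pid, href in image_files:
            if pid == page_id:
                return file_id
    # Otherwise: fail loudly.
    raise ValueError('no image in METS matches %r' % image_filename)
```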
> Let's default NOT to do this because it really only makes sense when importing data, not e.g. every time a bashlib processor wants to add an image.

This got me confused: I thought we were talking about adding PAGE-XML files here?
Scenario:

1. Create a METS file and run `workspace add`.
2. Now the PAGE `imageFilename` and the `xlink:href` of the corresponding `mets:file` do not match anymore.