OCR-D / core

Collection of OCR-related python tools and wrappers from @OCR-D
https://ocr-d.de/core/
Apache License 2.0
117 stars 31 forks source link

ocrd process via METS server option #1243

Closed bertsky closed 1 week ago

bertsky commented 2 weeks ago

fixes #1241

bertsky commented 2 weeks ago

I now have some realistic tests as well.

But before I push these: Please help me understand a detail of the current implementation @kba: In … https://github.com/OCR-D/core/blob/79c61e303c87f229d5c96aedc0da31ef82b0f5d3/src/ocrd/task_sequence.py#L120 …the workspace instantiated from the --mets argument does not contain a dst_dir kwarg, which means that it will merely be a clone in a temporary working directory which then via run_cli makes its way into the individual processor CLIs … https://github.com/OCR-D/core/blob/79c61e303c87f229d5c96aedc0da31ef82b0f5d3/src/ocrd/processor/helpers.py#L198-L199

So how could this ever have worked?

bertsky commented 2 weeks ago

cf. da7e960 i.e. need for resolve_mets_arguments is what I'm talking about.

kba commented 2 weeks ago

I now have some realistic tests as well.

But before I push these: Please help me understand a detail of the current implementation @kba: In …

https://github.com/OCR-D/core/blob/79c61e303c87f229d5c96aedc0da31ef82b0f5d3/src/ocrd/task_sequence.py#L120

…the workspace instantiated from the --mets argument does not contain a dst_dir kwarg, which means that it will merely be a clone in a temporary working directory which then via run_cli makes its way into the individual processor CLIs …

https://github.com/OCR-D/core/blob/79c61e303c87f229d5c96aedc0da31ef82b0f5d3/src/ocrd/processor/helpers.py#L198-L199

So how could this ever have worked?

TBH, I just never encounter the use case of passing a HTTP URL to a METS file directly to any of the CLI that accept --mets nowadays, I always clone before and pass a local METS file as --mets, in which case, dst_dir is derived from Path(mets_url).parent.

cf. https://github.com/OCR-D/core/commit/da7e96076f192ba506a70208501ddeeaa167bf63 i.e. need for resolve_mets_arguments is what I'm talking about.

-    workspace = resolver.workspace_from_url(mets, mets_server_url=mets_server_url)
+    workdir, mets, basename, _ = resolver.resolve_mets_arguments(None, mets, None)
+    workspace = resolver.workspace_from_url(mets, workdir, mets_basename=basename,
+                                            mets_server_url=mets_server_url)

What is the intended effect here? Because resolve_mets_arguments(None, mets) will raise a ValueError if mets is a remote (HTTP) URL. If it is a local file, it will also return Path(mets).parent, so I don't see how it is better than just calling workspace_from_url directly.

But it might very well be that my assumptions are wrong - under what circumstances are the temporary directories a problem?

bertsky commented 2 weeks ago

TBH, I just never encounter the use case of passing a HTTP URL to a METS file directly to any of the CLI that accept --mets nowadays,

This is not what I am talking about, and not what da7e960 fixes.

I always clone before and pass a local METS file as --mets, in which case, dst_dir is derived from Path(mets_url).parent.

Not it was not: If you passed a local path other than in the CWD to ocrd process, then prior to da7e960 the dst_dir argument would be None and therefore a temporary copy of the workspace would be operated on (which would not even have shown up or persisted after processing).

What is the intended effect here? Because resolve_mets_arguments(None, mets) will raise a ValueError if mets is a remote (HTTP) URL.

Again, this is not about a remote (http*) METS.

If it is a local file, it will also return Path(mets).parent,

Strike also – and you have the reason for the change and for my question: how could this have gone unnoticed for so long?

so I don't see how it is better than just calling workspace_from_url directly.

Because that function needs a dst_dir argument or will only yield a partial clone.

kba commented 1 week ago

Because that function needs a dst_dir argument or will only yield a partial clone.

I don't have a good explanation right now, I tend to run the tools from the workspace directory so I may not have noticed it because of that. Will investigate further, but since the PR fixes that issue at least where you noticed it and it adds necessary functionality to ocrd process, let's merge.