Closed bertsky closed 1 week ago
I now have some realistic tests as well.
But before I push these: Please help me understand a detail of the current implementation @kba: In …
https://github.com/OCR-D/core/blob/79c61e303c87f229d5c96aedc0da31ef82b0f5d3/src/ocrd/task_sequence.py#L120
…the workspace instantiated from the --mets
argument does not contain a dst_dir
kwarg, which means that it will merely be a clone in a temporary working directory which then via run_cli
makes its way into the individual processor CLIs …
https://github.com/OCR-D/core/blob/79c61e303c87f229d5c96aedc0da31ef82b0f5d3/src/ocrd/processor/helpers.py#L198-L199
So how could this ever have worked?
cf. da7e960 i.e. need for resolve_mets_arguments
is what I'm talking about.
I now have some realistic tests as well.
But before I push these: Please help me understand a detail of the current implementation @kba: In …
…the workspace instantiated from the
--mets
argument does not contain adst_dir
kwarg, which means that it will merely be a clone in a temporary working directory which then viarun_cli
makes its way into the individual processor CLIs …So how could this ever have worked?
TBH, I just never encounter the use case of passing a HTTP URL to a METS file directly to any of the CLI that accept --mets
nowadays, I always clone before and pass a local METS file as --mets
, in which case, dst_dir
is derived from Path(mets_url).parent
.
cf. https://github.com/OCR-D/core/commit/da7e96076f192ba506a70208501ddeeaa167bf63 i.e. need for resolve_mets_arguments is what I'm talking about.
- workspace = resolver.workspace_from_url(mets, mets_server_url=mets_server_url)
+ workdir, mets, basename, _ = resolver.resolve_mets_arguments(None, mets, None)
+ workspace = resolver.workspace_from_url(mets, workdir, mets_basename=basename,
+ mets_server_url=mets_server_url)
What is the intended effect here? Because resolve_mets_arguments(None, mets)
will raise a ValueError if mets
is a remote (HTTP) URL. If it is a local file, it will also return Path(mets).parent
, so I don't see how it is better than just calling workspace_from_url
directly.
But it might very well be that my assumptions are wrong - under what circumstances are the temporary directories a problem?
TBH, I just never encounter the use case of passing a HTTP URL to a METS file directly to any of the CLI that accept
--mets
nowadays,
This is not what I am talking about, and not what da7e960 fixes.
I always clone before and pass a local METS file as
--mets
, in which case,dst_dir
is derived fromPath(mets_url).parent
.
Not it was not: If you passed a local path other than in the CWD to ocrd process
, then prior to da7e960 the dst_dir
argument would be None
and therefore a temporary copy of the workspace would be operated on (which would not even have shown up or persisted after processing).
What is the intended effect here? Because
resolve_mets_arguments(None, mets)
will raise a ValueError ifmets
is a remote (HTTP) URL.
Again, this is not about a remote (http*) METS.
If it is a local file, it will also return
Path(mets).parent
,
Strike also
– and you have the reason for the change and for my question: how could this have gone unnoticed for so long?
so I don't see how it is better than just calling
workspace_from_url
directly.
Because that function needs a dst_dir
argument or will only yield a partial clone.
Because that function needs a
dst_dir
argument or will only yield a partial clone.
I don't have a good explanation right now, I tend to run the tools from the workspace directory so I may not have noticed it because of that. Will investigate further, but since the PR fixes that issue at least where you noticed it and it adds necessary functionality to ocrd process
, let's merge.
fixes #1241