Closed kba closed 1 month ago
The easiest way would have been to just change the docstring. But the current behavior is problematic because silently skipping downloading/copying the METS XML will lead to hard-to-debug errors when users subsequently work on the wrong METS XML file.
I tend to agree, but the discussion in #425 initially outlined a different scenario:
I do not understand the use-cases for clobber_mets=False, and I know of no example of it The original idea was that you pass a METS URL which is only downloaded the first time it is to be used. Subsequent processors should not overwrite the METS every time they process a workspace.
However, you then followed up with the following explanation, which already speaks to the current line of thought:
The METS might not be downloaded yet. That's why we introduced the idea of a workspace which was initially just a crutch to make it easier for processors and we were careful not to use it as a first-class concept e.g. in our specs. Instead of handling the intricacies of downloading full/partial representations of the work, they now just had to act on a directory and files. Since then however, workspaces have become the standard way to interact with METS/Images/PAGE and most if not all the facilities to handle non-local data outside ocrd workspace has become obsolete.
So IIUC the initial scenario of always using workspace_from_url
to get a handle on METS and its contents already went out of the window, and we now have to properly differentiate between creating and using a workspace. Thus the workspace_from_url
function now entirely falls into the former category, and clobber_mets=False
(still the default) without a failure/exception does not make sense anymore.
So IIUC the initial scenario of always using
workspace_from_url
to get a handle on METS and its contents already went out of the window, and we now have to properly differentiate between creating and using a workspace. Thus theworkspace_from_url
function now entirely falls into the former category, andclobber_mets=False
(still the default) without a failure/exception does not make sense anymore
Yeah, it is a bit messy due to poor initial planning and yes, the "clone once then work on the local workspace" approach is the default now, the "on demand" approach is largely obsolete. But I'd keep clobber_mets=False
to avoid introducing a new source of mysterious behavior. I cannot really imagine a scenario where overwriting the METS during clone would be preferable to removing the workspace manually or cloning into a different directory but the flag/kwarg is there if people do need that.
As laid out in #563, the
clobber_mets
behavior was different from what the docstring said. Instead of raising and Exception if the METS file already existed andclobber_mets
beingFalse
, the METS file was silently skipped.The easiest way would have been to just change the docstring. But the current behavior is problematic because silently skipping downloading/copying the METS XML will lead to hard-to-debug errors when users subsequently work on the wrong METS XML file.