OCR-D / core

Collection of OCR-related python tools and wrappers from @OCR-D
https://ocr-d.de/core/
Apache License 2.0
118 stars 31 forks source link

Raise FileExistsError in Resolver.workspace_from_url if clobber_mets is False #1268

Closed kba closed 1 month ago

kba commented 2 months ago

As laid out in #563, the clobber_mets behavior was different from what the docstring said. Instead of raising and Exception if the METS file already existed and clobber_mets being False, the METS file was silently skipped.

The easiest way would have been to just change the docstring. But the current behavior is problematic because silently skipping downloading/copying the METS XML will lead to hard-to-debug errors when users subsequently work on the wrong METS XML file.

bertsky commented 2 months ago

The easiest way would have been to just change the docstring. But the current behavior is problematic because silently skipping downloading/copying the METS XML will lead to hard-to-debug errors when users subsequently work on the wrong METS XML file.

I tend to agree, but the discussion in #425 initially outlined a different scenario:

I do not understand the use-cases for clobber_mets=False, and I know of no example of it The original idea was that you pass a METS URL which is only downloaded the first time it is to be used. Subsequent processors should not overwrite the METS every time they process a workspace.

However, you then followed up with the following explanation, which already speaks to the current line of thought:

The METS might not be downloaded yet. That's why we introduced the idea of a workspace which was initially just a crutch to make it easier for processors and we were careful not to use it as a first-class concept e.g. in our specs. Instead of handling the intricacies of downloading full/partial representations of the work, they now just had to act on a directory and files. Since then however, workspaces have become the standard way to interact with METS/Images/PAGE and most if not all the facilities to handle non-local data outside ocrd workspace has become obsolete.

So IIUC the initial scenario of always using workspace_from_url to get a handle on METS and its contents already went out of the window, and we now have to properly differentiate between creating and using a workspace. Thus the workspace_from_url function now entirely falls into the former category, and clobber_mets=False (still the default) without a failure/exception does not make sense anymore.

kba commented 1 month ago

So IIUC the initial scenario of always using workspace_from_url to get a handle on METS and its contents already went out of the window, and we now have to properly differentiate between creating and using a workspace. Thus the workspace_from_url function now entirely falls into the former category, and clobber_mets=False (still the default) without a failure/exception does not make sense anymore

Yeah, it is a bit messy due to poor initial planning and yes, the "clone once then work on the local workspace" approach is the default now, the "on demand" approach is largely obsolete. But I'd keep clobber_mets=False to avoid introducing a new source of mysterious behavior. I cannot really imagine a scenario where overwriting the METS during clone would be preferable to removing the workspace manually or cloning into a different directory but the flag/kwarg is there if people do need that.