OCR-D / core

Collection of OCR-related python tools and wrappers from @OCR-D
https://ocr-d.de/core/
Apache License 2.0

workspace bagger: allow selecting pages for download/inclusion #1215

Open bertsky opened 5 months ago

bertsky commented 5 months ago

It would be nice if ocrd zip bag supported creating partial clones, with some FLocats kept as mere URLs instead of local paths in the payload.

Possible use cases:

On the CLI, it would just be another option, but I am not sure it is even allowed in the BagIt data format.

MehmedGIT commented 3 months ago

Here is the request we talked about during our meeting today. Please take a look at the following block of code:

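    from ocrd import Resolver, Workspace
    from ocrd.workspace_bagger import WorkspaceBagger

    # workspace_dir, mets_basename, ocrd_identifier and bag_dest are supplied by the caller;
    # a default Resolver is assumed here
    resolver = Resolver()
    workspace = Workspace(resolver, directory=workspace_dir, mets_basename=mets_basename)
    WorkspaceBagger(resolver).bag(
        workspace,
        ocrd_identifier=ocrd_identifier,
        dest=bag_dest,
        ocrd_mets=mets_basename,
        processes=1
    )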

It would be great if the WorkspaceBagger.bag() method also accepted an extra flag, skip_download, to avoid downloading file groups that do not exist in local storage. There are, of course, the whitelist and blacklist options include_fileGrp and exclude_fileGrp, which achieve the same effect by ignoring selected file groups, but using them requires extra steps plus knowledge of which file groups are available locally and which are not. I am mainly interested in doing this programmatically. How the bagger CLI should handle skip_download does not matter much to me, so I have no extra requirements there.
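
To illustrate the "extra steps" mentioned above, here is a minimal sketch of how one might work out which file groups are fully available locally before calling the bagger. It uses only the standard library. The helper name locally_available_file_groups and the (file group, relative path) pairs are hypothetical stand-ins for the METS file entries, not part of OCR-D/core; a path of None is assumed to mean the file is referenced by URL only.

```python
import os
import tempfile


def locally_available_file_groups(workspace_dir, entries):
    """Return the sorted file groups whose files all exist under workspace_dir.

    `entries` is a hypothetical stand-in for METS file entries: an iterable of
    (file_group, relative_path_or_None) pairs, where None means "URL only".
    """
    complete = {}
    for file_group, rel_path in entries:
        present = rel_path is not None and os.path.isfile(
            os.path.join(workspace_dir, rel_path)
        )
        # A group stays "complete" only while every one of its files is present.
        complete[file_group] = complete.get(file_group, True) and present
    return sorted(group for group, ok in complete.items() if ok)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as workspace_dir:
        # Only the image file group has a file on disk in this toy workspace.
        os.makedirs(os.path.join(workspace_dir, "OCR-D-IMG"))
        open(os.path.join(workspace_dir, "OCR-D-IMG", "page1.tif"), "w").close()
        entries = [
            ("OCR-D-IMG", "OCR-D-IMG/page1.tif"),
            ("OCR-D-OCR", None),  # remote only
        ]
        print(locally_available_file_groups(workspace_dir, entries))
```

The resulting list could then be handed to the include_fileGrp option mentioned above. A skip_download flag would make this bookkeeping unnecessary for callers.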