OCR-D / core

Collection of OCR-related python tools and wrappers from @OCR-D
https://ocr-d.de/core/
Apache License 2.0
118 stars 31 forks source link

Ability to delete a group #245

Closed mikegerber closed 4 years ago

mikegerber commented 5 years ago

ocrd workspace delete-group could be a command to delete a group/fileGrp in the workspace/METS file.

My use case for this:

Currently, I use xmlstarlet to remove a group of files so that I can repeat a script run, i.e. to enable repeated (and almost idempotent) runs of the following:

remove_filegrp() {
  filegrp_use=$1
  mets=$2

  xmlstarlet ed --inplace \
    -N mets=http://www.loc.gov/METS/ \
    -d "//mets:fileGrp[@USE='$filegrp_use']" $mets
}

remove_filegrp OCR-D-SEG-LINE mets.xml
ocrd-ocropy-segment -l DEBUG -m mets.xml -I OCR-D-IMG -O OCR-D-SEG-LINE
bertsky commented 5 years ago

I see the need, too.

It is also already possible to use the CLI's workspace backup facility for this, but you have to be careful to create a backup beforehand, and undo does not always pick the right version, and restore sometimes fails because the checksum is not unique (and then requires a full backup filename).

But I generally would prefer to have an option for overwriting existing groups and files instead.

bertsky commented 5 years ago

@mikegerber BTW you should also remove those file IDs in the structMap/div/fptr references. Here is an additional XPath expression to have xmlstarlet delete (before the actual removal):

//mets:structMap//mets:fptr[@FILEID=//mets:fileGrp[@USE='$filegrp_use']/mets:file/@ID]
kba commented 5 years ago

I see the need, too.

OK then, ocrd workspace rm-group shouldn't be too hard to implement. Should this behave like rmdir and only delete empty fileGroups or like rm -r and also delete non-empty fileGroups?

Probably should also have the equivalent for removing files then, i.e. ocrd workspace rm. What do you think?

But I generally would prefer to have an option for overwriting existing groups and files instead.

Like an option --overwrite/--force for ocrd workspace add?

kba commented 5 years ago

Like an option --overwrite/--force for ocrd workspace add?

We already have --force ("If file with ID already exists, replace it"). What behavior changes do you want here?

bertsky commented 5 years ago

Like an option --overwrite/--force for ocrd workspace add?

We already have --force ("If file with ID already exists, replace it"). What behavior changes do you want here?

No, I meant an option for ocrd process and all individual module processor CLIs. (The --force option is not exposed to those CLIs.)

Thus we can have repeated runs without even looking at which files and groups are affected.

OK then, ocrd workspace rm-group shouldn't be too hard to implement. Should this behave like rmdir and only delete empty fileGroups or like rm -r and also delete non-empty fileGroups?

Probably should also have the equivalent for removing files then, i.e. ocrd workspace rm. What do you think?

IMHO, this depends on the above: if we get a simple way to repeat processing steps, then these removal operations would become but an extra, for special circumstances. So it would be okay to be cautious and have the user first delete all files (and file references!) – like rm – and then empty groups – like rmdir.

But if (for some reason) we don't get some ocrd process -f, then removal should be easy to use – like rm -fr.

mikegerber commented 5 years ago

@mikegerber BTW you should also remove those file IDs in the structMap/div/fptr references.

My current WIP workflow using ocrd-typegroups-classifier, ocrd-ocropy-segment & ocrd-tesserocr-recognize does not create any such references?

mikegerber commented 5 years ago

But if (for some reason) we don't get some ocrd process -f, then removal should be easy to use – like rm -fr.

I agree.

bertsky commented 5 years ago

@mikegerber BTW you should also remove those file IDs in the structMap/div/fptr references.

My current WIP workflow using ocrd-typegroups-classifier, ocrd-ocropy-segment & ocrd-tesserocr-recognize does not create any such references?

But it should could some day. Since GROUPID was abandoned in favour of physical page ID, every processor should at least pass on the page id of the input. (And those processors which already do will create such additional references.)

See here

kba commented 5 years ago

My current WIP workflow using ocrd-typegroups-classifier, ocrd-ocropy-segment & ocrd-tesserocr-recognize does not create any such references?

It should do that when using the OcrdMets/OcrdFile APIs and specifying a pageId for workspace.add_file. But it's optional for the ocrd workspace add command and not implemented by the MPs you've listed. When iterating over the input files, the pageId of those input files should be passed on to the output files...

Would it make sense to require the pageId to be set and fail if it is missing for all add operations?

kba commented 5 years ago

OLD:

self.workspace.add_file(
    ID=ID,
    file_grp=self.output_file_grp,
    mimetype=MIMETYPE_PAGE,
    local_filename="%s/%s" % (self.output_file_grp, ID),
    content=to_xml(pcgts)
)

NEW

self.workspace.add_file(
    ID=ID,
    file_grp=self.output_file_grp,
    mimetype=MIMETYPE_PAGE,
    local_filename="%s/%s" % (self.output_file_grp, ID),
    content=to_xml(pcgts),
    pageId=input_file.pageId
)
kba commented 5 years ago

Sorry, that was redundant. What @bertsky said

bertsky commented 5 years ago

Would it make sense to require the pageId to be set and fail if it is missing for all add operations?

I think it should be required for Processor when it wants to add page-level result files – this is my understanding of the spec. But there are other cases, where it must not be required: adding files for AlternativeImage for example.

kba commented 5 years ago

There are now (as of https://github.com/OCR-D/core/releases/tag/v1.0.0b13) two commands for this:

Programmatic access via remove(ID) and remove_group(GROUP) in Workspace and OcrdMets.

Comment if you need further file removal capabilities in ocrd process.

bertsky commented 5 years ago

Sorry to bring this up so late, but could we make both the ID argument for remove and the GROUP argument for remove-group multi-valent (i.e. Click's nargs=-1)?

Also, would it be much extra effort to also provide ocrd process --overwrite (and/or the same option for ocrd.decorators.ocrd_cli_options for all individual CLIs)?

bertsky commented 5 years ago

IMO the current situation with removal is still unsatisfactory. We have --force for remove and remove-group, but this does not behave like rm -f at all (despite our discussion), instead it allows for removal in local filesystem (which should really be unconditional). It complains if the group/file did not exist yet – this is what I would expect to suppress via --force. And without --overwrite / -o in the processor, we always have to write extra steps for removal in our workflow configurations.

I am currently exploring the possibility of makefileization for workflow configuration – squashing configuration syntax (including incremental parameters, what can be done in parallel, and what should be evaluated in the end) and engine logic (calculation of dependencies for rebuilding partially and parallel execution). For that alone the aforementioned changes would be highly valuable.

bertsky commented 5 years ago

To elaborate on --force: Currently I have 2 options:

  1. not use -f and leave lots of stale files and directories, which I cannot trivally remove without the risk of removing files and directories along which are still referenced in the METS.
  2. use -f but risk stumbling across cases were files have accidentally been removed already, in which case the METS does not get updated at all, but the remove-group action aborts:
    workspace.remove_file_group(g, recursive, force)
    File "core/ocrd/ocrd/workspace.py", line 148, in remove_file_group
    self.remove_file(f.ID, force=force)
    File "core/ocrd/ocrd/workspace.py", line 130, in remove_file
    unlink(ocrd_file.local_filename)
    FileNotFoundError

Neither of those can be scripted easily.

bertsky commented 5 years ago

And rmdir is missing in the current implementation as well (it should belong to remove-group).

bertsky commented 4 years ago

And a short option variant -f is missing for remove.

mikegerber commented 4 years ago

I'd like to note that the missing rmdir and the FileNotFoundError is something that I deal with every day when handling workspaces. (Of course I deal with a lot of slightly inconsistent workspaces due to developing, fixing bugs etc., nevertheless it would make life more pleasant if the rmdir was there and the FileNotFoundError was caught.)

bertsky commented 4 years ago

Same here. This not only complicates development. I also see it coming with the pilot users.

kba commented 4 years ago

I hear you, I'm on it. This, the TEMP-directory-on-validation and input-file-copying, and improving ocrd process are next on my agenda for core.

stweil commented 4 years ago

Is it already possible to add all files of a directory as a new group, or would that require an add-group command?

kba commented 4 years ago

No, you have to add files sequentially at the moment. I'll work on the group removal tomorrow, maybe add-group isn't too complicated. We have the need as well.

On Mon, Jan 13, 2020, 20:47 Stefan Weil notifications@github.com wrote:

Is it already possible to add all files of a directory as a new group, or would that require an add-group command?

— You are receiving this because you modified the open/close state. Reply to this email directly, view it on GitHub https://github.com/OCR-D/core/issues/245?email_source=notifications&email_token=AACCXV2USLXTJKZXOHXGAHDQ5TAM3A5CNFSM4HZJDR32YY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOEI2BHCQ#issuecomment-573838218, or unsubscribe https://github.com/notifications/unsubscribe-auth/AACCXV3DUAMNT6X5C47GF7DQ5TAM3ANCNFSM4HZJDR3Q .

kba commented 4 years ago

Removal Implemented in #409