Closed mikegerber closed 4 years ago
I see the need, too.
It is also already possible to use the CLI's workspace backup
facility for this, but you have to be careful to create a backup beforehand, and undo
does not always pick the right version, and restore
sometimes fails because the checksum is not unique (and then requires a full backup filename).
But I generally would prefer to have an option for overwriting existing groups and files instead.
@mikegerber BTW you should also remove those file IDs in the structMap/div/fptr
references. Here is an additional XPath expression to have xmlstarlet delete (before the actual removal):
//mets:structMap//mets:fptr[@FILEID=//mets:fileGrp[@USE='$filegrp_use']/mets:file/@ID]
I see the need, too.
OK then, ocrd workspace rm-group
shouldn't be too hard to implement. Should this behave like rmdir
and only delete empty fileGroups or like rm -r
and also delete non-empty fileGroups?
Probably should also have the equivalent for removing files then, i.e. ocrd workspace rm
. What do you think?
But I generally would prefer to have an option for overwriting existing groups and files instead.
Like an option --overwrite
/--force
for ocrd workspace add
?
Like an option --overwrite/--force for ocrd workspace add?
We already have --force
("If file with ID already exists, replace it"). What behavior changes do you want here?
Like an option --overwrite/--force for ocrd workspace add?
We already have
--force
("If file with ID already exists, replace it"). What behavior changes do you want here?
No, I meant an option for ocrd process
and all individual module processor CLIs. (The --force
option is not exposed to those CLIs.)
Thus we can have repeated runs without even looking at which files and groups are affected.
OK then,
ocrd workspace rm-group
shouldn't be too hard to implement. Should this behave likermdir
and only delete empty fileGroups or likerm -r
and also delete non-empty fileGroups?Probably should also have the equivalent for removing files then, i.e.
ocrd workspace rm
. What do you think?
IMHO, this depends on the above: if we get a simple way to repeat processing steps, then these removal operations would become but an extra, for special circumstances. So it would be okay to be cautious and have the user first delete all files (and file references!) – like rm
– and then empty groups – like rmdir
.
But if (for some reason) we don't get some ocrd process -f
, then removal should be easy to use – like rm -fr
.
@mikegerber BTW you should also remove those file IDs in the
structMap/div/fptr
references.
My current WIP workflow using ocrd-typegroups-classifier, ocrd-ocropy-segment & ocrd-tesserocr-recognize does not create any such references?
But if (for some reason) we don't get some
ocrd process -f
, then removal should be easy to use – likerm -fr
.
I agree.
@mikegerber BTW you should also remove those file IDs in the
structMap/div/fptr
references.My current WIP workflow using ocrd-typegroups-classifier, ocrd-ocropy-segment & ocrd-tesserocr-recognize does not create any such references?
But it should could some day. Since GROUPID
was abandoned in favour of physical page ID, every processor should at least pass on the page id of the input. (And those processors which already do will create such additional references.)
See here
My current WIP workflow using ocrd-typegroups-classifier, ocrd-ocropy-segment & ocrd-tesserocr-recognize does not create any such references?
It should do that when using the OcrdMets/OcrdFile APIs and specifying a pageId
for workspace.add_file
. But it's optional for the ocrd workspace add
command and not implemented by the MPs you've listed. When iterating over the input files, the pageId
of those input files should be passed on to the output files...
Would it make sense to require the pageId to be set and fail if it is missing for all add operations?
self.workspace.add_file(
ID=ID,
file_grp=self.output_file_grp,
mimetype=MIMETYPE_PAGE,
local_filename="%s/%s" % (self.output_file_grp, ID),
content=to_xml(pcgts)
)
self.workspace.add_file(
ID=ID,
file_grp=self.output_file_grp,
mimetype=MIMETYPE_PAGE,
local_filename="%s/%s" % (self.output_file_grp, ID),
content=to_xml(pcgts),
pageId=input_file.pageId
)
Sorry, that was redundant. What @bertsky said
Would it make sense to require the pageId to be set and fail if it is missing for all add operations?
I think it should be required for Processor
when it wants to add page-level result files – this is my understanding of the spec. But there are other cases, where it must not be required: adding files for AlternativeImage
for example.
There are now (as of https://github.com/OCR-D/core/releases/tag/v1.0.0b13) two commands for this:
ocrd workspace remove [--force] ID
to delete file with ID ID
. If --force
: delete from local storage if file references local fileocrd workspace remove-group [--recursive] [--force]
to delete file group GROUP
. If --recursive
is given, also delete containing mets:files. If --force
is given, also delete local files.Programmatic access via remove(ID)
and remove_group(GROUP)
in Workspace
and OcrdMets
.
Comment if you need further file removal capabilities in ocrd process
.
Sorry to bring this up so late, but could we make both the ID
argument for remove
and the GROUP
argument for remove-group
multi-valent (i.e. Click's nargs=-1
)?
Also, would it be much extra effort to also provide ocrd process --overwrite
(and/or the same option for ocrd.decorators.ocrd_cli_options
for all individual CLIs)?
IMO the current situation with removal is still unsatisfactory. We have --force
for remove
and remove-group
, but this does not behave like rm -f
at all (despite our discussion), instead it allows for removal in local filesystem (which should really be unconditional). It complains if the group/file did not exist yet – this is what I would expect to suppress via --force
. And without --overwrite
/ -o
in the processor, we always have to write extra steps for removal in our workflow configurations.
I am currently exploring the possibility of makefileization for workflow configuration – squashing configuration syntax (including incremental parameters, what can be done in parallel, and what should be evaluated in the end) and engine logic (calculation of dependencies for rebuilding partially and parallel execution). For that alone the aforementioned changes would be highly valuable.
To elaborate on --force
: Currently I have 2 options:
-f
and leave lots of stale files and directories, which I cannot trivally remove without the risk of removing files and directories along which are still referenced in the METS. -f
but risk stumbling across cases were files have accidentally been removed already, in which case the METS does not get updated at all, but the remove-group
action aborts:
workspace.remove_file_group(g, recursive, force)
File "core/ocrd/ocrd/workspace.py", line 148, in remove_file_group
self.remove_file(f.ID, force=force)
File "core/ocrd/ocrd/workspace.py", line 130, in remove_file
unlink(ocrd_file.local_filename)
FileNotFoundError
Neither of those can be scripted easily.
And rmdir
is missing in the current implementation as well (it should belong to remove-group
).
And a short option variant -f
is missing for remove
.
I'd like to note that the missing rmdir
and the FileNotFoundError
is something that I deal with every day when handling workspaces. (Of course I deal with a lot of slightly inconsistent workspaces due to developing, fixing bugs etc., nevertheless it would make life more pleasant if the rmdir
was there and the FileNotFoundError was caught.)
Same here. This not only complicates development. I also see it coming with the pilot users.
I hear you, I'm on it. This, the TEMP-directory-on-validation and input-file-copying, and improving ocrd process are next on my agenda for core.
Is it already possible to add all files of a directory as a new group, or would that require an add-group
command?
No, you have to add files sequentially at the moment. I'll work on the group removal tomorrow, maybe add-group isn't too complicated. We have the need as well.
On Mon, Jan 13, 2020, 20:47 Stefan Weil notifications@github.com wrote:
Is it already possible to add all files of a directory as a new group, or would that require an add-group command?
— You are receiving this because you modified the open/close state. Reply to this email directly, view it on GitHub https://github.com/OCR-D/core/issues/245?email_source=notifications&email_token=AACCXV2USLXTJKZXOHXGAHDQ5TAM3A5CNFSM4HZJDR32YY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOEI2BHCQ#issuecomment-573838218, or unsubscribe https://github.com/notifications/unsubscribe-auth/AACCXV3DUAMNT6X5C47GF7DQ5TAM3ANCNFSM4HZJDR3Q .
Removal Implemented in #409
ocrd workspace delete-group
could be a command to delete a group/fileGrp
in the workspace/METS file.My use case for this:
Currently, I use xmlstarlet to remove a group of files so that I can repeat a script run, i.e. to enable repeated (and almost idempotent) runs of the following: