OCR-D / core

Collection of OCR-related python tools and wrappers from @OCR-D
https://ocr-d.de/core/
Apache License 2.0

ocrd workspace rename-group: update file refs in ALTO, too #913

Open bertsky opened 2 years ago

bertsky commented 2 years ago

The current implementation of Workspace.rename_file_group is smart enough to also update the affected image file references within PAGE files:

https://github.com/OCR-D/core/blob/71d295ac1fccbeb4164e230bd584e1920b9ab3c8/ocrd/ocrd/workspace.py#L324-L342

It would be even better if ALTO files (i.e. /alto/Description/sourceImageInformation/fileName) were updated in a similar fashion.
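A minimal sketch of what such an ALTO update could look like, using only the standard library. The function name `update_alto_image_refs` is hypothetical and not part of OCR-D core, and the sketch assumes the fileGrp name appears as a directory component of the path stored in `fileName`. It matches elements by local name so that it does not depend on a particular ALTO namespace version.

```python
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip a '{namespace}' prefix so that all ALTO versions match."""
    return tag.rsplit('}', 1)[-1]


def update_alto_image_refs(alto_xml, old_grp, new_grp):
    """Rewrite sourceImageInformation/fileName after a fileGrp rename.

    Hypothetical helper: returns the serialized document and the number
    of references that were changed.
    """
    root = ET.fromstring(alto_xml)
    changed = 0
    for info in root.iter():
        if local_name(info.tag) != 'sourceImageInformation':
            continue
        for child in info:
            if local_name(child.tag) != 'fileName' or not child.text:
                continue
            old_path = child.text.strip()
            # Assumption: the group name is a directory component of the path.
            new_path = '/'.join(new_grp if part == old_grp else part
                                for part in old_path.split('/'))
            if new_path != old_path:
                child.text = new_path
                changed += 1
    # Note: ElementTree may rewrite namespace prefixes when serializing.
    return ET.tostring(root, encoding='unicode'), changed
```

A real implementation would more likely reuse the same old-to-new filename mapping that the PAGE update already computes, rather than rewriting path components independently.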

bertsky commented 2 years ago

Also, I think it would be useful to add an option to leave local files where they are and to skip ID changes as well. (In that case, no references need to be updated, and it is much faster.)

Another option would be to offer just making the new group an alias of the old one (as implemented via XSLT 1.0 in workflow-configuration).

bertsky commented 11 months ago

> Another option would be to offer just making the new group an alias of the old one (as implemented via XSLT 1.0 in workflow-configuration).

@kba should we make that a separate issue? (Use-cases are aliasing input fileGrp to OCR-D-IMG for our common workflows, or aliasing output fileGrp FULLTEXT to ALTO for myCore.)

bertsky commented 11 months ago

> Another option would be to offer just making the new group an alias of the old one (as implemented via XSLT 1.0 in workflow-configuration).

Ouch, just noticed that mets-alias-filegrp.xsl is fundamentally broken: the same XML IDs must not be reused, so I would have to rename them in the new fileGrp (and re-reference them in the physical structMap). Since this kind of thing cannot easily be done in XSL (v1.0 anyway), let's please provide that via Python.
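To illustrate the aliasing idea with unique IDs, here is a rough standard-library sketch. The function `alias_file_group` and its ID renaming scheme are assumptions for illustration, not an existing OCR-D API. It also assumes that each `mets:fileGrp` sits directly under `mets:fileSec` and that files are referenced via `mets:fptr` in the physical structMap.

```python
import copy
import xml.etree.ElementTree as ET

METS_NS = 'http://www.loc.gov/METS/'
NS = {'mets': METS_NS}


def alias_file_group(mets_root, old_grp, new_grp):
    """Copy fileGrp old_grp as new_grp with fresh file IDs.

    Hypothetical helper: the copied files keep their FLocat (so they
    point to the same files), get new unique IDs, and are referenced by
    additional fptr elements in the physical structMap.
    Returns the mapping of old file IDs to new file IDs.
    """
    file_sec = mets_root.find('.//mets:fileSec', NS)
    old = file_sec.find(f"mets:fileGrp[@USE='{old_grp}']", NS)
    if old is None:
        raise ValueError(f'no such fileGrp: {old_grp}')
    existing_ids = {el.get('ID') for el in mets_root.iter() if el.get('ID')}

    alias = copy.deepcopy(old)
    alias.set('USE', new_grp)
    id_map = {}
    for mets_file in alias.iter(f'{{{METS_NS}}}file'):
        old_id = mets_file.get('ID')
        # Assumed naming scheme: swap the group name inside the ID,
        # otherwise prefix the ID with the new group name.
        new_id = (old_id.replace(old_grp, new_grp) if old_grp in old_id
                  else f'{new_grp}_{old_id}')
        if new_id in existing_ids:
            raise ValueError(f'ID already in use: {new_id}')
        existing_ids.add(new_id)
        mets_file.set('ID', new_id)
        id_map[old_id] = new_id
    file_sec.append(alias)

    # Reference the aliased files next to the original ones.
    phys = mets_root.find(".//mets:structMap[@TYPE='PHYSICAL']", NS)
    if phys is not None:
        for div in phys.iter(f'{{{METS_NS}}}div'):
            for fptr in div.findall('mets:fptr', NS):
                new_id = id_map.get(fptr.get('FILEID'))
                if new_id:
                    ET.SubElement(div, f'{{{METS_NS}}}fptr', FILEID=new_id)
    return id_map
```

For the myCore use case mentioned above, this would be called as `alias_file_group(root, 'FULLTEXT', 'ALTO')` on a parsed METS document.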