Current situation
The ocrd workspace bulk-add command adds many files to a workspace/METS in one go, which is significantly more efficient than running an external loop, e.g. in Bash, and adding files individually with ocrd workspace add.
The bulk-add command is based on regular expressions that are applied to the list of files to be added. Applying these patterns to the filenames yields the values for fileGrp, ID, pageID etc.
There are two major drawbacks to this approach:
unintuitive: Regular expressions are notoriously cryptic, especially for people unfamiliar with the concept.
rigid: For the patterns to match consistently, the files need to be named and placed in subdirectories consistently. Real-world data is often messy, and "incorrectly" named files lead to incomplete additions to the workspace/METS.
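The rigidity can be illustrated with a minimal Python sketch. It is not the bulk-add implementation: the naming convention, the pattern and the helper derive_fields are hypothetical and only show how values are derived from paths and how a file that breaks the convention silently drops out.

```python
import re

# Hypothetical convention: <fileGrp>/<anything>_<pageID>.<extension>
PATH_PATTERN = re.compile(r"(?P<fileGrp>[^/]+)/[^/]*_(?P<pageID>\d+)\.(?P<ext>\w+)")


def derive_fields(paths):
    """Split paths into those the pattern can interpret and those it cannot."""
    matched, skipped = [], []
    for path in paths:
        m = PATH_PATTERN.fullmatch(path)
        if m is None:
            # The file exists, but nothing can be derived from its name.
            skipped.append(path)
            continue
        fields = m.groupdict()
        # Assumption for this sketch: the ID is built from fileGrp and pageID.
        fields["ID"] = f"{fields['fileGrp']}_{fields['pageID']}"
        fields["path"] = path
        matched.append(fields)
    return matched, skipped


paths = [
    "OCR-D-IMG/OCR-D-IMG_0001.tif",
    "OCR-D-IMG/OCR-D-IMG_0002.tif",
    "OCR-D-IMG/scan 3 (final).tif",  # does not follow the convention
]
matched, skipped = derive_fields(paths)
print(len(matched), "matched;", len(skipped), "skipped")
```

The third file never reaches the workspace unless the pattern is made more complex or the file is renamed.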
How it should be
Instead of just filenames, allow users to prepare a whitespace-delimited list of fields to feed into bulk-add, either via command line arguments or via STDIN. Reading from STDIN lets users pipe a CSV, possibly created with a spreadsheet tool, into the CLI.
With this approach, users still need to define a regular expression, but a much simpler one: essentially the header line of a spreadsheet, defining the fields.
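A minimal Python sketch of the "header line" idea follows. It is an illustration under assumptions, not the proposed implementation: the field names in FIELD_NAMES, their order and the helper parse_records are hypothetical.

```python
import re

# Hypothetical header: one named group per whitespace-delimited column.
FIELD_NAMES = ["pageID", "fileGrp", "ID", "mimetype", "path"]
LINE_PATTERN = re.compile(r"\s+".join(rf"(?P<{name}>\S+)" for name in FIELD_NAMES))


def parse_records(lines):
    """Turn whitespace-delimited lines into one dict of fields per file."""
    records = []
    for number, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        m = LINE_PATTERN.fullmatch(line)
        if m is None:
            raise ValueError(
                f"line {number}: expected {len(FIELD_NAMES)} fields: {line!r}"
            )
        records.append(m.groupdict())
    return records


lines = [
    "PHYS_0001 OCR-D-IMG OCR-D-IMG_0001 image/tiff images/page1.tif",
    "PHYS_0002 OCR-D-IMG OCR-D-IMG_0002 image/tiff images/scan-3-final.tif",
]
records = parse_records(lines)
print(records[1]["pageID"], records[1]["path"])
```

Because every value is stated explicitly, the second file no longer needs to follow any naming convention. To read from STDIN instead, the same function could be called as parse_records(sys.stdin). Note that this simple pattern cannot handle values that themselves contain whitespace.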
Steps