Clean up and homogenize scripts code

bertfrees commented 6 years ago

Part 1

[x] Split scripts into load, convert and store steps. Name them all "px:foo.convert" and not "px:foo-convert".
[ ] Homogenize names of ports of component steps. E.g. "source.fileset" and "source.in-memory" and not "fileset.in" and "in-memory.in".
[ ] Possibly convert single-document ports and even anyFileURI options to fileset + in-memory ports.
[ ] Merge fileset and in-memory ports into a single port, where the first document would be the fileset and the rest the in-memory documents.

Part 2

[ ] Add input validation to all scripts (to load steps): https://github.com/daisy/pipeline-scripts/issues/112. There should be an option named "validation" with possible values "off", "report" and "abort" (see https://github.com/daisy/pipeline-scripts/issues/111).
[ ] Give all scripts a "status" output port with a d:status document (for now d:validation-status, see https://github.com/daisy/pipeline-framework/issues/121).
[ ] Give all store steps an output port with a c:result document.
[ ] Make sure all scripts have media-type set on all inputs, outputs and options (https://github.com/daisy/pipeline-tasks/issues/105)

bertfrees commented 6 years ago

@rdeltour @josteinaj Thoughts?

josteinaj commented 6 years ago

Sounds great!

Is it worth the time to rename all the "fileset.in" and "in-memory.in" to "source.fileset" and "source.in-memory" though? Also, you'd have to rename "fileset.out" and "in-memory.out" to "result.fileset" and "result.in-memory" for consistency.

bertfrees commented 6 years ago

Compared to the other items I think the port renaming will take relatively little effort. I just think that the "in" and "out" are redundant because you can see whether it's in or out by looking at the port type: p:input or p:output. But more importantly, I was thinking about what to do if you had multiple fileset inputs or outputs. In that case you want to give them more meaningful names. I'm planning to give steps that consist of several stages an output for each of the intermediary formats. This can be useful for debugging for instance. For example I already do this for dtbook-to-pef.convert: it has an output port called "result" and an output port called "obfl". In the case of filesets these ports would be named "obfl.fileset", "obfl.in-memory", "result.fileset" and "result.in-memory".

By the way I have also been thinking about merging fileset and in-memory ports into a single port, where the first document would be the fileset and the rest the in-memory documents. I think this could make the code less verbose and more readable. I didn't include this idea in this issue because it would be quiet a drastic change, but it might be something for the future.

you'd have to rename "fileset.out" and "in-memory.out" to "result.fileset" and "result.in-memory" for consistency.

Yes, of course.

josteinaj commented 6 years ago

Ok, yeah it makes sense.

Merging the fileset and in-memory ports would be nice. It would make it easier to connect steps, while in every step you'd start and end with some step to split/combine the fileset and in-memory parts. Maybe px:fileset-join could be extended and reused for the result and a new px:fileset-read could be created for reading the fileset and in-memory parts. This could be gradually introduced script-by-script, it doesn't have to be a big refactoring.

bertfrees commented 6 years ago

My idea was to combine the ports even in the fileset utility steps, so that normally you wouldn't even have to split/merge the sequences in the higher-level steps. (Note that this would be a big refactoring.)

rdeltour commented 6 years ago

Cleaning and harmonizing as you suggest sounds good to me! 👍

I'm just not sure about this one:

Possibly convert single-document ports and even anyFileURI options to fileset + in-memory ports.

For very simple steps it may be overkill to use the fileset + in-memory model, as it would complicate their usage?

bertfrees commented 5 years ago

Yes you're probably right.

However if we want to merge the fileset and in-memory parts, like described above, we will have both complex (combined fileset+in-memory) ports and simple ports, and it will not be immediately clear what kind it is.

daisy / pipeline-scripts

Clean up and homogenize scripts code #141

Part 1

Part 2