common-workflow-language / cwltool

Common Workflow Language reference implementation
https://cwltool.readthedocs.io/
Apache License 2.0

Enable ontology support for directories in addition to File and File[] #1617

Open kannon92 opened 2 years ago

kannon92 commented 2 years ago

Hello,

We work in the image processing domain and we would like to be able to guarantee that a user has the correct ontology before submitting a workflow.

I posted this on the Discourse forum, and it turns out that ontology checking currently only works for File and File[]. We have a separate issue to log a warning if someone uses the format field with a Directory class.

I'd like to request the ability to use ontologies for Directory as well.

kinow commented 2 years ago

Hi @kannon92 ! :wave:

How do you see the ontology being used with a directory? Would something like myonto:ImagesListDirectory be applied to a Directory to validate that the directory contains a certain list of files?

I think if we do a few iterations, exercising how we would apply the ontology to a directory, we might be able to either come up with some possible future implementations, or decide to use some other workaround to validate the directories.

kannon92 commented 2 years ago

I am not sure if ontology checking actually looks at the files.

I just want to enforce a failure if someone specifies an output of a certain format and then uses it as another input where the formats don't match. That should be an error.

And yea, I think eventually we would want to adopt something like [EDAM-BIOIMAGING](https://bioportal.bioontology.org/ontologies/EDAM-BIOIMAGING) for the format field. But right now, we will be okay with a custom list of allowed file formats on each command line tool.
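
To make the requested behaviour concrete, here is a minimal sketch of the check in plain Python. It is not how cwltool validates formats. The names `check_format` and `FormatMismatch` and the `myonto:` identifiers are hypothetical, and treating a missing format as an error is an assumption.

```python
from typing import Iterable, Optional


class FormatMismatch(ValueError):
    """Raised when a produced format is not accepted by the consuming input."""


def check_format(source_format: Optional[str], allowed_formats: Iterable[str]) -> None:
    """Fail if an upstream output's format is not in the consumer's allowed list.

    Assumption: formats are compared as exact strings (e.g. IRIs); no ontology
    reasoning such as subclass matching is attempted here.
    """
    allowed = set(allowed_formats)
    if not allowed:
        # The consuming input declares no format constraint, so anything passes.
        return
    if source_format is None:
        # Assumption: an undeclared format is treated as a failure.
        raise FormatMismatch(f"no format declared; expected one of {sorted(allowed)}")
    if source_format not in allowed:
        raise FormatMismatch(
            f"format {source_format!r} is not one of {sorted(allowed)}"
        )


# Hypothetical per-tool list of accepted directory formats.
ALLOWED = ["myonto:ImagesListDirectory"]

check_format("myonto:ImagesListDirectory", ALLOWED)  # matching format: passes silently
```

A mismatch such as `check_format("myonto:SomethingElse", ALLOWED)` raises `FormatMismatch`, which is the "that should be an error" case described above.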

tetron commented 2 years ago

@kannon92 possibly you could use secondaryFiles? If there's something you can designate as a "primary" file (usually an entry point or manifest of some sort) you can have all the other files and subdirectories that appear along with it tied in as secondary files.

kannon92 commented 2 years ago

Hello. I don't know if that really helps me out. In my case, there is typically a large number of image files that are all of a specific format.

mr-c commented 2 years ago

> Hello. I don't know if that really helps me out. In my case, these are typically large amount of image files that are all of a specific format.

Is each file a specific format? Or are they all of the same format?

Is there a specification of how this directory is supposed to be structured?

Are there subdirectories?

Do multiple apps use this structure?

Do you need to construct this structure from different parts of the workflow, and/or decompose it for later steps? Or is this directory structure provided as an initial input and some later version of it as a final output?

From a typing perspective, another option is a custom record type with entries for the different types of files (each entry being of type File or an array of Files, both having a format field; or an entry that is another custom record type with its own fields). Some entries can also be optional and/or have secondaryFiles with specific variations on the file suffixes. At execution time this custom record type could be transformed into a directory hierarchy with a particular layout and naming scheme using InitialWorkDirRequirement, but that could get a bit cumbersome.
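
As a rough illustration of the record-to-directory idea, the following Python sketch maps a record of File entries onto a directory layout. It only models the data shape: in a real workflow the staging would be done with InitialWorkDirRequirement. The function name `plan_layout`, the field names (`manifest`, `images`, `masks`), the `myonto:` formats, and the one-subdirectory-per-field naming scheme are all hypothetical.

```python
from pathlib import PurePosixPath
from typing import Any, Dict


def plan_layout(record: Dict[str, Any]) -> Dict[str, str]:
    """Map every File in a record to a relative path in the staged directory.

    Assumption: each record field is a File object (a dict with "basename" and
    "location"), a list of such objects, or None for an optional entry.
    """
    layout: Dict[str, str] = {}
    for field, value in record.items():
        if value is None:
            continue  # optional entry that was not provided
        files = value if isinstance(value, list) else [value]
        for f in files:
            # Hypothetical naming scheme: one subdirectory per record field.
            target = PurePosixPath(field) / f["basename"]
            layout[str(target)] = f["location"]
    return layout


# Hypothetical record value with a single File, a File array, and an optional entry.
record = {
    "manifest": {"class": "File", "basename": "index.json",
                 "location": "file:///data/index.json"},
    "images": [
        {"class": "File", "basename": "a.tif",
         "location": "file:///data/a.tif", "format": "myonto:Tiff"},
        {"class": "File", "basename": "b.tif",
         "location": "file:///data/b.tif", "format": "myonto:Tiff"},
    ],
    "masks": None,
}

layout = plan_layout(record)
```

Because each entry keeps its own `format` field, the per-file format checks that already work for File and File[] stay available, while the directory structure is only assembled at execution time.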