galaxyproject / galaxy

Data intensive science for everyone.
https://galaxyproject.org

Tool syntax: Allow explicit specification of format for collections #7175

Open bernt-matthias opened 5 years ago

bernt-matthias commented 5 years ago

If I have a collection like this

<collection name="NAME" type="paired" label="${tool.name} on ${on_string}"/>

which I fill in the command block like:

mv FASTQ '$NAME.forward'

Galaxy needs to sniff the format. I guess it would be nice if the collection element would have a format parameter for such cases...?

mvdbeek commented 5 years ago

Can you extend that example a bit? I don't understand where sniffing comes into play. There is a format parameter for elements, but I guess that's not what you're looking for? (https://planemo.readthedocs.io/en/latest/writing_advanced.html#static-element-count)

bernt-matthias commented 5 years ago

Elements of collections usually (mostly?) have the same format. In particular, this holds for paired collections. So if the tool is implemented as above, i.e. the elements of the collection are assigned explicitly (I don't know if this is possible for lists?), then neither sniffing nor detecting the outputs by regular expressions is necessary. But currently Galaxy does not allow setting the format of all elements of a collection at once.

But, you are right: I could also use something like ln FASTQ FASTQ.forward in the command block and then <discover_datasets/> or <data/> in the <outputs>.

In particular the latter seems equivalent to my use case. Would this be preferred?

Btw, the latter option (i.e. using data in a collection) is undocumented in https://docs.galaxyproject.org/en/master/dev/schema.html#tool-outputs-collection
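
For reference, a rough sketch of that second option: <data/> elements nested inside the paired collection, so that each element carries an explicit format and nothing has to be sniffed. This is only a fragment of a tool file, not a complete tool. The output name NAME and the file FASTQ come from the example above; the fastqsanger format and the second file FASTQ_R2 are assumptions made up for illustration.

```xml
<command><![CDATA[
    ## Move the result files onto the paths Galaxy assigned to each element.
    ## FASTQ_R2 is a hypothetical second file, not part of the original example.
    mv FASTQ '$NAME.forward' &&
    mv FASTQ_R2 '$NAME.reverse'
]]></command>
<outputs>
    <collection name="NAME" type="paired" label="${tool.name} on ${on_string}">
        <!-- The format is declared per element (fastqsanger is an assumed datatype). -->
        <data name="forward" format="fastqsanger" />
        <data name="reverse" format="fastqsanger" />
    </collection>
</outputs>
```

The format still has to be repeated on every element, which is exactly what a format attribute on <collection> itself would avoid.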

mvdbeek commented 5 years ago

I still don't understand: data and data_collection have a format that you can specify?

mvdbeek commented 5 years ago

Oh ... that's not true. How was that never a problem ???

bernt-matthias commented 5 years ago

I guess in most cases <discover_datasets/> or <data/> are used, which allow specifying the format.
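
A sketch of the <discover_datasets/> variant, assuming the tool writes its result files into a directory called outputs (the directory name, the list type and the fastqsanger format are all illustrative assumptions, not taken from a real tool):

```xml
<outputs>
    <collection name="NAME" type="list" label="${tool.name} on ${on_string}">
        <!-- Each file found in outputs/ becomes one list element, named after the file. -->
        <!-- The format given here applies to every discovered element. -->
        <discover_datasets pattern="__name__" format="fastqsanger" directory="outputs" />
    </collection>
</outputs>
```

Here a single format covers the whole collection, but the elements are only known after the job has finished.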

mvdbeek commented 5 years ago

Hmm, data wouldn't work for lists. Try to avoid discover_datasets: it puts an additional burden on Galaxy and limits which jobs can be queued ahead.

mvdbeek commented 5 years ago

I guess one factor in why this isn't a bigger problem is that most of the time you don't need to explicitly construct lists; you'd just map over a normal input.

bernt-matthias commented 5 years ago

I also struggle to come up with an example for lists, because they usually have a dynamic number of elements, e.g. demultiplexers or file splitters.

But for pairs it could be handy.

mvdbeek commented 5 years ago

For pairs you can use data. The problem is lists, or lists of pairs.