galaxyproject / galaxy

Data intensive science for everyone.
https://galaxyproject.org

Set element_id for data in paired collections to be a join of the sample identifier and paired end indicator (e.g. "sample_forward" and "sample_reverse") #7533

Open lparsons opened 5 years ago

lparsons commented 5 years ago

It appears that the element_identifier for data in a paired-end collection is always forward or reverse, dropping the outer collection's identifier. This is problematic at times (e.g. when running FastQC and then MultiQC, see https://github.com/galaxyproject/tools-iuc/issues/1595).

There is a workaround: first flatten the collection (which creates <sample>_forward and <sample>_reverse). However, it seems like it might be nicer to have element_identifier for paired collections set to this concatenated value automatically. Ping @jmchilton.
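To illustrate the naming that flattening produces, here is a minimal Python sketch. It is not Galaxy's implementation: the nested dict standing in for a collection, the function name flatten_identifiers, and the file names are all hypothetical and used only for illustration.

```python
from typing import Dict

# Hypothetical stand-in for a list of pairs: outer sample identifier mapped
# to its inner "forward"/"reverse" elements. Not Galaxy's data model.
PairedList = Dict[str, Dict[str, str]]


def flatten_identifiers(collection: PairedList, sep: str = "_") -> Dict[str, str]:
    """Join each outer identifier with its inner paired-end identifier."""
    flat: Dict[str, str] = {}
    for sample_id, pair in collection.items():
        for end in ("forward", "reverse"):
            # e.g. "sampleA" + "_" + "forward" -> "sampleA_forward"
            flat[f"{sample_id}{sep}{end}"] = pair[end]
    return flat


# Made-up example data, assumed for illustration only.
collection = {
    "sampleA": {"forward": "sampleA_R1.fastq", "reverse": "sampleA_R2.fastq"},
    "sampleB": {"forward": "sampleB_R1.fastq", "reverse": "sampleB_R2.fastq"},
}
flat = flatten_identifiers(collection)
print(list(flat))
```

With identifiers joined like this, each FastQC report would carry a name that is unique across samples, rather than every report being named forward or reverse.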

mvdbeek commented 5 years ago

I agree that a convenience function that also includes all outer element identifiers would be useful. Note that accessing $some_input.element_identifier always gives you the identifier of $some_input. Nothing stops us from changing the multiqc wrapper to do $input_collection.forward.element_identifier, except that the fastqc / multiqc combination just doesn't make a lot of sense for paired end data.

mvdbeek commented 5 years ago

(There is also some extended discussion in https://github.com/galaxyproject/tools-iuc/pull/2028 regarding the fastqc problems)

lparsons commented 5 years ago

I'm not sure that modifying the MultiQC wrapper would work that well, as one would only see a single row of data for each sample from FastQC, and there really are two (and IMHO, they should be shown separately, see below).

Is there a way to get the "parent" element identifier in the FastQC wrapper when the input is a list:paired? That would make the most sense to me, and would make the FastQC output at least make sense in a MultiQC output.

In other words, a modification of https://github.com/galaxyproject/tools-iuc/blob/270bb876857d700ecc7fb9d1757c63dcfeb401aa/tools/fastqc/rgFastQC.xml#L16 so that we get more than just forward or reverse when running on a list:paired. I'd be happy to put together a PR, but I'm not really sure how to get that element_id or if it's even possible.

image

mvdbeek commented 5 years ago

I think if you have it as in the screenshot it's only helpful if you have a handful of samples. Did you try https://github.com/galaxyproject/tools-iuc/pull/2028/files ? I'm pretty sure it does what you'd like to do, except it only generates a report for the reverse read.

lparsons commented 5 years ago

I agree that it's not as useful as the number of samples goes up, but it is useful in some cases. If it was easy enough to modify the FastQC wrapper to get the "parent" element_id, it wouldn't make anything any worse, but at least it would avoid the problem of MultiQC showing every sample merged together as forward and reverse when it's simply being used to summarize a FastQC report. Is that doable?