galaxyproject / tools-iuc

Tool Shed repositories maintained by the Intergalactic Utilities Commission
https://galaxyproject.org/iuc
MIT License

DESeq2 wrapper: how to improve it? #2239

Open bebatut opened 5 years ago

bebatut commented 5 years ago

Hi all,

@mvdbeek recently did a nice job adding more functionality to the DESeq2 wrapper. Some features are still missing and would be useful to more users, such as handling interaction factors.

A discussion started on Gitter with @lparsons and @mvdbeek. What about continuing it here? @pavanvidem and others may be interested to join the discussion too :smile:

Can we come up with a plan? Should we have a call to discuss it (and write the meeting notes here)?

Bérénice

nsoranzo commented 5 years ago

Ping @mblue9 too.

mvdbeek commented 5 years ago

This is the list that comes to mind, feel free to edit:

mblue9 commented 5 years ago

Thanks @nsoranzo.

I could try to adapt edgeR and limma-voom similarly, as it would be better for users if the tool forms are consistent where possible.

lparsons commented 5 years ago

Ping @hepcat72.

hepcat72 commented 5 years ago

I was just using DESeq2. I swear there used to be an option to not output a results file. Perhaps I'm mis-remembering, but here's where that's useful:

If you have a number of comparisons to make, but for QC purposes, you want to run all the factors with all the samples to see how the plots turn out in terms of PCA, clustering, etc., you don't really need to use the results for any sort of analysis - the only purpose of the run is for QC. Thus, after I started the job, I immediately deleted the results dataset to clean up my history.

Other things which I find cumbersome (which you may have already solved? - I admit I haven't read the above)...

In the above use case (the one I just ran), I had 3 factors with 3, 2, and 3 levels: strain, treatment, and run, respectively. I needed to know which factor accounted for the most variance. Usually, to run these things, I need to unhide the featureCounts count datasets that a workflow produced in a collection, as there's no way to select individual datasets that reside inside a collection, and creating multiple overlapping collections is more labor-intensive than just selecting individual datasets.

Searching through hidden datasets to unhide the ones in the featureCounts collection is cumbersome because their names are interspersed with similarly named summary and lengths datasets. Perhaps I could have used the search feature - I always forget to do this.

I also need to rename them (from the generic "featureCounts on dataset..." names) using the column headers that represent the original sample names in the input collection, so I can easily select them in the DESeq2 interface. (I should have used the rename dataset action, but I usually neglect to do this. An easier way to retrieve the original sample name would be very useful, regardless of renaming.) Perhaps the group tags will solve this.

bimbam23 commented 5 years ago

Hi, it would be nice to have the same option of using a single count table. In addition, is it possible to use single files from a data collection?
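
To illustrate the single count table request: such a table is one gene-by-sample matrix rather than one count file per sample. Below is a minimal Python sketch of how per-sample files could be merged into that shape. It is not part of the DESeq2 wrapper; the function name `merge_counts`, the two-column `gene<TAB>count` layout with a one-line header, the sample names, and the choice to fill genes missing from a sample with 0 are all assumptions made for illustration.

```python
import csv
import io


def merge_counts(samples):
    """Merge per-sample (gene, count) tables into one gene-by-sample table.

    `samples` maps a sample name to an open text handle containing
    tab-separated `gene<TAB>count` lines after a one-line header
    (an assumed layout, not the wrapper's actual input format).
    """
    names = list(samples)
    table = {}  # gene -> {sample name -> count}
    for name, handle in samples.items():
        reader = csv.reader(handle, delimiter="\t")
        next(reader)  # skip the header line
        for gene, count in reader:
            table.setdefault(gene, {})[name] = int(count)
    # Assumption: a gene absent from a sample's file is given a count of 0.
    rows = [["gene"] + names]
    for gene in sorted(table):
        rows.append([gene] + [table[gene].get(n, 0) for n in names])
    return rows


# Hypothetical in-memory inputs standing in for two per-sample count files.
demo = {
    "wt_ctrl": io.StringIO("Geneid\tcount\ngeneA\t10\ngeneB\t0\n"),
    "wt_treat": io.StringIO("Geneid\tcount\ngeneA\t7\ngeneC\t3\n"),
}
merged = merge_counts(demo)
for row in merged:
    print("\t".join(str(value) for value in row))
```

Keying the header on sample names also touches on the renaming problem mentioned earlier in the thread: the columns carry the original sample names instead of generic dataset names.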