galaxyproject / galaxy

Data intensive science for everyone.
https://galaxyproject.org

internal data converters should keep the original display title #4654

Open bgruening opened 7 years ago

bgruening commented 7 years ago

For tools that consume the display name, e.g. for plotting, it would be nice if the internal implicit converters kept the display name of the input dataset instead of changing it.

bgruening commented 7 years ago

ping @dpryan79 @joachimwolff

mvdbeek commented 7 years ago

Is that a workaround for problems with collections?

dpryan79 commented 7 years ago

@mvdbeek An example would be taking a bedGraph file as input into computeMatrix from deepTools. Galaxy will use its converter to go from bedGraph -> bigWig, but then the sample name is lost and you end up with something like "bedGraphToBigWig on dataset 1234" as the display name. That then makes it really hard for people to track which sample is which downstream. If these just did "{tool.name} on display_name" (or whatever the correct way to write that is) as the output label then people wouldn't accidentally swap samples later on (well, it'd at least make them less likely to do so).

In the case of computeMatrix you often have a lot of these conversions in parallel, so it's really easy for the user to lose track of what file is which sample.
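To illustrate the difference described above, here is a minimal Python sketch. It is not Galaxy's converter code: the `Dataset` class, both label functions, and the sample name are hypothetical, used only to contrast a label built from the history number with one built from the input's display name.

```python
from dataclasses import dataclass


@dataclass
class Dataset:
    """Hypothetical stand-in for a history dataset (not a Galaxy class)."""
    hid: int           # position in the history, e.g. 1234
    display_name: str  # user-visible name, e.g. the sample name


def label_from_hid(tool_name: str, dataset: Dataset) -> str:
    # Pattern described in the comment: the sample name is lost.
    return f"{tool_name} on dataset {dataset.hid}"


def label_from_display_name(tool_name: str, dataset: Dataset) -> str:
    # Proposed pattern: carry the input's display name into the output label.
    return f"{tool_name} on {dataset.display_name}"


sample = Dataset(hid=1234, display_name="sampleA.bedgraph")
print(label_from_hid("bedGraphToBigWig", sample))
print(label_from_display_name("bedGraphToBigWig", sample))
```

With many conversions running in parallel, only the second form lets you tell the outputs apart without looking up each history number.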

mvdbeek commented 7 years ago

Alright, that should be fairly simple to fix in the dataset converters themselves (better defaults that is). But there wouldn't be any problem if you used collections and then the collection element's element_identifier, with the added benefit that you would have to try hard to mix up samples.

mblue9 commented 6 years ago

> Alright, that should be fairly simple to fix in the dataset converters themselves (better defaults that is)

Really?

This would be a really great thing to fix imo - to have more interpretable names from these tools both for datasets within collections, and also datasets outside collections. Because not all tools work well with collections at the moment (in my hands anyway), especially if you're trying to also incorporate MultiQC, like the issue described here: https://github.com/galaxyproject/tools-iuc/issues/1658

mvdbeek commented 6 years ago

https://gist.github.com/anonymous/4fc591da23912d51806764a8875cd48f should work to at least keep the name reference, but implicit conversions don't work currently for collections. I think that's one of the last things collections can't do at the moment.

> Because not all tools work well with collections at the moment (in my hands anyway), especially if you're trying to also incorporate MultiQC, like the issue described here: galaxyproject/tools-iuc#1658

MultiQC works beautifully with collections, some time ago I shared a workflow on gitter for this, I can try to dig this out if you want. The issue is that FastQC doesn't work for paired-end data, and there is nothing we can do on the Galaxy side about this except for wrapping something that does work for paired-end FASTQ files. The (not so great) workaround I use is to analyse the reverse reads, which you can get from a list:paired collection with the unzip collection tool.
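The workaround mentioned above (taking the reverse reads out of a list:paired collection) can be pictured with a small Python sketch. It models the collection as a plain dictionary and is not the Galaxy unzip tool itself; the function name `unzip_paired`, the sample identifiers, and the file names are made up for illustration.

```python
from typing import Dict, Tuple

# A list:paired collection modelled as
# {element_identifier: {"forward": ..., "reverse": ...}}.
PairedCollection = Dict[str, Dict[str, str]]


def unzip_paired(collection: PairedCollection) -> Tuple[Dict[str, str], Dict[str, str]]:
    """Split a paired collection into two flat lists keyed by element identifier."""
    forward = {name: pair["forward"] for name, pair in collection.items()}
    reverse = {name: pair["reverse"] for name, pair in collection.items()}
    return forward, reverse


samples: PairedCollection = {
    "sample1": {"forward": "sample1_R1.fastq", "reverse": "sample1_R2.fastq"},
    "sample2": {"forward": "sample2_R1.fastq", "reverse": "sample2_R2.fastq"},
}

forward_reads, reverse_reads = unzip_paired(samples)
# Each flat list keeps the element identifier, so per-sample labels survive.
print(reverse_reads)
```

The point of the sketch is that the element identifier stays attached to each read file after the split, which is what keeps samples from being mixed up downstream.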

If there are other tools that don't work with collections we should fix them. I have been using collections exclusively for at least 1.5 years. While there are some rough edges (see lacking implicit conversions) I could not possibly go back with the number of samples we have.

mblue9 commented 6 years ago

> MultiQC works beautifully with collections, some time ago I shared a workflow on gitter for this, I can try to dig this out if you want.

YES PLEASE @mvdbeek !! 👍 👍 😃 🎉 Any examples you have of workflows would be SO helpful to see! and save me time I could spend on tool wrapping instead 😄

mvdbeek commented 6 years ago

I've used this one before: https://gist.github.com/mvdbeek/d9d31691b0301a6d320bd83da5dd39b7

But some of the tool versions have changed in the meantime, and I haven't re-run it to test that all still works as it should.

mblue9 commented 6 years ago

Great thanks @mvdbeek! I'll check it out

mblue9 commented 6 years ago

I've been trying to figure out which of the Galaxy genome browsers works best for quickly viewing bunches of BAM files contained in collections.

So far I've tried IGV, Trackster, BAM iobio and UCSC Main, and atm UCSC Main is looking the most promising in my hands.

The one thing that would be really great to have though, is displaying the identifier instead of "Tool on data N", as I have to go and doublecheck which sample is "Tool on data 21", "Tool on data 19", "Tool on data 11" etc...

Is this the issue here:

> but implicit conversions don't work currently for collections. I think that's one of the last things collections can't do at the moment.

I'm wondering if there are any plans to implement that at some stage, or do I just have to live with this? It's not the end of the world, but if there were a way to display the identifier, that would certainly make life a bit easier.