Toolform support for selecting datasets from within collections

hexylena commented 9 years ago

My users are finding dataset collections to be not so user-friendly. They generate collections (e.g. sequencing data), then do the map step (assembly), and then are stuck not being able to access the data within collections. They want to do manual analysis of different files within that collection.

Within the "select single/multiple datasets" UI, it would be nice if collections were listed alongside (maybe bold) and then datasets within collections listed below the header and indented. Much like how timezones are used as headers here https://select2.github.io/examples.html

hexylena commented 9 years ago

@guerler anything I can do to help here? This feature is getting more important as I'm updating more tools to use collections... I don't want my users to be "locked in" to collections and unable to do the analyses they need.

guerler commented 9 years ago

@erasche I think its a great idea. Does this require any backend modifications? ping @jmchilton. If not we should be able to access the hda ids of collection components and then populate the single dataset select field at https://github.com/galaxyproject/galaxy/blob/dev/client/galaxy/scripts/mvc/form/form-select-content.js as you mentioned above. We should be careful regarding scalability since some collections might contain several thousand components.

jmchilton commented 7 years ago

@guerler I'm fairly certain the tool API will run just fine if given an individual HDA ID from within a collection. I can add a test case if you wish.

I would discourage doing this unless we do indeed fetch the collection elements on the fly and not with the initial request. Collections keep potentially huge histories small so the tool form for instance still works fine - it would be a real step backward to break those histories in order to populate the collection contents in this form.

As an aside - the workaround that people use I think is the un-hide the individual HDA that corresponds to the collection element they want to run a tool with. In an abstract way I do like that because it is declaring that you are indeed interesting in treating this dataset as a stand-alone thing - so it will be present for instance when this analysis is extracted from the history into a workflow. Without unhiding that dataset - this is just going to be a dangling input. When we get to loose with how we treat the contents of collection there is some conceptual tracibility or reproducibility we are loosing in my opinion - having the user declare inputs as inputs is a slight balance to that.

It is not to say we shouldn't do this - it is a high priority thing to me - I'd just place it at number 6 on the "collection" priority list after upload, re-running, naming issues, deletion, and improved state representation handling.

hexylena commented 7 years ago

I like collections because, as you rightly mention, they keep histories small.

As an aside - the workaround that people use I think is the un-hide the individual HDA that corresponds to the collection element they want to run a tool with. In an abstract way I do like that because it is declaring that you are indeed interesting in treating this dataset as a stand-alone thing - so it will be present for instance when this analysis is extracted from the history into a workflow.

Asking them to unhide things.. they'll just ask me why I forced them to use this cumbersome new feature if they're just going to have to unhide things. And I'll end up back where I started, with non-collection enabled tools because of what my users see as a UX issue. Or I make collection enabled tools and my users complain because of a) the changes, and b) having to run an "explode collection" tool or unhiding datasets + deleting the collection.

In an abstract way, yes, I agree, I also like users declaring "I am pulling this out of a collection".

The specific use case I have in mind is the entrez tools. I think everyone benefits from those being a collection output since 95% of people want to treat them as a giant blob.

Some want to merge them into one file
Some want to batch their analyses over the collection of files and speed up processing, and then merge
My boss uses the outputs from that tool to review genomes, as one-by-one process, running different tools on each genome based on what claims are made in the papers. In that case collections would help him keep his history tidy, but not if he can't run tools in individual items without exploding the collection.

guerler commented 7 years ago

I agree with @jmchilton. I like the process of unhiding too, although we might want to rename it into something like 'extract' to make it more apparent. On the other hand I understand @erasche's concerns. It makes working with collections less straightforward. However, just adding all hdas of all collections to the data selection list will likely lead to severe performance issues, see: https://github.com/galaxyproject/galaxy/blob/dev/lib/galaxy/tools/parameters/basic.py#L1778.

hexylena commented 7 years ago

If perf is a concern, could the select UI behave like history? Select from this list or (click on to) enter a collection and see sub elements.

Aysam Guerler notifications@github.com schrieb am Mo., 27. März 2017, 20:44:

I like the process of unhiding too, although we might want to rename into something like 'extract' to make it more apparent. Although I understand @erasche https://github.com/erasche's concerns. It makes working with collections less straightforward. I am worried about the performance implications if we do it upfront for all collections by adding their hdas to the options list at: https://github.com/galaxyproject/galaxy/blob/dev/lib/galaxy/tools/parameters/basic.py#L1778 .

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/galaxyproject/galaxy/issues/740#issuecomment-289580187, or mute the thread https://github.com/notifications/unsubscribe-auth/AAb_u053OeKRwh9awsS0QiEclQGmoQ69ks5rqB-dgaJpZM4F_O5Y .

mvdbeek commented 7 years ago

If perf is a concern, could the select UI behave like history? Select from this list or (click on to) enter a collection and see sub elements.

That would be a great solution. A small indicator that the item is a collection, and then a click on it will expand the collection to show and be able to select individual items.

jmchilton commented 7 years ago

I agree with @guerler that implementing this in the tool form is tricky - I think it should be done but the tool form hasn't been setup to do this easily.

We keep talking about dragging and dropping datasets into the tool form - this would probably be easier to implement (right @guerler?) and something we definitely want to do anyway since the history filtering and such on the side is very powerful already.

If people could drag and drop collection elements into this form what percent of the UX concerns would be addressed by that workaround - only 5%, 50%, 85%?

jmchilton commented 7 years ago

Some want to merge them into one file

Some want to batch their analyses over the collection of files and speed up processing, and then merge

My boss uses the outputs from that tool to review genomes, as one-by-one process, running different tools on each genome based on what claims are made in the papers. In that case collections would help him keep his history tidy, but not if he can't run tools in individual items without exploding the collection.

I understand that people want to deal with things in different ways for sure. Might it make sense to have different tools or different workflows? Ones that gave users in the first two scenarios collections and ones that produce individual datasets for the third use case. I think one of the strengths (or maybe just distinctions) of Galaxy's approach to that say that of CWL to tools is that we are aiming to produce little individual useful applications almost - not aiming to model a command-line tool and every configuration of its output. So if one tools produces different configurations of outputs or even if users may want to consume their outputs in different ways - it makes sense IMO to have different Galaxy tools for the same command-line.

MoHeydarian commented 7 years ago

I think a dataset within a collection should be unhidden, or "extracted" as @guerler suggested, prior to running a tool on it. Running the tool on a hidden dataset mitigates tracibility and transparency, and breaks reproducibility if trying to build a workflow from that history.

The current way to expose a single dataset from a collection is to unhide all datasets in the history, select the dataset to expose, and unhide it. This can be cumbersome and frustrating (to have your browser grind to a halt while the history refreshes) when your history contains thousands of items hidden behind a collection. As @mvdbeek suggested, having a selectable box when one clicks into a collection and being able to "extract" individual datasets would be my preferred way of unhiding/exposing single datasets.

I'm not sure exposing single datasets from a collection is good for reproducibility. When I need to interrogate single datasets from a collection, I use the Collapse Collection tool (appending the file/sample name to each line) to generate a single file representing a whole collection. From here I can use the filter tool to wrangle data from specific samples. These steps can all be done in a workflow.

hexylena commented 7 years ago

Lots of great points, thanks for having this discussion y'all. I know there are other, higher priority items on collections work, this is just the most visible one to me right now.

@jmchilton @guerler

We keep talking about dragging and dropping datasets into the tool form - this would probably be easier to implement (right @guerler?) and something we definitely want to do anyway since the history filtering and such on the side is very powerful already.

I initially thought "ugh, gross, DnD in the browser + moving my mouse all the way over when I'm vertically, linearly scanning the tool form." After some more consideration, I can see the logic in this, the history filtering is really good, maybe this makes sense to do. (I wonder about accessibility but we have bigger things to attack first on that topic.)

If people could drag and drop collection elements into this form what percent of the UX concerns would be addressed by that workaround - only 5%, 50%, 85%?

For me, that would solve my use cases completely.

I think one of the strengths (or maybe just distinctions) of Galaxy's approach to that say that of CWL to tools is that we are aiming to produce little individual useful applications almost - not aiming to model a command-line tool and every configuration of its output.

I am so completely in agreement with this, you have no idea. For non-bioinformatician users, the tools would ideally be useful and abstracted from the underlying implementation. They don't want to learn "Oh, I have to use a tool named Bowtie for mapping reads", I feel that they have that question and just want to see "map reads to genome" as a tool to make their foray into bioinformatics more obvious.

I have experimented with doing this (for my specific case, again), and it works OK, but I fear that it doesn't scale well since I have to do it on a per-tool/per-functionality case. Yes, I don't have to re-write the tool, but I now have two tools in the tool panel and my boss wonders which he should use and doesn't always use the right one.

@MoHeydarian

I think a dataset within a collection should be unhidden, or "extracted" as @guerler suggested, prior to running a tool on it. Running the tool on a hidden dataset mitigates tracibility and transparency, and breaks reproducibility if trying to build a workflow from that history.

Maybe it is just my reading, but it sounds like you think this should be an explicit action that a user takes, ahead of running the tool? Is that correct? If not: great, agreed, that's fine. (If so: Why does this have to be an explicit, additional step? If this is automatic / implicit, I'm fine with it. If it's explicit, then it's a cumbersome UX issue that requires user training, whereas if it's just "here's a folder (i.e. collection), look in there for datasets" then it's fine.)

I'm not sure exposing single datasets from a collection is good for reproducibility. When I need to interrogate single datasets from a collection, I use the Collapse Collection tool (appending the file/sample name to each line) to generate a single file representing a whole collection. From here I can use the filter tool to wrangle data from specific samples. These steps can all be done in a workflow.

That sounds like a workaround for the underlying issue. We have implicitly launched "convert" tools, surely we should treat the dataset extraction the same: launch a tool that extracts the specified collection element into its own dataset (or however that could happen without creating a duplicate file), as part of the described DnD setup?

Sounds like collapse collection tool only works on text files? (I'm trying to explore, but I don't use main and I seem to be far down the queue.)

MoHeydarian commented 7 years ago

@erasche

Maybe it is just my reading, but it sounds like you think this should be an explicit action that a user takes, ahead of running the tool? Is that correct? If not: great, agreed, that's fine. (If so: Why does this have to be an explicit, additional step? If this is automatic / implicit, I'm fine with it. If it's explicit, then it's a cumbersome UX issue that requires user training, whereas if it's just "here's a folder (i.e. collection), look in there for datasets" then it's fine.)

If a dataset within a collection can be chosen on a tool form and upon execution that hidden dataset is exposed/extracted/visible in the history, I think that would be great. I just think that the input should be visible after it has been used to allow tracibility.

Yes, the Collapse Collection tool only works on text files (for now and hopefully not too long), so I suppose the strategy I mentioned is kind of is a workaround, but in the case of working with single cell *-seq data it works great to operate on lots of expression tables (all text format).

hexylena commented 7 years ago

I just think that the input should be visible after it has been used to allow traceability.

Sure, this is fine! Glad I was mis-reading that.

jxtx commented 7 years ago

I've long thought we need some kind of advanced dataset picker in the tool form.

The default select list for a data parameter would only show datasets in the current history. Keep it small and simple.

Next to the select box would be a button that pops over a dataset picker that lets users browse in a more advanced fashion, including:

Digging into collections
Selecting datasets from other histories
Selecting datasets from data libraries

I also don't think the data needs to be added to the current history in any of these cases. We can provide ways to navigate the provenance graph.

Drag and drop should also happen of course. But doesn't solve important cases like libraries.

jgoecks commented 7 years ago

+1 @jxtx's comments. IMO, there is no need to deal with unhiding/exporting, we just need a more intelligent way to navigate and select datasets within collections and across histories. But:

I also don't think the data needs to be added to the current history in any of these cases. We can provide ways to navigate the provenance graph.

I disagree here. If we don't add datasets used to the current history, we are changing a fundamental aspect of Galaxy: the current history contains all provence for an analysis. Histories in this case become much less self-contained and more difficult to understand.

hexylena commented 7 years ago

Was talking to @bebatut about this issue today, she's using bioblend for this but that isn't a solution for a lot of users.

mblue9 commented 6 years ago

Is drag and drop of datasets within collections still being considered?

As that would be such an unbelievably helpful thing to have imo!! At the moment I keep running into this issue of having collections and not being able to get the datasets out. I've just been working around it by downloading the collection and then reuploading the individual files. But it's a pretty ugly step that I am really hoping I don't have to teach. Even if there will be other more expansive future solutions, drag and drop would be so helpful I think (like the way the histories panel drag and drop is so helpful I couldn't be without it now!)

hexylena commented 6 years ago

@mblue9 I believe @jmchilton wants to implement an "explode collection" tool https://github.com/galaxyproject/galaxy/issues/2496 but yes, being able to select (better, drag and drop) from within a collection would make it much more appealing for certain use cases.

Mataivic commented 6 years ago

Hello,

I was looking on informations about the "filter fail" tool of Collection Operations suite, and I found the following issue : #2496, which drove me here.

I'm interested in the "Extend the filter failed operation with option to filter empty datasets" improvement. I'm currently working with a tool which sometimes produces empty files which causes a crash on the following tool ; if filter_fail can remove empty files from a dataset collection, it would solve my problem.

Thanks in advance.

mblue9 commented 6 years ago

To update on my issue ^^ I tried unhiding datasets (suggested further up in this thread) but that doesn't work well at all in my situation, see screenshot below. As, when I unhide the datasets, they all lose the lovely informative name (element.identifer) I have tried so hard to carry along in this workflow 😞

lparsons commented 6 years ago

I totally agree with @mblue9. Simple put, I'm unable to create a simple RNA-Seq workflow in Galaxy. There are two major issues:

Unable to select datasets within collections
Dataset naming

mvdbeek commented 6 years ago

I have been doing RNAseq experiments on a semi-regular basis with collections and subworkflows for at least the last 1,5 years now and I have to say that with the filter collection tools I think you can do anything that is necessary for RNAseq experiments (while of course it could be improved).

screen shot 2018-01-25 at 09 46 04

I also have a variant of this with salmon. So personally what would be highest on my wish-list would be the ability to more tightly integrate the collection filtering tools with the UI, so that you don't have to think in advance about the structure of your collection or upload a text file to be used in the filter collection tool.

mvdbeek commented 6 years ago

(I'll try to write up something on how one can figure out a good "structure" for an analsys workflow). There's also https://usegalaxy.org/u/marius/w/parent-workflow-chipseq that implements a similar pattern for ChIPseq, so I think this way of designing your workflow and inputs should apply to more cases.

jmchilton commented 6 years ago

@mvdbeek

So personally what would be highest on my wish-list would be the ability to more tightly integrate the collection filtering tools with the UI, so that you don't have to think in advance about the structure of your collection or upload a text file to be used in the filter collection tool.

This is exciting to hear. Given that I've been building a hammer lately everything looks like a nail to me so my initial proposal for this would be merging #5365 and then implementing the third bullet item on #5381 (Apply Collection Builder to Collections). My initial thinking in that issue was it would be a good way to re-organize collections but it would be just as good at filtering right away I think given what has already been implemented in #5365. I think this is a cool approach but I'll admit it isn't obvious what I'm trying to say without me having a prototype ready to demonstrate. Want to check it out and let me know if you can imagine it being a good approach or if I'm not clear or you'd like to see something else can you sketch out a new issue describing what you would like to see and what functionalities it should have?

mvdbeek commented 6 years ago

I can absolutely see that from the PR description / screenshot, yes. I was going to ask if we can apply this to existing collections as well, so that's cool!

lparsons commented 6 years ago

Thanks for the workflow example @mvdbeek. That will work well for simple experiments, but most of the real ones I've come across have additional factors to include in the DESeq2 model (e.g. batch, individual, etc.), which requires the user to select a different grouping of the samples that isn't reflected in the collection organization. Thus the need for this issue.

mblue9 commented 6 years ago

Totally agree with @lparsons. I think @mvdbeek your suggestion is good in theory and for some situations. But the user may also need to be a workflow master like yourself as that workflow looks a bit scary to me are they subworkflows you're got in there? I haven't even got a simple one to work fully yet with names! I just tried use collections from the beginning of a workflow and have still ended up in this mess below and it is just making me want to cry right now.

mvdbeek commented 6 years ago

That will work well for simple experiments, but most of the real ones I've come across have additional factors to include in the DESeq2 model (e.g. batch, individual, etc.)

So my example includes the batch effect, you can see that if you trace the connections for factor 2. So that's treatment and control (factor 1), with a A/B, C/D pairing, where sample-prep for A and B were prepared at the same time. Individual pairing would be possible as well, but you'd need to split up your collection accordingly (that for example is not as straightforward as it could be). I haven't done time-course analysis yet, (happens to be something I'll do today), so that may actually not be possible, but then that would be a limitation of the DESeq2 wrapper.

I do real analyses here, and the fact that I'm able to do it does of course not mean it is as simple as it could be.

which requires the user to select a different grouping of the samples that isn't reflected in the collection organization.

I touched on this above, but I doubt dragging from a collection will work reliably for a multi-factor analysis with multiple replicates. That is going to be very error-prone, and also not generalizable to a workflow. But yes, that is something to work on.

mvdbeek commented 6 years ago

@mblue9 the issue is now that you can't identify what the collection represents ? Tagging them is a good start (that should work in 17.09), and then @jmchilton also fixed the rename output operations for collections in the workflow, so that will hopefully be a breeze in 18.01

lparsons commented 6 years ago

@mvdbeek My apologies, I see now. I guess the issue is that you have to create a collection for every combination of factor levels, which isn't too practical for a lot of experiments and makes the workflow almost more trouble that it's worth (esp. when it comes to having to add rename actions, etc.) However, perhaps some of the changes made in 18.01 will help?

I doubt dragging from a collection will work reliably for a multi-factor analysis with multiple replicates

It seems to me that hashtags are great for handling factor levels. If there was a tag for each level, and I could somehow tell the workflow to use things from the collection with a specific tag...

In the meantime, being able to select from within collections would a manual workaround. I just don't see people setting up a workflow first, then running it for something like this. Instead, people create a single align and count workflow, run it on every dataset in a collection, and then manually run DESeq2, picking the datasets and factors they want. The workflow seems MUCH more complicated and difficult to setup for a one time use.

mvdbeek commented 6 years ago

The workflow was just a graphic way to demonstrate what you need to do. You can also do this without separating the elements up front. So how about another tool that splits collections by tags, would that help ? (we've had that request before, I think.)

lparsons commented 6 years ago

Seems unnecessary to create all these additional collections when one could simply specify which subset of a collection should be used for a specific input. It makes working with collections very cumbersome. Why the reluctance to allow users to treat collections like folders?

jmchilton commented 6 years ago

Why the reluctance to allow users to treat collections like folders?

If researchers are pulling stuff out of collections and filling in boxes by hand - there is some metadata they are leveraging to do that - maybe in the name, maybe in a sample sheet. If the researcher knows how to access that metadata - Galaxy should make it possible and easy for the researcher to convey that information to the collection and should make it intuitive and easy to use that to map that set of files to the tool in an abstract way the is extractable and trackable and reproducible. Missing the modeling of that metadata means an important part of the analysis is not be captured by Galaxy and the analysis is missing important stuff in terms of reproducibility and accessibility. I understand the nitty gritty is difficult and the user experience of collections is rough in many ways currently - but these are the lofty goals.

If you are detecting reluctance to treat collections as folders - it is because they weren't meant to be used that way, it skirts the problem I was hoping collections would solve, and I ultimately think people will be unhappy if they use collections this way - even if we make it super slick. Collections are terribly rough in so many ways - but I'd rather be working on solving the problems they were meant to solve than building a folder structure into histories. This may be a mistake - it may be that capturing that metadata is too hard, building a UI for bridging that metadata from the research to Galaxy and then from Galaxy to the tool form and job structure is too hard, but the reluctance comes down to that being the goal. That is what at least I am trying to do.

I hope that is understandable - I also hope you understand the reluctance is not an unwillingness. No one has ever rejected an enhancement in that direction and I even opened this PR for you.

mblue9 commented 6 years ago

@mblue9 the issue is now that you can't identify what the collection represents ? Tagging them is a good start (that should work in 17.09)

Yes that would be the current issue, I've now no idea what's inside each collection thanks to those cryptic "x on y" names.

How are you tagging? As I just tried this workaround and added a tag to each collection of fastqs that I have (12 collections) but then when I went to run the workflow just now, I've ended up with a history for each collection! so 12 histories!! Is that expected? I would have much preferred just one! As I'm working with multiple types of data at the moment and for multiple users so 12 histories for just one dataset is way too much imo. Do they have to split on the tag?

mvdbeek commented 6 years ago

Did you enable "send to new history" ? That is not the default behavior

On Feb 2, 2018 8:35 AM, "mblue9" notifications@github.com wrote:

@mblue9 https://github.com/mblue9 the issue is now that you can't identify what the collection represents ? Tagging them is a good start (that should work in 17.09)

Yes that would be the current issue, I've now no idea what's inside each collection thanks to those cryptic "x on y" names.

How are you tagging? As I just tried this workaround and added a tag to each collection of fastqs that I have (12 collections) but then when I went to run the workflow just now, I've ended up with a history for each collection! so 12 histories!! Is that expected? I would have much preferred just one! As I'm working with multiple types of data at the moment and for multiple users so 12 histories for just one dataset is way too much imo. Do they have to split on the tag?

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/galaxyproject/galaxy/issues/740#issuecomment-362508915, or mute the thread https://github.com/notifications/unsubscribe-auth/AGfVpRPJ2NDX_F938sSU0MEhhBOwddVSks5tQrrDgaJpZM4F_O5Y .

mblue9 commented 6 years ago

Yes I did Send to a New History as I already had a history full of those x on y. And I had just realised that was cause.

So looks like Sending to a New History is a big NoNo if you have tags on your collection, if you don't want them in a separate history for each collection. Sending to a New History worked differently without the tags so yes, this to me is unexpected non-obvious behaviour.

mvdbeek commented 6 years ago

So looks like Sending to a New History is a big NoNo if you have tags on your collection

Been frustrated by this as well, but that has been this way when you select multiple inputs to a workflow. Clearly that's not ideal, but this is independent of tags

mblue9 commented 6 years ago

Ah ok, yes I think I had not been Sending to New History when not using tags. I don't want to speak too soon...but... this looks like it might work! Or at least be a big improvement on what I had. This is what I've done that's looking promising:

split samples into collections e.g. per group
add a tag (sample label) to each collection at the beginning (e.g fastq stage)
run workflow (but DO NOT Send to New History unless you actually want a history per collection)

mvdbeek commented 6 years ago

That's one way to do it, yes! I'm now checking if we can also use nested collections just until the point where individual collections are needed, e.g. screen shot 2018-02-02 at 09 45 25 That would probably be easier to understand when you look at the history, instead of having parallel collections

mvdbeek commented 6 years ago

Alright, it is now possible to drop datasets from collections into the tool form with #5657 being merged.

hexylena commented 5 years ago

This is finally resolved with https://github.com/galaxyproject/galaxy/pull/7553! :tada::tada::tada::tada:

galaxyproject / galaxy

Toolform support for selecting datasets from within collections #740