galaxyproject / galaxy

Data intensive science for everyone.
https://galaxyproject.org

Eject (or split) dataset collection operation #3870

Closed nekrut closed 3 years ago

nekrut commented 7 years ago

In some cases it is necessary to gain access to collection elements individually. For example, in my ChIP-seq analysis I initially bundle all data (signal and control) together into a single collection to pre-process, map, and post-process. However, when I run MACS it requires me to load signal and control separately. To enable this it would be necessary to have one of these:

bgruening commented 7 years ago

While I agree on the need for such functionality in the worst case, I do think that in this case the more correct thing would be to create two collections, one for your signal and one for your control. Imho we should only aggregate files that belong together functionally.

nekrut commented 7 years ago

Then we really need to allow multiple select for collections in tools, so you can run on multiple collections.

jxtx commented 7 years ago

Record types help a lot here. In my CWL ChIP-seq workflow I have a list of replicates, each of which is a record of treatment and control, each of which is an (optionally paired) FASTQ. I think this leads to the most natural representation of the workflow.

@jmchilton and I have discussed this and he is going to rough out an idea of record types for Galaxy (which would presumably subsume the current "paired" collection).
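To make the record idea above concrete, here is a minimal sketch in Python. It is not Galaxy's or CWL's actual data model; the class names `Reads` and `Replicate`, the helper `peak_caller_inputs`, and the file names are all hypothetical. It only illustrates how named fields keep treatment and control distinguishable without relying on element order.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Reads:
    """One sequencing sample: single-end, or paired when `reverse` is set."""
    forward: str
    reverse: Optional[str] = None

    @property
    def is_paired(self) -> bool:
        return self.reverse is not None


@dataclass
class Replicate:
    """A record with two named fields, so the roles are never ambiguous."""
    treatment: Reads
    control: Reads


def peak_caller_inputs(replicates: List[Replicate]) -> List[Tuple[str, str]]:
    """Pair each treatment with its control by field name, not by position."""
    return [(r.treatment.forward, r.control.forward) for r in replicates]


# Hypothetical file names, for illustration only.
replicates = [
    Replicate(Reads("rep1_chip.fastq"), Reads("rep1_input.fastq")),
    Replicate(
        Reads("rep2_chip_R1.fastq", "rep2_chip_R2.fastq"),
        Reads("rep2_input_R1.fastq", "rep2_input_R2.fastq"),
    ),
]
pairs = peak_caller_inputs(replicates)
```

Because a tool would address `treatment` and `control` by name, the same workflow could be rerun on a new list of replicates without anyone sorting elements by hand.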


nekrut commented 7 years ago

Aha, yes, but in the short term splitting a collection would be nice.

jxtx commented 7 years ago

But last resort ;)

jxtx commented 7 years ago

(Because it is hard/impossible to do reusably or reproducibly. You don't know which elements are treatment and which are control when you explode a collection interactively, so you can't make a workflow. Record types address this.)

bgruening commented 7 years ago

No clue about the client side, but I think selecting multiple collections at once would help you here more than this last resort tool.

nekrut commented 7 years ago

Yes indeed. So multiple select then.

jxtx commented 7 years ago

Oh, don't close, last resort but still worth having.

hexylena commented 7 years ago

Similar to https://github.com/galaxyproject/galaxy/issues/740, any solution for this would solve my issue as well :D

Takadonet commented 7 years ago

We find that the ability to eject or split would greatly improve our biologists' ability to do their work. A specific example is the SNVPhyl workflow (https://snvphyl.readthedocs.io/en/latest/), where all samples are used in a single analysis. Sometimes the only way to know whether one or more samples have to be removed from the collection is at the end of the workflow.

The end user then has to re-make the collection without those samples and re-run. The issue is that sometimes they have to remove dozens or hundreds by hand. Since paging was added (so happy for that!), it is almost impossible to select all the files again if there are more than 500 files in total.

I don't think the 'eject' tool needs to be usable during workflow execution, but it should be available.

jmchilton commented 7 years ago

@Takadonet Can you use the filter failed tool to automate this? Or put another way - how are users selecting these datasets?

Takadonet commented 7 years ago

@jmchilton Based on the output results from either a phylogenomic tree or on values in a secondary dataset. For example, all samples that have less than 60% identity to the reference should be removed.
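As a rough illustration of the criterion described above, the following Python sketch partitions samples by a percent-identity threshold. The two-column tab-separated layout (element identifier, percent identity), the function name `split_by_identity`, and the sample names are assumptions made for this example; they are not taken from SNVPhyl or Galaxy.

```python
import csv
import io
from typing import List, Tuple


def split_by_identity(table: str, threshold: float = 60.0) -> Tuple[List[str], List[str]]:
    """Return (keep, remove) element identifiers from a two-column TSV.

    Column 1 is assumed to be the element identifier and column 2 the
    percent identity to the reference. Lines starting with '#' are skipped.
    """
    keep: List[str] = []
    remove: List[str] = []
    for row in csv.reader(io.StringIO(table), delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue
        identifier, identity = row[0], float(row[1])
        # Samples strictly below the threshold are the ones to drop.
        (keep if identity >= threshold else remove).append(identifier)
    return keep, remove


# Hypothetical values, for illustration only.
example = "#sample\tidentity\nsampleA\t98.2\nsampleB\t41.7\nsampleC\t60.0\n"
keep, remove = split_by_identity(example)
```

With the hypothetical table above, `sampleB` falls below 60% and lands in `remove`, while a sample at exactly 60% is kept.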

jmchilton commented 7 years ago

@Takadonet Can you implement a tool that will just fail outputs that don't meet these criteria and then use the "filter failed" tool?

If it makes sense for your workflow to have a human involved - that is totally fine - but I'm always looking for guinea pigs to utilize new workflow functionality.

Takadonet commented 7 years ago

@jmchilton Seems to me that both cases would be needed. One case is where a human is involved, and to me that should use the same interface as creating a new collection so it is consistent.

The other case should be a tool similar to the ones already in the base Galaxy codebase, e.g. merge collection, unzip, zip, etc. There is no point in making it a normal Tool Shed tool, because that would duplicate datasets. Having the new tool available during workflow execution would be awesome but difficult to implement for sure.

We are always up for being guinea pigs!

jmchilton commented 7 years ago

@Takadonet Good points - I have a PR to add a filtering option that works without dataset duplication here https://github.com/galaxyproject/galaxy/pull/3940. Hopefully it will be in 17.05 - then all you would need to do is write a tool that looks at whatever metadata is interesting and builds a list of identifiers only of those you wish to keep.
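A minimal sketch of the "list of identifiers" approach follows, in plain Python rather than as a Galaxy tool. The one-identifier-per-line file format, the function names, and the dict standing in for a collection (element identifier mapped to a dataset id) are assumptions for illustration; the actual interface is whatever the linked PR defines.

```python
import tempfile
from pathlib import Path
from typing import Dict, Iterable, List


def write_identifiers(identifiers: Iterable[str], path: Path) -> None:
    """Write one element identifier per line (assumed format)."""
    path.write_text("".join(f"{identifier}\n" for identifier in identifiers))


def read_identifiers(path: Path) -> List[str]:
    """Read identifiers back, ignoring blank lines."""
    return [line.strip() for line in path.read_text().splitlines() if line.strip()]


def filter_collection(elements: Dict[str, int], keep: Iterable[str]) -> Dict[str, int]:
    """Build a new mapping that references the same dataset ids.

    Only the references are carried over, so no dataset is duplicated.
    """
    wanted = set(keep)
    return {ident: dataset_id for ident, dataset_id in elements.items() if ident in wanted}


# Hypothetical collection: element identifier -> dataset id.
collection = {"sampleA": 101, "sampleB": 102, "sampleC": 103}

with tempfile.TemporaryDirectory() as tmp:
    id_file = Path(tmp) / "keep.txt"
    write_identifiers(["sampleA", "sampleC"], id_file)
    filtered = filter_collection(collection, read_identifiers(id_file))
```

The filtered mapping keeps the original element order and points at the same dataset ids, which is the property that avoids dataset duplication.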

Takadonet commented 7 years ago

@jmchilton We will probably cherry-pick this into our current Galaxies ASAP. We have lots of users who would be interested for sure.

alexlenail commented 6 years ago

Sorry, I'm a little lost among the multiple issues on this topic: what is the current status of being able to run tools on subsets of collections? If that isn't possible, is there a way to "eject" collections into a bunch of unique history items?

eschen42 commented 6 years ago

I would like to be able to copy a few datasets from a list of datasets. Specifically, I have a list of over mzML datasets, and I want to extract the dozen that represent the pooled samples. In the History UI, I can choose "Copy Datasets" and choose from the datasets in the history, but when I click on my list dataset so that its contents are revealed and the rest of the history is hidden (i.e., the history pane says "back to (my history)" and "a list with (count) items"), when I choose "Copy Datasets", it shows the datasets in the enclosing history.

Having "eject" would give me a workaround at least. Alternatively, if "Copy Datasets" worked for choosing members from list contents, then copying the members to the enclosing history would have the same effect as eject. Right now my only choice is to download (or find my original files) and upload.

@dannon I thought that it made better sense to comment here than to open a new issue since this seems so closely related.

hexylena commented 5 years ago

@nekrut the phrasing in your original post was very interactive, so for this case is it now resolved with https://github.com/galaxyproject/galaxy/pull/7553?

hexylena commented 3 years ago

I think this is mostly solved with the ability to filter by element identifier, and to interactively select in the tool form. I'm going to close this but please let me know if it's still not resolved and we should re-open.