Make Tools Collection-Aware

bgruening commented 9 years ago

@jmchilton's recent work on workflow scheduling and data collections will change a lot for the galaxyp project, and we should make our wrappers collection-aware. We may soon have workflows within workflows and loop-like structures. This issue should track our progress. Please add all tools that need to be adapted, plus one really complex example workflow.

[x] fasta_merge_files_and_filter_unique_sequences

trevor commented 9 years ago

What would it take to make them collection-aware?

From what I've heard all of the present tools have been successfully used with Dataset Collections. Is that not the case, or are there other enhancements you're thinking of?

bgruening commented 9 years ago

For example, such things are now possible: https://github.com/galaxyproteomics/tools-galaxyp/commit/e9dc786acdc5702a0327f2ed0fd77538acfc71ef It will also be possible to create collections, and as soon as the loop feature arrives in the UI we need to test it.

jmchilton commented 9 years ago

@bgruening Many map-reduce style workflows should work with the variant of dataset collections I added last summer. Scaffold, ProteinPilot, etc. already use multiple="true". From a tooling perspective, collections are not so different from multiple-file datasets, so one should be able to take a list of RAW files, map msconvert over them to produce a list of mzML files, map an identification program over those, reduce the results with something like Scaffold, and continue with a unified output.

I see very few applications in proteomics for the output collections that were recently added (maybe this is what you mean by looping). The one tool I have encountered in proteomics is MapAlignerIdentification (http://ftp.mi.fu-berlin.de/pub/OpenMS/documentation/html/TOPP_MapAlignerIdentification.html), which is a true N->N operation.
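To make the shape of the map-reduce workflow described above concrete, here is a minimal sketch in plain Python. It is not Galaxy code: `msconvert`, `identify`, `scaffold_like_merge`, and `map_over` are hypothetical stand-ins that only rewrite file names, and the `.ident` extension is invented for illustration.

```python
from typing import Callable, List


def msconvert(raw_path: str) -> str:
    """Hypothetical stand-in: convert one RAW file to one mzML file."""
    return raw_path.rsplit(".", 1)[0] + ".mzML"


def identify(mzml_path: str) -> str:
    """Hypothetical stand-in: run an identification program on one file."""
    return mzml_path.rsplit(".", 1)[0] + ".ident"


def scaffold_like_merge(paths: List[str]) -> str:
    """Hypothetical stand-in: reduce many inputs to one unified output (N -> 1)."""
    return "merged(" + ",".join(paths) + ")"


def map_over(tool: Callable[[str], str], collection: List[str]) -> List[str]:
    """Apply a single-input tool to each element independently (the 'map' step)."""
    return [tool(element) for element in collection]


raw_files = ["a.RAW", "b.RAW", "c.RAW"]
mzml_files = map_over(msconvert, raw_files)   # map: list of RAW -> list of mzML
id_files = map_over(identify, mzml_files)     # map: list of mzML -> list of results
report = scaffold_like_merge(id_files)        # reduce: list -> single output
```

Each map step treats the elements independently, and only the final merge sees all of them at once. A true N->N tool differs in that every output may depend on every input, so it cannot be expressed as `map_over` applied to a single-input tool.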

trevor commented 9 years ago

For example such things are now possible: e9dc786

@bgruening, a hopefully small request after glancing at that branch: if possible, could you aim to keep pull requests / commits (1) logically divided or (2) small and focused on a specific tool?

By (1) I mean that a mass change to indentation across many tools is fine as one big patch, but it shouldn't include any code changes or semantic alterations to the text itself. By (2) I mean that it's easier for others to view the history and match alterations to a specific tool with a specific commit. It's also easier to bisect the history if an unforeseen bug is introduced.

Obviously there are exceptions; decide as needed, of course.

bgruening commented 9 years ago

@trevor Sure, I always try to do that. Is this one https://github.com/galaxyproteomics/tools-galaxyp/commit/e9dc786acdc5702a0327f2ed0fd77538acfc71ef not small enough? Or do you want me to create a separate PR?

@jmchilton Until now I have found only one tool and converted it. If this is the only tool, we can close the bug. I'm not yet finished with the cleanup according to our new https://wiki.galaxyproject.org/Tools/BestPractices page. OpenMS has a few tools like MapAlignerIdentification, for example: MapAlignerPoseClustering, SpecLibSearcher, MapRTTransformer ...

trevor commented 9 years ago

@bgruening

I (usually) try to group together: one conceptual change, in one commit (with one or many files), in one pull request, with a unique description.

Given that, e9dc786 looks great!

jmchilton commented 9 years ago

@bgruening If I were writing a wrapper for SpecLibSearcher, say, I would just take in a single file and produce a single output (http://ftp.mi.fu-berlin.de/pub/OpenMS/documentation/html/TOPP_SpecLibSearcher.html), unless there is real evidence that SpecLibSearcher uses all the inputs to produce all the outputs. If users want to run it over a collection they can (going back several releases), and Galaxy will distribute it across many jobs, do the sample tracking, work more robustly in workflows, etc. When I talked with Oliver Kohlbacher in Berlin two years ago, his recommendation was not even to implement something like output collections, and he told me that they were trying to rework the OpenMS tools so that none of them required the functionality.
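As a rough illustration of why a single-input, single-output wrapper is enough, the sketch below maps such a tool over a collection while keeping each element's identifier. It is plain Python under stated assumptions, not Galaxy's scheduler: `spec_lib_search` and `run_over_collection` are hypothetical names, and the `.results` suffix is invented.

```python
from typing import Callable, Dict


def spec_lib_search(spectrum_file: str) -> str:
    """Hypothetical 1 -> 1 wrapper: one input file in, one output file out."""
    return spectrum_file + ".results"


def run_over_collection(
    tool: Callable[[str], str], collection: Dict[str, str]
) -> Dict[str, str]:
    """Run the tool once per element, as if each call were a separate job.

    The element identifier is carried through unchanged, so every output
    can still be traced back to the sample that produced it.
    """
    return {identifier: tool(path) for identifier, path in collection.items()}


samples = {"sample_1": "s1.mzML", "sample_2": "s2.mzML"}
outputs = run_over_collection(spec_lib_search, samples)
```

The wrapper itself never needs to know about collections; the per-element fan-out and the bookkeeping of identifiers live entirely outside the tool.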

trevor commented 9 years ago

@bgruening I'm assuming this can be closed now, and reopened if needed?

galaxyproteomics / tools-galaxyp

Make Tools Collection-Aware #12