galaxyproject / galaxy

Data intensive science for everyone.
https://galaxyproject.org
Other
1.38k stars 997 forks source link

Workflow genome selection inconsistent with few bugs: fixed vs runtime #1132

Open jennaj opened 8 years ago

jennaj commented 8 years ago

Few issues, some are tool related, but might be global.

Example tools: Tophat2 and SAM-to-BAM

Impact:

Detail Tophat2:

Detail BAM-to-SAM:

Graphics demonstrating behavior:

Workflow Editor

Workflow Submission

Workflow View

jennaj commented 8 years ago

Related https://github.com/galaxyproject/galaxy/issues/979

guerler commented 8 years ago

@jennaj Thanks a lot for all the details and screenshots. I have tried to answer all questions:

1) "SAM-to-BAM genome selection can only be done within the editor, manually, without benefit of an active list of dbkeys." Response: The genome selection is configured (in the tool xml) such that dbkeys are filtered by the metadata of the selected dataset. The list is empty because no dataset can be selected in the workflow editor.

2) "There is no way to select a genome from the history." Response: The workflow editor does not support the selection of datasets from the history.

3) "No prompt is given to the user that if a custom genome is to be used, it must be converted to a custom build first, and that the dbkey must be a known (due to no active dbkey list)." Response: This is an enhancement. We should have a separate git issue for this.

4) "Tophat2 genome selection can only be done from within the editor and when accessing a built-in index (does not include custom builds). Changing a fixed genome selection at runtime is not possible. Selecting a custom genome directly from the history is possible at runtime." Response: This is a bug of the current run workflow tool form and will be resolved once we switch to the upcoming revised version.

5) "In editor, if an original extracted dbkey is altered to be set at runtime, it cannot be recovered when switching back to fixed mode" Response: This is an enhancement we should implement. It makes this feature more user friendly.

6) "In editor, the dbkey is manually entered and can be left blank which will always trigger an error - Bug?" Response: Yes, its a bug. The current run workflow form and/or the web-controller should properly validate the inputs. See also: 4.

7) "In view (not submission form), if the genome is not fixed and to be selected at runtime, option does not state Select at runtime for all tools (example tools compared below in graphics). Editor does NOT allow option for runtime genome selection from the history - Enhancement?" Response: Yes, its an enhancement. This is related to #979. It seems that the challenge here is that conditionals and repeats often alter the number of input datasets and the run workflow tool form is unable to account for new inputs and workflow pipes. However, we should at least enhance this and allow conditionals/repeats without dataset implications to be set at run time.

Thanks again.

jennaj commented 8 years ago

Thanks Sam. I think we are in the same place. Plus I can clarify a bit. When I say make a ticket - that is my suggestion. If we all think that the SAMtools suite should be different, then that is something we can decide. It has usage implications that our community has already run into. And will more often with the generic workflow modeling (migration to open source shared workflow repos).

(github ticket is a placeholder for those we decide to actually create then link back here)

1) "SAM-to-BAM genome selection can only be done within the editor, manually, without benefit of an active list of dbkeys." Response: The genome selection is configured (in the tool xml) such that dbkeys are filtered by the metadata of the selected dataset. The list is empty because no dataset can be selected in the workflow editor.

Yes, but I think that we should consider turning that off for this tool set. No genome filtering based on inputs. Why? If this tool is included it freezes the workflow to a specific genome, unless one has edit permissions and want to change the workflow itself. As before, and a biggie, Galaxy is now being abstracted with other workflow engines. Plus, most of our other tools allow genome selected/modification at run-time. (except when it is not working - the editor AND the run-time form indicate that genome selection can be made in either for certain tools, like Tophat - which is why I included it as an comparision example). Does that makes sense? This could mean a github ticket.

2) "There is no way to select a genome from the history." Response: The workflow editor does not support the selection of datasets from the history.

It doesn't for this tool set, but it does for Tophat. This tool should be like the other. Actually, all tools that have a genome selection function should have this be the same (including the part about using a genome from the history - within the editor - that was in another observation I made). This is probably the same github issue as for 1)

3) "No prompt is given to the user that if a custom genome is to be used, it must be converted to a custom build first, and that the dbkey must be a known (due to no active dbkey list)." Response: This is an enhancement. We should have a separate git issue for this.

I agree - github issue. Might be the same as 1) and 2). In effect, make the ref genome selection just like Tophat and most other tools. From a list of indexed genomes, same as if using it in the history (and without any filtering by inputs). Workflows are not always created from histories and we should support the design of workflows that have as few hard-set parameters as possible. Ref genome is a biggie to have hard-set. Prevents a ton of valid/common use cases.

4) "Tophat2 genome selection can only be done from within the editor and when accessing a built-in index (does not include custom builds). Changing a fixed genome selection at runtime is not possible. Selecting a custom genome directly from the history is possible at runtime." Response: This is a bug of the current run workflow tool form and will be resolved once we switch to the upcoming revised version.

OK, so once Tophat is functional again we can use it as a model for other tools (again, this is why I picked it to compare to, but in the process uncovered the issues with it). I didn't look specifically to see which tools have this ref genome selection fully implimented. But the Samtools suite needs to. Rest should be examined (are probably just a few or few suites). Could be a master github ticket for all tools that need an update (as a check list) then pull-requests are attached as done? Goal for ref genome selection across tools in workflow editor & run-time?? 1 - ref genome from list of known builds, including custom builds in editor 2 - ref genome from the history in editor 3 - ref genome can be modifed at run-time, A) from list of known builds or 2) from the histroy 4 - UI has the same functionality and looks the same in all tools in all views for ref genome selection - e.g. consistency

5) "In editor, if an original extracted dbkey is altered to be set at runtime, it cannot be recovered when switching back to fixed mode" Response: This is an enhancement we should implement. It makes this feature more user friendly.

Ok, this is a smaller change. I'll submit it as a enhancement github ticket

6) "In editor, the dbkey is manually entered and can be left blank which will always trigger an error - Bug?" Response: Yes, its a bug. The current run workflow form and/or the web-controller should properly validate the inputs. See also: 4.

We can test this once the new version is on main, see if it still a bug, and needs a github issue

7) "In view (not submission form), if the genome is not fixed and to be selected at runtime, option does not state Select at runtime for all tools (example tools compared below in graphics). Editor does NOT allow option for runtime genome selection from the history - Enhancement?" Response: Yes, its an enhancement. This is related to #979. It seems that the challenge here is that conditionals and repeats often alter the number of input datasets and the run workflow tool form is unable to account for new inputs and workflow pipes. However, we should at least enhance this and allow conditionals/repeats without dataset implications to be set at run time.

It is related to #979. That one is a third style of genome selection in the editor and run-time function (as well as the tool form, but that is a style issue unrelated to function. I consider style secondary - I have only asked for functional consistency in the above - and 979 is also about functionality).

Although certainly using the same style in all tools (at least those we can change) is a big win. Maybe it needs to be moved out to a distinct function then called by tools (?). Or maybe it is and just not all tools use it (yet?).

Thanks for the great comments and help. I know this is a monster ticket. It is just the more I looked at a way to solve a user's problem with samtools for a workflow a nest of other issues popped out.

I'll wait for final feedback then make the tickets. Linking back here because I am SURE none of us want to go through the concepts in painful detail again, ha :)

yhoogstrate commented 8 years ago

I am not sure if the following is of any help, but maybe it is so I'll post it. When I had a similar problem I fixed this at the tool side as follows (https://github.com/ErasmusMC-Bioinformatics/featurecounts_galaxy_wrapper/blob/master/featurecounts/featurecounts.xml):

instead of using 2 selects:

            <param name="source_select" type="select" label="GFF/GTF Source">
                <option value="indexed_filtered">Use a built-in index (which fits your reference)</option>
                <option value="history">Use reference from the history</option>
            </param>

I added an extra select that is designed for running in workflows:

            <param name="source_select" type="select" label="GFF/GTF Source">
                <option value="indexed_filtered">Use a built-in index (which fits your reference)</option>
                <option value="history">Use reference from the history</option>
                <option value="attribute">Use a built-in index based on the 'metadata.dbkey' attribute; ideal in workflows</option>
            </param>

When a tool gets executed, the last option (attribute) is selected, it will look for the dbkey in the input file, and scans the data table to obtain the corresponding entry:

    #if $reference_gene_sets_source.source_select == "indexed_filtered"
        "$reference_gene_sets_source.reference_gene_sets"
    #else if $reference_gene_sets_source.source_select == "history"
        "$reference_gene_sets_source.reference_gene_sets"
    #else
        #*
            This is a workaround to obtain the "genome.fa" file that
            corresponds to the dbkey of the alignments.
            Because this file is "calculated" during run-time, it can
            be used in a workflow.
        *#
        "${ filter( lambda x: str( x[0] ) == str( { alignment.metadata.dbkey:True for alignment in $alignments }.keys()[0] ), $__app__.tool_data_tables[ 'gene_sets' ].get_fields() )[0][2] }"
    #end if
jennaj commented 8 years ago

Thanks @yhoogstrate !

The concept seems intersting to me, but let's see what @guerler and others think. Seem like it would be very useful to include enhanced genome select options, specifcally for workflow use, in such core tools.