dannon opened this issue 5 years ago
Related ticket: https://github.com/galaxyproject/galaxy/issues/5632
Example: Thread today at Galaxy Help that links to another related thread (same issue from an end-user perspective): https://help.galaxyproject.org/t/workflow-is-not-operating-on-a-collection/586/2
Note for end-users: If your workflow is not executing correctly, perhaps with no inputs, or seems to "stop" at some point even with inputs included, please try the following (until this ticket closes):
Here is another report, and this one does not involve collections: https://help.galaxyproject.org/t/workflow-does-not-show-in-history/934
I think this issue is a big problem. There is no intuitive way to solve it, and the fact that you have to connect tools "from left to right" in order to be able to execute a workflow is illogical.
In re:
the fact you have to connect tools “from left to right” in order to be able to execute workflow is illogical
Imagine a workflow with four steps: a paired input (d, d), a list input [d, d, ...], and two instances (f1 and f2) of a tool that takes two datasets and produces one (d1, d2 -> d3). Imagine attaching the paired input to f1's first input. f1 is now being mapped over a paired input. Now imagine attaching f2's output to f1's second input. This is valid - you can pass a single input to something being mapped over a collection on a different input. Now imagine attempting to connect the list input to one of f2's inputs - we cannot, because that would cause f2's output to become a list, which would make that output incompatible with an input it is already connected to. Now imagine that there are twenty steps between f2 and f1.
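To make the depth bookkeeping concrete, here is a rough sketch of the local rule at play (invented TypeScript types and names, not the editor's actual data model): each step derives its "map over" depth from its own inputs only, and a new connection is rejected if the resulting output depth would break an input that is already consuming that output.

```typescript
type Depth = number; // 0 = single dataset, 1 = list or pair, 2 = list:list, ...

interface Step {
  id: string;
  inputSourceDepth: Record<string, Depth>;          // collection depth feeding each input
  consumers: { step: Step; acceptsDepth: Depth }[]; // inputs fed by this step's output
}

// The depth this step's output is mapped over, derived only from its own inputs.
function mapOverDepth(step: Step): Depth {
  const depths = Object.values(step.inputSourceDepth);
  return depths.length ? Math.max(...depths) : 0;
}

// Local acceptance rule: a new source is allowed only if every input already
// consuming this step's output can still accept the resulting depth.
function canAccept(step: Step, sourceDepth: Depth): boolean {
  const newDepth = Math.max(mapOverDepth(step), sourceDepth);
  return step.consumers.every(c => c.acceptsDepth >= newDepth);
}

// The scenario above: f1 is mapped over the paired input on its first input,
// and f2's single-dataset output is wired into f1's second input, so that
// connection only tolerates depth 0 from f2.
const f1: Step = { id: "f1", inputSourceDepth: { input1: 1 }, consumers: [] };
const f2: Step = { id: "f2", inputSourceDepth: {}, consumers: [{ step: f1, acceptsDepth: 0 }] };

console.log(canAccept(f1, 0));     // true: a single dataset into a mapped-over step is fine
f1.inputSourceDepth["input2"] = 0; // connect f2's output to f1's second input
console.log(canAccept(f2, 1));     // false: the list would turn f2's output into a list
                                   // and break the existing f2 -> f1 connection
```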
We could in theory solve this constraint problem for every node when an output is "drag started". We'd have to build n different copies of the workflow and check each of them for validity. I have no evidence either way whether this could be done in a performant manner; if I had to guess I'd say no, simply because the workflow editor already has problems on large graphs and I can't imagine a solution to this that isn't O(n^2), where n is the number of nodes in the graph - and the current calculation is already quite complex even though it is only O(1).
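For concreteness, here is a rough sketch of what that per-drag global check would look like and where the O(n^2) comes from (the graph model and validateGraph below are stand-ins, not real editor code):

```typescript
// Hypothetical global approach: on drag start, clone the whole workflow once
// per candidate input, tentatively add the connection, and re-validate everything.

interface Connection { from: string; to: string; }
interface Graph { steps: string[]; connections: Connection[]; }

// Stand-in validator: the real check would re-derive map-over depths for every
// step; here we only reject two connections into the same input.
function validateGraph(g: Graph): boolean {
  const targets = g.connections.map(c => c.to);
  return new Set(targets).size === targets.length;
}

function validTargets(g: Graph, draggedOutput: string, candidateInputs: string[]): string[] {
  return candidateInputs.filter(input => {
    // an O(n) copy per candidate ...
    const copy: Graph = {
      steps: [...g.steps],
      connections: [...g.connections, { from: draggedOutput, to: input }],
    };
    // ... times an O(n) validation, for up to n candidates: O(n^2) per drag.
    return validateGraph(copy);
  });
}
```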
Our solution to not solving this global constraint problem for each node on each drag is instead to handle map/reduce constraints locally at each node. Let's imagine for a second that we drop this and solve the global constraint problem for each alternative graph, for each input, each time a node is dragged. Let's walk through some of the confusing behavior and questionable correctness that would result.
Right now the mapping of a node is maintained on its terminals (its inputs and outputs determine it). If instead it were determined by the whole chain of the workflow behind it, imagine you have f1 again, and a collection input 20 steps away is causing it to produce a list. Imagine also that f1 is being fed into a tool that is fed into another tool that is fed into another tool that consumes a list. Now let's imagine we want to insert a new preprocessing step right before f1. We disconnect the list input, and the whole global structure of the workflow from that point on becomes different for a moment while we attach another list step to it. In that intermediate state, the tool that was being mapped over to produce a list, and that is connected to an actual list input several steps down the chain, is completely invalid. Do we disconnect all the invalid nodes? Do we add a new invalid connection type? How do we warn the user? I'd argue this behavior is just as confusing as the behavior that is being called "illogical" in this issue.
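To make the "whole sections become invalid" problem concrete, here is a rough sketch (again an invented data model, not the editor's) of the propagation a global approach would have to run after every such disconnect; the open questions above all start from the list of broken connections it returns.

```typescript
// When an upstream depth changes, walk everything downstream, recompute each
// step's output depth, and collect the connections that no longer fit.

interface Node {
  id: string;
  inputDepths: Map<string, number>;                             // input name -> depth feeding it
  consumers: { node: Node; input: string; maxDepth: number }[]; // downstream inputs
}

function outputDepth(n: Node): number {
  return Math.max(0, ...Array.from(n.inputDepths.values()));
}

function propagate(changed: Node): { from: Node; to: Node; input: string }[] {
  const broken: { from: Node; to: Node; input: string }[] = [];
  const queue: Node[] = [changed];
  while (queue.length > 0) {
    const n = queue.shift()!;
    const depth = outputDepth(n);
    for (const c of n.consumers) {
      if (depth > c.maxDepth) {
        broken.push({ from: n, to: c.node, input: c.input }); // now invalid
      }
      if (c.node.inputDepths.get(c.input) !== depth) {
        c.node.inputDepths.set(c.input, depth);
        queue.push(c.node); // keep walking downstream
      }
    }
  }
  return broken;
}
```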
The will-it-or-won't-it-accept logic for these nodes is already pretty complex. I'd imagine someone smarter than me could do it better, but I did it, I wrote thousands of lines of tests for it, and it took two solid weeks I believe. "Fixing" this behavior would probably require logic that is an order of magnitude more complex, in addition to being potentially unperformant and requiring a whole new set of logic, UI, etc. for temporarily invalid connections across the whole workflow after minor changes to it. Let's say we somehow solve the performance question and develop some UI that conveys that whole sections of the workflow downstream are going to change and become invalid because of changes on the other side of the workflow - let's then ponder the question of whether all that effort has been in service of something "correct".
Imagine a workflow that produces a nested list in the middle and uses the flatten tool as a reduction step (or the Apply Rules tool - or something else of this variety). Now imagine the user takes an input step that is just a dataset somewhere - and replaces it with a collection. We're going to add a new mapping level to the whole workflow. We'd expect all the summarized list outputs then to become lists - but they will not because that flatten step is going to absorb the extra layer of nesting and the whole semantics of the workflow are going to change in a completely un-intuitive way.
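A toy depth calculation (a deliberate simplification with made-up step names, not Galaxy's actual behavior) shows why the change never becomes visible at the outputs:

```typescript
// Collection depth through the chain described above: a step builds a nested
// list, the flatten step collapses any nesting back to a flat list, and the
// final summarizing step passes depth through unchanged.

type Depth = number;

const buildNestedList = (d: Depth): Depth => d + 2; // produces a nested list
const flatten = (_d: Depth): Depth => 1;            // always yields a flat list
const summarize = (d: Depth): Depth => d;           // depth-preserving final step

const outputDepthFor = (inputDepth: Depth): Depth =>
  summarize(flatten(buildNestedList(inputDepth)));

console.log(outputDepthFor(0)); // 1 - dataset input: a flat list at the end
console.log(outputDepthFor(1)); // still 1 - swapping in a collection input is
// silently absorbed by the flatten step, so the outputs never gain a level of
// nesting; the elements are just merged together, which is the surprising part.
```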
The current alternative is admittedly a lot of work: the user has to re-connect all the nodes and re-verify that the logic makes sense. But at least they are revisiting all the potential changes and hopefully verifying they are correct.
These seem like details that users shouldn't have to deal with, but I think that is only because we did a good job making the 90% case seem really intuitive and easy. The "correct" solution isn't easy, and it does in fact require users to understand what is going on and make decisions.
Realistically, workflow developers need to deal with how to map and reduce at each level where collections are used. This is what CWL does, this is what WDL does (https://software.broadinstitute.org/wdl/documentation/spec#scatter), and this is what Nextflow does (https://github.com/nextflow-io/patterns/blob/master/docs/process-per-file-path.adoc). Galaxy took an implicit approach to these things, which is potentially very different from what one would sometimes want to do - we solved the 90% problem, but the rest needs to be explicit, and users have to get used to seeing the mapping and reduction clearly and having knobs to turn. The step-local decisions about what mapping and reduction look like are the right ones - the focus should be on making those decisions more transparent, more configurable, and easier to understand and modify.
(Thanks to @mvdbeek for reviewing this comment before publication.)
I had two misconceptions going into this:
None of the above is correct, which makes me think the current implementation is very good, and https://github.com/galaxyproject/galaxy/pull/7642 makes it almost perfect.
I was revisiting this, and I still 100% agree with @jmchilton's analysis ... but maybe we can try to get closer to allowing "simple" cases like the opening scenario by doing some checks when hovering over a drag target. In this scenario
Now imagine attempting to connect the list input to one of f2's inputs - we cannot, because that would cause f2's output to become a list, which would make that output incompatible with an input it is already connected to. Now imagine that there are twenty steps between f2 and f1.
We could in theory solve this constraint problem for every node when an output is "drag started". We'd have to build n different copies of the workflow and check each of them for validity.
we could attempt to connect when hovering over an "orange connector" and track intermediate changes in vuex, so we can either undo, or offer to disconnect and review, once we hit something that is not compatible.
Tracking mutations is something we need to do for an undo button as well, so maybe we could build on that.
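Roughly what I have in mind (plain TypeScript rather than an actual vuex store, and all names are made up): record every tentative connect/disconnect together with enough information to invert it, so a hover preview that turns out to be incompatible can be rolled back exactly, and the same log could back an undo button.

```typescript
// Track tentative connection changes so they can be undone exactly.

type Mutation =
  | { kind: "connect"; from: string; to: string }
  | { kind: "disconnect"; from: string; to: string };

class ConnectionLog {
  private applied: Mutation[] = [];

  apply(connections: Set<string>, m: Mutation): void {
    const key = `${m.from}->${m.to}`;
    if (m.kind === "connect") connections.add(key);
    else connections.delete(key);
    this.applied.push(m);
  }

  // Undo everything applied since the preview started, most recent first.
  rollback(connections: Set<string>): void {
    while (this.applied.length > 0) {
      const m = this.applied.pop()!;
      const key = `${m.from}->${m.to}`;
      if (m.kind === "connect") connections.delete(key);
      else connections.add(key);
    }
  }
}

// Usage: preview a connection while hovering, then roll back if validation fails.
const connections = new Set<string>(["f2.out->f1.input2"]);
const log = new ConnectionLog();
log.apply(connections, { kind: "connect", from: "list_input.out", to: "f2.input1" });
// ... run the compatibility check; if it fails:
log.rollback(connections); // back to the pre-hover state
```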
Given the above workflow, if you want to swap from a singular input to a collection, you must first disconnect the link between the two "random lines" steps. We should be able to validate and propagate an input change from singular to collection on the fly.
(didn't see one at first glance, but if there is an open issue for this already feel free to link it up and close out)