Open innovate-invent opened 3 years ago
see https://github.com/galaxyproject/galaxy/pull/12402
@innovate-invent can you check which dataset is supplied to the parameter?
I am not sure which dataset or parameter you are referring to
The parameter with multiple=true and optional=false. I guess it actually gets an dataset, even if none is explicitly selected.
At least this is what happens in the test in the linked PR: https://github.com/galaxyproject/galaxy/blob/f5ccc53f4dd494fa23f9fdb63e568bf1123c2fda/test/functional/tools/multi_data_param.xml#L52
The test you linked doesn't exactly test this issue. This is about a tool in a workflow receiving an empty collection, it should not execute the tool the same as a tool that is mapped over an empty collection.
Running the tool outside of a workflow should error if no input is selected, but the 'mapping over' - skip execution behaviour does not apply.
Tools should not execute if they receive an empty collection and optional=False.
What does that mean for the step. Should the step execute but not the tool ?
I expect the same behavior as when a tool is mapped over an empty collection.
A collection is not a dataset however, so we haven't really specified what should happen. Also note that this is about providing a dataset or not. An empty collection is still specified as input, that is not the same as not providing a dataset. And not providing a dataset and not providing a collection will indeed result in the same behavior ;).
Tools should not execute if they receive an empty collection and optional=False.
I think we're venturing into undefined behavior territory here. Before fixing or changing things or declaring victory/defeat and/or arguing with past behavior we should define what is supposed to happen in all possible cases.
The options as I see them are:
Option 2 and 3 sound good to me, but I'm worried about this being very implicit. When an invocation has finished and you look at the result, did you expect a "skip" there, or did something go wrong ? Maybe we need to have some language feature to say "this output may be empty and that's fine" and otherwise we'll fail the step.
It seems like a multiple dataset parameter should always be a reduction... just as a strong typing argument... so I don't think I like ii or iii conceptually/theoretically?
iv feels like what we should do in the abstract, but then it should be handed off to the tool logic and i seems the correct result to me. I think I agree with @innovate-invent here... though to be more specific I would make the condition:
"Tools should not execute if they receive an empty collection and optional=False on a multiple=True dataset input parameter."
I suspect we introduced more null handling for conditionals and such but we're still not doing a good job distinguishing between null and an empty collection?
"Tools should not execute if they receive an empty collection and optional=False on a multiple=True dataset input parameter."
So ... we just don't generate a job ? Do we fill downstream step connections with null then ? And isn't that in contradiction to option 4 ? Or was the idea that option 4 is for optional=True case?
Not generating a job seems like a good optimization and generating a job but the input validation stuff failing it seems like a fine response also?
No you're right - I was unclear there and imprecise. I guess when I said "run the job" I meant run the job setup code and presumably let it is just fail because the job inputs are invalid? I definitely don't think we should "run the job" in the sense of actually running the job on the cluster.
There is a contract between Galaxy and the tool where if the input is multiple=true but optional=false then there must be at least 1 dataset. The tool code is designed to expect at least one dataset and should not be expected to have the case of zero datasets defined. I would prefer to have functionality that handles this scenario gracefully rather than just tossing an error or violating the contract and generating the tool script.
Empty collections are common and I think this should be approached so that it makes workflows more robust: Lets say we have tools A, B, C, and D in a workflow. => is an empty collection and -> is a single dataset. B is the tool with a multiple=true optional=false input. Depending on the example its output is a single dataset or a collection. C is a tool with an optional single dataset input, single dataset output. D is a tool with a non-optional single dataset input, single dataset output. E is a tool with a multiple=true optional=true input, single dataset output.
I think these are rational behaviors given the following workflows: A=>B->C-> : B is not run, C is run as if its input was not provided A=>B=>C=> : B is not run, C is mapped over an empty collection and not run A=>B->D-> : B is not run, D is not run, any non-optional inputs attached to D are also not run A=>B=>D=> : B is not run, D is mapped over an empty collection and outputs an empty collection A=>B->E-> : B is not run, E is run as if its input was not provided A=>B=>E-> : B is not run, E is run as if its input was not provided
Given the documentation here: https://docs.galaxyproject.org/en/release_20.09/dev/schema.html#id31
Tools should not execute if they receive an empty collection and optional=False. This was true up until a recent update. I am testing on 20.09. I expect the same behavior as when a tool is mapped over an empty collection.
Attached is a workflow demonstrating the issue. empty_collection_succeed.ga.txt
I tested this on usegalaxy.org and saw the same issue.
Edit: another issue may have been masking this one and this may have existed in earlier versions of Galaxy