Multiple=true Optional=false tool input no longer works as intended

innovate-invent commented 3 years ago

Given the documentation here: https://docs.galaxyproject.org/en/release_20.09/dev/schema.html#id31

Perhaps counter-intuitively, a multiple="true" data parameter requires at least one data input. If optional="true" is specified, this condition is relaxed and the user is allowed to select 0 datasets.

Tools should not execute if they receive an empty collection and optional=False. This was true up until a recent update. I am testing on 20.09. I expect the same behavior as when a tool is mapped over an empty collection.

Attached is a workflow demonstrating the issue. empty_collection_succeed.ga.txt

I tested this on usegalaxy.org and saw the same issue.

Edit: another issue may have been masking this one and this may have existed in earlier versions of Galaxy

bernt-matthias commented 3 years ago

see https://github.com/galaxyproject/galaxy/pull/12402

@innovate-invent can you check which dataset is supplied to the parameter?

innovate-invent commented 3 years ago

I am not sure which dataset or parameter you are referring to

bernt-matthias commented 3 years ago

The parameter with multiple=true and optional=false. I guess it actually gets an dataset, even if none is explicitly selected.

At least this is what happens in the test in the linked PR: https://github.com/galaxyproject/galaxy/blob/f5ccc53f4dd494fa23f9fdb63e568bf1123c2fda/test/functional/tools/multi_data_param.xml#L52

innovate-invent commented 3 years ago

The test you linked doesn't exactly test this issue. This is about a tool in a workflow receiving an empty collection, it should not execute the tool the same as a tool that is mapped over an empty collection.

Running the tool outside of a workflow should error if no input is selected, but the 'mapping over' - skip execution behaviour does not apply.

mvdbeek commented 2 years ago

Tools should not execute if they receive an empty collection and optional=False.

What does that mean for the step. Should the step execute but not the tool ?

I expect the same behavior as when a tool is mapped over an empty collection.

A collection is not a dataset however, so we haven't really specified what should happen. Also note that this is about providing a dataset or not. An empty collection is still specified as input, that is not the same as not providing a dataset. And not providing a dataset and not providing a collection will indeed result in the same behavior ;).

Tools should not execute if they receive an empty collection and optional=False.

I think we're venturing into undefined behavior territory here. Before fixing or changing things or declaring victory/defeat and/or arguing with past behavior we should define what is supposed to happen in all possible cases.

The options as I see them are:

1. Fail the step in a workflow and in the tool form don't allow this input
  - In the tool form that is probably what happens effectively, since there's no datatype the dataset matcher won't return a match
1. Map over the empty collection, produce an "empty" reduced output (could be single dataset or collection)
  - We don't have the equivalent of an empty collection for datasets. That said, an empty collection could be a nice output proxy (even if we're supposed to get a dataset) for a skipped job/step.
1. Map over the empty collection, don't produce any output
  - This would probably be an issue for downstream steps not finding their inputs.
1. Provide an empty array of datasets in the collection, run the the job
  - A lot of tools iterate over the input, and some of them will probably succeed. It's too unpredictable though to really be an option

Option 2 and 3 sound good to me, but I'm worried about this being very implicit. When an invocation has finished and you look at the result, did you expect a "skip" there, or did something go wrong ? Maybe we need to have some language feature to say "this output may be empty and that's fine" and otherwise we'll fail the step.

jmchilton commented 2 years ago

It seems like a multiple dataset parameter should always be a reduction... just as a strong typing argument... so I don't think I like ii or iii conceptually/theoretically?

iv feels like what we should do in the abstract, but then it should be handed off to the tool logic and i seems the correct result to me. I think I agree with @innovate-invent here... though to be more specific I would make the condition:

"Tools should not execute if they receive an empty collection and optional=False on a multiple=True dataset input parameter."

I suspect we introduced more null handling for conditionals and such but we're still not doing a good job distinguishing between null and an empty collection?

mvdbeek commented 2 years ago

"Tools should not execute if they receive an empty collection and optional=False on a multiple=True dataset input parameter."

So ... we just don't generate a job ? Do we fill downstream step connections with null then ? And isn't that in contradiction to option 4 ? Or was the idea that option 4 is for optional=True case?

jmchilton commented 2 years ago

Not generating a job seems like a good optimization and generating a job but the input validation stuff failing it seems like a fine response also?

No you're right - I was unclear there and imprecise. I guess when I said "run the job" I meant run the job setup code and presumably let it is just fail because the job inputs are invalid? I definitely don't think we should "run the job" in the sense of actually running the job on the cluster.

innovate-invent commented 2 years ago

There is a contract between Galaxy and the tool where if the input is multiple=true but optional=false then there must be at least 1 dataset. The tool code is designed to expect at least one dataset and should not be expected to have the case of zero datasets defined. I would prefer to have functionality that handles this scenario gracefully rather than just tossing an error or violating the contract and generating the tool script.

Empty collections are common and I think this should be approached so that it makes workflows more robust: Lets say we have tools A, B, C, and D in a workflow. => is an empty collection and -> is a single dataset. B is the tool with a multiple=true optional=false input. Depending on the example its output is a single dataset or a collection. C is a tool with an optional single dataset input, single dataset output. D is a tool with a non-optional single dataset input, single dataset output. E is a tool with a multiple=true optional=true input, single dataset output.

I think these are rational behaviors given the following workflows: A=>B->C-> : B is not run, C is run as if its input was not provided A=>B=>C=> : B is not run, C is mapped over an empty collection and not run A=>B->D-> : B is not run, D is not run, any non-optional inputs attached to D are also not run A=>B=>D=> : B is not run, D is mapped over an empty collection and outputs an empty collection A=>B->E-> : B is not run, E is run as if its input was not provided A=>B=>E-> : B is not run, E is run as if its input was not provided

galaxyproject / galaxy

Multiple=true Optional=false tool input no longer works as intended #11291