Sample Sheet Workflow Inputs

jmchilton commented 3 weeks ago

The idea would be a new workflow input that optionally defines columns (defaulting to just one: a list of pseudo-datasets that could be either datasets or remote URIs, the way the rule builder works) and then optionally a default set of rules to apply to the columns - so we could render something like a sample sheet for the workflow input. Some of the rules could be editable at runtime (e.g. how to parse forward and reverse) and some could be fixed (e.g. how to parse tags or list identifiers for collections). We could also provide a syntax for implicit defaults for columns (e.g. inferring conditions or reps from file names).
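To make the shape of this concrete, here is a rough sketch of how such an input might be modeled. Nothing here is an existing Galaxy API: `Column`, `SampleSheetInput`, `infer_from_name`, and `editable_at_runtime` are hypothetical names, and the regex-based inference is just one assumed way to express implicit defaults.

```python
import re
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Column:
    """One sample sheet column (hypothetical structure, not a Galaxy API)."""

    name: str
    # Optional regex with one capture group, applied to the file name to
    # pre-fill the cell; None means the user fills the cell in by hand.
    infer_from_name: Optional[str] = None
    # Whether the workflow runner may change how this column is derived.
    editable_at_runtime: bool = False


@dataclass
class SampleSheetInput:
    """A workflow input that would render as a small spreadsheet."""

    columns: list[Column] = field(default_factory=list)

    def default_row(self, uri: str) -> dict[str, Optional[str]]:
        """Build one row for a dataset or remote URI, applying implicit defaults."""
        file_name = uri.rsplit("/", 1)[-1]
        row: dict[str, Optional[str]] = {"uri": uri}
        for column in self.columns:
            value = None
            if column.infer_from_name is not None:
                match = re.search(column.infer_from_name, file_name)
                if match:
                    value = match.group(1)
            row[column.name] = value
        return row


# Assumed example: the workflow author fixes how condition and replicate are
# parsed, but lets the runner adjust how forward/reverse is detected.
sheet = SampleSheetInput(
    columns=[
        Column("condition", infer_from_name=r"^(treated|control)_"),
        Column("replicate", infer_from_name=r"_rep(\d+)"),
        Column("direction", infer_from_name=r"_(R[12])\.", editable_at_runtime=True),
    ]
)

row = sheet.default_row("https://example.org/data/treated_rep2_R1.fastq.gz")
```

With these assumed patterns, the row for that URI would be pre-filled with `condition="treated"`, `replicate="2"`, and `direction="R1"`, and the user would only need to correct cells the inference got wrong.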

This would be for advanced workflows where the workflow author has a lot of information about what collections of inputs should look like but we don't have a lot of capacity to share this in an enforceable way between the workflow author and the workflow runner (https://github.com/galaxyproject/iwc/pull/581 https://github.com/galaxyproject/iwc/pull/582). Our focus on optimizing the easy cases is important (#18704) - but complex workflows with multiple conditions, samples, reps, etc. are very common, and giving users the capacity to run these without needing to learn R, Matlab, Bash, etc. is core to what we should be doing as a project IMO.

There are use cases where there is a clear hierarchy of datasets - say multiple non-overlapping treatments with repetitions - where nested collections might be totally fine, but we have no good way to direct users to create collections of the right type, and even if they understood the workflow completely, the rule builder is very complex to learn. There are other use cases where tagging might be better than nested collections because datasets might belong to multiple conditions or categories. We likewise don't have a good way to point users at this from a workflow, or a nice interface short of the full-blown rule builder to construct such things. The advantage of defining sample sheets this way and then backing them up with an initial step using the Apply Rules tool is that we can target both of these advanced use cases. We can also hide which approach we're using behind the sample sheet - a typical workflow-running user doesn't need to understand nested collections or the complexity of tracking group tags through a workflow.
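As a rough illustration of how one filled-in sheet could back either approach, the sketch below turns the same rows into a nested structure or into per-dataset tags. The row layout, the helper names `to_nested` and `to_group_tags`, and the `group:<column>:<value>` tag spelling are assumptions made for the example, not the actual Apply Rules implementation.

```python
from collections import defaultdict

# Hypothetical filled-in sample sheet rows.
rows = [
    {"name": "s1.fastq", "treatment": "drugA", "replicate": "1"},
    {"name": "s2.fastq", "treatment": "drugA", "replicate": "2"},
    {"name": "s3.fastq", "treatment": "control", "replicate": "1"},
]


def to_nested(rows, outer, inner):
    """Group rows into a two-level structure, similar to a list:list collection."""
    nested = defaultdict(dict)
    for row in rows:
        nested[row[outer]][row[inner]] = row["name"]
    return {key: dict(value) for key, value in nested.items()}


def to_group_tags(rows, columns):
    """Attach one tag per column so a dataset can belong to several groups."""
    return {
        row["name"]: [f"group:{column}:{row[column]}" for column in columns]
        for row in rows
    }


nested = to_nested(rows, outer="treatment", inner="replicate")
tags = to_group_tags(rows, columns=["treatment", "replicate"])
```

The workflow runner only ever edits `rows`; whether the workflow consumes `nested` or `tags` would be the workflow author's choice.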

The weight on the workflow author would be kind of heavy - needing to understand a syntax for how to sculpt the data comparable to the rule builder - but the interface for the end user would be much easier I think. It could default to just a little spreadsheet with simple checks and simple defaults. It would offer all the power of using the rule builder within a workflow without requiring the workflow runner to understand rules or collections.

bgruening commented 3 weeks ago

I fully agree.

> The weight on the workflow author would be kind of heavy - needing to understand a syntax for how to sculpt the data comparable to the rule builder - but the interface for the end user would be much easier I think.

Why "comparable to the rule-builder"? Can we not just use the rule-builder? A tabular input and a predefined rule-builder-json (crafted by the workflow-builder) will bring us a long way, imho.

mvdbeek commented 3 weeks ago

Could you provide an example of what the column definition would look like to the workflow author? I was hoping we'd attach just a little more structure to our collection inputs (basically just free-style hints as to what each layer of a collection represents and optionally a min and max for the number of elements). That would seem fairly easy to build a UI for, and I could envision a nice schema for that.
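For comparison, a minimal sketch of the per-layer hint idea might look like the following. The `Layer` class, its field names, and the `validate` helper are all hypothetical, and nested collections are modeled as plain dicts purely for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Layer:
    """Free-form hint plus optional size bounds for one collection layer."""

    hint: str
    min_elements: int = 1
    max_elements: Optional[int] = None


def validate(collection, layers, path="input"):
    """Return a list of readable problems; an empty list means the input fits."""
    if not layers:
        # Below the last described layer we expect a plain dataset.
        if isinstance(collection, dict):
            return [f"{path}: expected a dataset, found a nested collection"]
        return []
    layer, rest = layers[0], layers[1:]
    if not isinstance(collection, dict):
        return [f"{path}: expected a collection of {layer.hint}s, found a dataset"]
    problems = []
    count = len(collection)
    if count < layer.min_elements:
        problems.append(
            f"{path}: needs at least {layer.min_elements} {layer.hint}(s), found {count}"
        )
    if layer.max_elements is not None and count > layer.max_elements:
        problems.append(
            f"{path}: allows at most {layer.max_elements} {layer.hint}(s), found {count}"
        )
    for key, child in collection.items():
        problems.extend(validate(child, rest, f"{path}/{key}"))
    return problems


# Assumed schema: at least two treatments, each with two or three replicates.
layers = [Layer("treatment", min_elements=2), Layer("replicate", min_elements=2, max_elements=3)]
collection = {
    "drugA": {"1": "s1.fastq", "2": "s2.fastq"},
    "control": {"1": "s3.fastq"},
}
problems = validate(collection, layers)
```

Here the `control` treatment has a single replicate, so `problems` would contain one message pointing at `input/control`.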

jmchilton commented 3 weeks ago

@mvdbeek I'm not opposed to a little schema language for specifying constraints on dataset collections in that way - but the sample sheet approach I have in mind would have the capacity to really hide all the details about collection structure from users. I think the difference in user experience is something like "drop a bunch of files or URIs in and fill out sample name, treatment, and condition columns" vs "create a nested collection with the following structure". The second seems to require a lot more Galaxy knowledge and knowledge about that particular workflow. Also, we don't have a nested collection builder UI component other than the rule builder.

@bgruening I very much was hoping to reuse a lot of the rule builder component - but I think setting up rules for reuse in this context adds new concepts like "set rule as editable at runtime", and I guess in the abstract we wouldn't really have a preview to render the way we do for the rule builder. It feels like 85% of the concepts overlap and 85% of the UI could be reused... but it would be different in its own ways. Hence comparable to the rule builder and not exactly the rule builder. The rule builder is already sort of different depending on whether you're coming from uploads or using the Apply Rules tool. It would be another modality.

galaxyproject / galaxy

Sample Sheet Workflow Inputs #19085