ga4gh / workflow-execution-service-schemas

The WES API is a standard way to run and manage portable workflows.

Workflow engine parameter categories #171

Closed vinjana closed 2 years ago

vinjana commented 3 years ago

According to the spec, the service info returns default_workflow_engine_parameters as a list of dictionaries with "name", "type", and "default_value" fields. For run submission via POST the spec is even less detailed: there, the workflow_engine_parameters are just a string.

The problem now is that command lines can be quite complex, with different categories of parameters. For instance, for `nextflow run` there are at least the following categories/types/slots of parameters:

  * environment variables (e.g. `NXF_OPTS`)
  * options to the `nextflow` command itself (e.g. `-log`)
  * options to the `run` subcommand (e.g. `-with-trace`, `-profile`)
  * workflow parameters, passed with a double dash (e.g. `--input`)

At the moment, the only way to discriminate between the parameters for each of these slots is to track in the WES implementation which names belong to which group. This is suboptimal: the implementer has to cast a lot of knowledge into code, and such coded "knowledge" complicates the implementation and configuration, and it tends to become outdated with new engine versions, which means additional maintenance costs for new or even multiple different workflow engine versions.
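
For illustration, a minimal sketch of what this hard-coded knowledge looks like (Python; the table and slot names are hypothetical, not taken from any real WES implementation):

# Hypothetical illustration of the status quo: the WES implementation has to
# hard-code which Nextflow option belongs to which slot of the command line.
ENGINE_PARAMETER_SLOTS = {
    "NXF_OPTS": "environment",   # environment variable
    "log": "global",             # option to `nextflow` itself
    "with-trace": "run",         # option to the `run` subcommand
}
# Any engine release that adds, renames, or moves an option requires a code
# change here, and every additional engine version needs its own table.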

By contrast, if the engine parameters contained the information into which parameter-category they belong, the implementer could focus on implementing the rules for processing the parameters, such as where to put the parameter in the CLI, whether to prefix them with no, a single, or double-dash, etc. Only for parameters where tighter control is necessary (e.g. limiting the number of cores), dedicated rules would need to be implemented. Adding or removing parameters in a specific slot would then just be adding a value to your POST call or your WES config file.

I see two ways to specify which parameter goes where in the CLI of the workflow engine.

  1. Use a dictionary with fields "environment", "command", etc. (whatever is appropriate for the engine) and put the parameter objects in there. That is
      {
         "environment": [
            {
               "name": "some name",
               "value": "the value",
               "type": "a type"
            }
         ]
      }
  2. Add a field to the parameter's name-value-type object. E.g.
      [
         {
            "name": "some name",
            "value": "the value",
            "type": "a type",
            "category": "the command slot"
         }
      ]
      (The "value" field would be left out for switches, such as Nextflow's `-with-trace`.)

I find solution 2 better, because it is more flexible and does not complicate the data structure further. The actual parameter objects/dicts could also be simplified by defining default values for omitted fields (e.g. no "type" means the type is string; no "category" means a command parameter). The structure should also be the same in the POST call for submitting a workflow and in the default_workflow_engine_parameters, so no "default_value", but just "value".

Here is an example:

[
   {
      "name": "NXF_OPTS",
      "value": "-Xmx256m",
      "category": "environment"
   },
   {
      "name": "with-trace"
   }
]

This could be mapped to `NXF_OPTS="-Xmx256m" nextflow run main.nf -with-trace`.
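
A sketch of how a WES implementation might perform this mapping (Python; the function name is hypothetical, the defaults follow the proposal above, and the single-dash rule is Nextflow-specific):

import shlex

def build_command(params, workflow="main.nf"):
    # Missing "category" means command parameter; missing "value" marks a
    # switch such as Nextflow's `-with-trace`.
    env, args = {}, []
    for p in params:
        if p.get("category", "command") == "environment":
            env[p["name"]] = p["value"]
        else:
            args.append(f"-{p['name']}")
            if "value" in p:
                args.append(shlex.quote(str(p["value"])))
    prefix = [f"{k}={shlex.quote(v)}" for k, v in env.items()]
    return " ".join(prefix + ["nextflow", "run", workflow] + args)

params = [
    {"name": "NXF_OPTS", "value": "-Xmx256m", "category": "environment"},
    {"name": "with-trace"},
]
print(build_command(params))
# NXF_OPTS=-Xmx256m nextflow run main.nf -with-trace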

It would be interesting to hear your opinions about the idea.

vinjana commented 3 years ago

A problem that may be related (in terms of implementation) is how the default engine parameters should be combined with the run's engine parameters. If there are just lists of name/value/type[/category] objects, it is not possible (or at least not obvious how it would be possible) to remove a default parameter via a run's engine parameters. Rephrased: at the moment there is no way to say: do not use engine parameter XY, no matter what the defaults say.

For instance, one has -with-trace defined as a default engine parameter for Nextflow, but wants to turn it off because of a bug (not to say that there is one in Nextflow).

One approach may be adding a field that indicates that the parameter should be interpreted as subtractive rather than additive. This would fit the "flat list of parameters" implementation.
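
A minimal sketch of that merge logic, assuming a hypothetical boolean field "subtract" on the parameter object:

def merge_engine_parameters(defaults, run_params):
    # Defaults first; a run parameter with "subtract": true removes the
    # like-named default instead of adding or overriding it.
    merged = {p["name"]: p for p in defaults}
    for p in run_params:
        if p.get("subtract", False):
            merged.pop(p["name"], None)   # e.g. drop a default -with-trace
        else:
            merged[p["name"]] = p         # add new or override default
    return list(merged.values())

defaults = [{"name": "with-trace"}]
run_params = [{"name": "with-trace", "subtract": True}]
assert merge_engine_parameters(defaults, run_params) == []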

uniqueg commented 2 years ago

The ability of a single API to cover as many behaviors of as many software solutions as possible is probably the biggest problem we are facing time and again in standardization efforts. The API will always be the eye of the needle, and we will probably have to find some middle ground between the two extremes of

(1) supporting every functionality of a given solution (adding free key-value pairs wherever the need occurs; but then upstream users will need to know about all of that, so there's not much of an advantage anymore over using everyone's own custom API), and

(2) supporting just the intersection of the features of the available (and future?) solutions (whatever that is).

So far I think we have been trying to get as close to (2) as possible, so as to reach a good consensus and adoption, and then maybe have some leverage to convince software developers (here: workflow engine developers) to agree on more common features and align their APIs/features with what the common API offers.

(NB: Meanwhile, we could maybe have a look at how we could potentially create ontological mappings that would allow us to further define equivalent functionalities of the software solutions we are looking at and then create specs for those. I'm not an expert, but I think especially in bioinformatics there are a lot of people working on ontological mapping problems for their whole careers, so perhaps we could get some help there?)

I may be wrong, but I think the lack of responses to your important issue shows how reluctant people are to open this Pandora's box. And what would we really gain here? The client (or, worse, the user) would still need to know about the parameter categories, so for them it would presumably be harder, not easier, to use a given WES. Now, if you end up using the same WES every time for your jobs, the inconvenience of learning how to use your WES of choice might still be acceptable. But when you want to "bring the compute to the data" and thus need to talk to the WES that sits next to that data, or, worse, execute your workflow on a network of WES services, you might be talking to different WES instances frequently, sometimes maybe not even knowing in advance which one you will talk to.

And these use cases, I think, are kind of the promise or rallying call of the Cloud Work Stream, and they are jeopardized by added complexity on the client's side, especially if it involves arbitrary choices, like names for parameter classes. I'm aware that we already have that situation to some extent, by having the workflow_engine_parameters as a free field. But with the defaults I think the current implementation tries to strike a compromise between "one size fits all" and the ability to tweak the behavior when really, really needed.

So in summary, I tend to think that it's better to make it easier on the clients/users. Constructing the actual CLI command from workflow_engine_parameters is therefore best done by the engine, even if painful.

But since we are on this topic, I do believe that we don't need everyone implementing their own WES services, with each one possibly supporting multiple different engines. What I think we should move to is one well-maintained WES per engine, ideally implemented and maintained by the workflow engine developers/developer communities themselves. As far as I can see, that would be 5-10 or so, given the communities that are currently involved with the GA4GH. Together these engines would probably be able to run >80% of the life science workflows that are written today (excluding "Bash workflows" and the like). And if we have that situation, then workflow engine developers would probably think hard about how they could harmonize/standardize their CLI and cloud usage (perhaps even across engines) and push workflow developers to follow best practices, provide JSON schemas for their inputs, etc., and so maybe these problems get smaller and smaller.

How do we get there? I'm not sure. But I think that we should get WES (TES, DRS, TRS) used "in production" as soon as possible for the "standard" use cases that work well with defaults. Once there's a critical mass of users and requests coming in to developers, the rest will probably follow (note that Cromwell and Galaxy already have their own WES shims, and Toil is releasing a WES soon). So I'd probably say that, for now, let's focus on the set of "easy" cases where workflows can be run with default settings, and let's keep everything else "experimental".

That being said, I would really like it if there were some GA4GH-wide policy/strategy on how to deal with the general issue of supporting many vs. supporting common features, because I know that this pops up time and again, here and in other standards. And typically, those discussions don't lead to a consensus and only wind up frustrating people. For example, the time it took to think through what you proposed and write it down, and then no one answers? Or kicking off an endless discussion that then leads nowhere for lack of consensus? Many people are shy to speak up and suggest something anyway. But of the ones that do, I'm sure many won't do it a second time after such an experience. So having some help here, in the form of guidelines on which issues to post or better not to post, with some justification, might perhaps be useful.

vinjana commented 2 years ago

I agree that making the API too flexible is a bad idea. If at all, such categories should only allow values that come from a controlled vocabulary (e.g. an ontology). Even that would still be very flexible (and maybe too flexible). And I now agree that the problem I describe, the original motivation, is something that should not bleed into the API at all.

The only strictly problematic situation that could serve as motivation (that I can think of now) is that the same parameter is used with two different semantics (e.g., in a CLI call, both as a global parameter and as a "run"-mode parameter). This is logically possible but may be considered bad design on the side of the workflow engine. It's reasonable not to support this. After such a decision, I do not see any strict need for annotating input parameters with categories.

I completely agree that more clarity is needed about how the already defined API should or should not be used. I see in other discussions in this project that there are advocates for extremely flexible API design, which supports your point.

So, concerning this particular thread, I propose to close the issue. Do you agree?