ga4gh / task-execution-schemas

Apache License 2.0
80 stars 28 forks source link

Proposal: add "additional_properties" to tesResources #127

Closed MattMcL4475 closed 3 years ago

MattMcL4475 commented 4 years ago

According to the WDL spec, "The runtime section defines key/value pairs for runtime information needed for this task":

https://github.com/openwdl/wdl/blob/master/versions/1.0/SPEC.md#runtime-section

This commit extends tesResources to support an arbitrary dictionary of additional properties, which ensures WDL parsers can pass-through their full list of runtime key/value pairs to a TES backend. The canonical example of its use would be for WDL authors to be able to specify a specific cloud VM SKU in their WDL files.

adamstruck commented 4 years ago

Link to previous discussion: https://github.com/ga4gh/task-execution-schemas/issues/84

MattMcL4475 commented 3 years ago

Hi everyone, could we please consider this proposal (or equivalent functionality) for TES 1.1? We would like to implement this ASAP in Cromwell on Azure (https://github.com/microsoft/CromwellOnAzure). The specific feature we need is the ability to pass through an Azure VM SKU (string) from a WDL file to the backend (for example, a workflow might need a GPU machine, and so we want to be able to specify the GPU VM SKU in the WDL as a string, e.g. "Standard_NC64as_T4_v3"). We are happy to support any alternative solution that meets this requirement. Thank you.

aniewielska commented 3 years ago

There was the entire #84 that failed to conclude. I added my own take on it there, which is supportive of having custom resource definitions in TES. @kellrott - what is your view on additional_properties - so custom, backend specific runtime properties?

This PR as it is is targeting OpenAPI 2 version of the spec that we left in the repo (if we should is another story), whereas the current version is OpenAPI 3 (it would probably not change the contents of the PR in any way, though). We will soon update our develop branch to reflect recent changes in the API and make contributions possible. The remaining issues (if we hopefully reach conclusion that TES should have custom resources at all) are:

kellrott commented 3 years ago

I worry about just adding a blank set of parameters to the task, but not documenting them anywhere. If you encounter a new system, you won't know which magic key words will connect to the backend and have an effect. You would have to find the documentation for that specific sever installation or talk to someone. And what if you get a config misspelling and think you are doing something but the parameter has no actual effect, because it was supposed to be upper case.

One option might be to add a field in the server info that reports back all the property keys that a particular system responds to, and when you do a 'get task' after posting, the server is allowed to remove property keys it ignored?

MattMcL4475 commented 3 years ago

Hi Kyle @kellrott, thanks for your comments! Agreed that the keys should be case insensitive. Also agree that backends should document which key/values they support, and log warnings when there are keys included that are not interpreted/supported.

I think a good precedent for supporting an arbitrary dictionary of properties would be HTTP headers. RFC 6648 has some helpful guidance for creators of new parameters:

Creators of new parameters to be used in the context of application protocols:

  1. SHOULD assume that all parameters they create might become standardized, public, commonly deployed, or usable across multiple implementations.
  2. SHOULD employ meaningful parameter names that they have reason to believe are currently unused.
  3. SHOULD NOT prefix their parameter names with "X-" or similar constructs.

For Cromwell on Azure, our immediate need is to support passing the cloud VM size. On the three major clouds, these are named as follows:

Google: "Machine name", .e.g "e2-highcpu-32" AWS: "Instance Size", e.g. "m5.24xlarge" Azure: "Size", e.g. "Standard_NV32as_v4"

In conclusion, if this proposal is adopted, the very first key we would want to carefully consider naming would be the above; could be "VmSize" or "VmType" or similar.

aniewielska commented 3 years ago

@kellrott The idea of listing permitted additional_properties (or however we call them) in serviceInfo sounds like a good use of service info. However, if we list the keys only, the values can still be not interpretable to the backend. This will likely be the case with the example of VmSize where values would be different for different clouds. Removing not supported additional_properties from get task is a bit of a hack. It would be valuable to both know what was requested and what was used in the end. Is there a place for information about effective/used resources somewhere in the execution log, perhaps? @MattMcL4475 - how do you think the change would work from Cromwell perspective? Would Cromwell just pass through all unrecognised runtime attributes as additional properties (so everything else than the currently recognised: cpu, memory etc)?

kellrott commented 3 years ago

For the additional_properties in the serviceInfo, could we have the property key and a text description. So it would be something like

{ "param" : "VmSize", "docs" : "Select the running system using an AWS VM name"} 

It's not machine readable. Before moving to a system, a task deployment system (like cromwell) would have to be configured the the system specific flags, if they are to be used.

I agree that silently removing unused params is a bit hacky, but the other options are to 1) fail the job when additional_properties aren't recognized or 2) creating a new warning message output field in the task logging.

As you said, given that the job specification already allows for defining cpu_cores, disk_gb and ram_gb, a smart task runner should be able to select out the most appropriate VM. It seems the most use cases would center on making sure things like GPUs, or selecting between scratch storage types. Is that what you are considering @MattMcL4475 ? If the only use case is around making sure the resource requests correctly map to the vm selection, that might be something that would be better suited as a configuration of the TES server implementation.

MattMcL4475 commented 3 years ago

@aniewielska yes I think Cromwell would pass through all unrecognized runtime attributes as additional properties

@kellrott unfortunately this cannot be solved by logic - we have a hard requirement for the customer to be able to specify the exact VM they need. Sometimes they need to optimize for cost; sometimes for speed. Sometimes they need to specify hardware capabilities or co-processors (CPU model, GPU, FPGA, networking, etc.). Sometimes they have negotiated for a discounted price on a specific VM in a specific region with reserved capacity. There are simply too many variables and competing concerns to make automatic VM selection logic feasible.

kellrott commented 3 years ago

I think there is enough justification to state this is a required feature, and we can't figure out way to do it outside of the API. I would propose we target this for version 1.1. Issues that need to be addressed:

@aniewielska Any other requirements?

aniewielska commented 3 years ago

@kellrott - the list looks good to me. I am leaning towards leniency in case of unrecognised params (log/report and still execute the task, rather than return an error). And would welcome an alternative name than additional_properties reflecting better that those are custom, backend-specific resource requests. Perhaps explaining it in description (and the relation to the new serviceInfo field) is sufficient.