Open SarahG-579462 opened 1 month ago
I think the `TX_MAX` input could be simplified like so:
    inputs:
      indice:
        type:
          type: enum
          symbols:
            - tx_max
        inputBinding:
          position: 2
      tasmax:
        doc: Maximum daily temperature.
        type: string?
        default: tasmax
        inputBinding:
          # position defaults to 0, so set it explicitly to sort after the indicator name
          position: 3
          prefix: --tasmax
      freq:
        doc: Resampling frequency.
        type:
          - "null"
          - type: enum
            symbols:
              - YS
              - MS
              # ... ?
        default: YS
        inputBinding:
          position: 3
          prefix: --freq
Any input that has a very specific set of values should define a `type: enum`/`symbols`.
This should result (together with the other inputs I didn't repeat) in a call like:
    docker run \
      [...volume mounts, user-id map opts, etc. ...] \
      localhost/xclim:latest \
      xclim \
        --input /tmp/mounted/file.nc \
        --output /tmp/mounted/out.nc \
        tx_max \
        --tasmax tasmax \
        --freq YS
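To make the ordering concrete, here is a minimal sketch of how the bindings turn into that command line. It is a simplified model, not cwltool's implementation: it assumes bindings are sorted by `position` and then by input name, and that an unset `position` counts as 0. The `BINDINGS` table and `build_argv` are hypothetical names, and the option flags are given position 3 (an assumption) so they land after the indicator name.

```python
# Simplified model of CWL command-line assembly (illustrative only).
# Each binding is emitted as "[prefix] value", sorted by (position, input name).

# Hypothetical description of the inputs discussed above.
BINDINGS = {
    "input": {"position": 0, "prefix": "--input"},
    "output": {"position": 1, "prefix": "--output"},
    "indice": {"position": 2, "symbols": ["tx_max"]},
    "tasmax": {"position": 3, "prefix": "--tasmax", "default": "tasmax"},
    "freq": {
        "position": 3,
        "prefix": "--freq",
        "default": "YS",
        "symbols": ["YS", "MS"],
    },
}


def build_argv(base_command, bindings, job):
    """Return the argument list for `job` under this simplified model."""
    parts = []
    for name, binding in bindings.items():
        value = job.get(name, binding.get("default"))
        if value is None:
            continue  # optional input left unset: nothing is emitted
        symbols = binding.get("symbols")
        if symbols is not None and value not in symbols:
            # This is the check an enum type gives you for free.
            raise ValueError(f"{name}={value!r} is not one of {symbols}")
        prefix = binding.get("prefix")
        args = [str(value)] if prefix is None else [prefix, str(value)]
        parts.append(((binding.get("position", 0), name), args))

    parts.sort(key=lambda part: part[0])
    argv = list(base_command)
    for _, args in parts:
        argv.extend(args)
    return argv


if __name__ == "__main__":
    job = {
        "input": "/tmp/mounted/file.nc",
        "output": "/tmp/mounted/out.nc",
        "indice": "tx_max",
        "freq": "MS",
    }
    print(" ".join(build_argv(["xclim"], BINDINGS, job)))
```

In this model, two options sharing a position are ordered by name, so `--freq` is emitted before `--tasmax`; that should not matter to `xclim` as long as both follow the indicator name.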
Defining the inputs this way makes them look more natural and closer to what `xclim` expects (i.e. using `cwltool ... --freq MS` rather than `cwltool ... --TX_MAX.freq MS`).
Similarly, using a job file would use names and values that are easier to define:
    input:
      class: File
      path: "/path/to/file.nc"
    freq: MS
    tasmax: tasmax
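Since cwltool accepts job files in JSON as well as YAML, such a job file can also be generated from a script. A minimal sketch, where `write_job_file` is a hypothetical helper and the parameter names simply mirror the flattened inputs proposed above:

```python
import json
from pathlib import Path


def write_job_file(path, input_nc, **params):
    """Write a CWL job file as JSON and return the job dictionary.

    `input_nc` becomes a CWL File object; every other keyword argument is
    passed through as a plain input value (e.g. freq="MS").
    """
    job = {"input": {"class": "File", "path": str(input_nc)}}
    job.update(params)
    Path(path).write_text(json.dumps(job, indent=2))
    return job


if __name__ == "__main__":
    write_job_file("job.json", "/path/to/file.nc", freq="MS", tasmax="tasmax")
    # Then, assuming a tool definition named tx_max.cwl (hypothetical name):
    #   cwltool tx_max.cwl job.json
```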
To me, it sounds odd that there would be any start-up latency from the container if it was prebuilt, and that nothing triggers rebuilding it each time (modified file in context for example).
I have noticed that calling `xclim` by itself has a noticeable start-up latency, so I'm not sure the container is actually the cause at all.
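One way to check where the time goes is to time the bare CLI against the containerized call. A rough sketch, assuming `xclim` is installed on the host and the `localhost/xclim:latest` image has already been built; `time_command` is a hypothetical helper, and the two commented-out calls are the ones to compare:

```python
import subprocess
import sys
import time


def time_command(cmd, repeats=3):
    """Return the best wall-clock time (seconds) over `repeats` runs of `cmd`."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        subprocess.run(
            cmd,
            check=True,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
        timings.append(time.perf_counter() - start)
    return min(timings)


if __name__ == "__main__":
    # Baseline: cost of starting a Python interpreter and doing nothing.
    print("python start-up:", time_command([sys.executable, "-c", "pass"]))
    # Uncomment where xclim and the image are available:
    # print("xclim on host:", time_command(["xclim", "--help"]))
    # print("xclim in container:", time_command(
    #     ["podman", "run", "--rm", "localhost/xclim:latest", "xclim", "--help"]
    # ))
```

If the host and container timings are close, the latency comes from `xclim` itself (most likely its imports) rather than from the container.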
Another thing to consider when defining the CWL: `xclim` takes an `--output` file path as input.
However, the actual path from the point of view of the CWL/container will be mounted volumes with temporary dirs to do the processing.
Therefore, the path doesn't really matter. Only the file name does. The CWL output should do a `glob` with this in mind. Something along the lines of:
    inputs:
      output:
        type: string
        inputBinding:
          position: 1
          prefix: --output
          # `output` is a plain string (a file name), so it is referenced directly
          valueFrom: "$(runtime.outdir)/$(self)"
    outputs:
      output:
        type: File
        outputBinding:
          glob: "$(inputs.output)"
Then, calling CWL with:

    cwltool --outdir /tmp xclim-tasmax.cwl --input /path/to/in.nc --output result.nc --freq YS

should create the file `/tmp/result.nc`.
But what CWL will have done is actually mount the created temp dirs, retrieve the output from the runtime dir, and stage out the output to the requested `--outdir`.
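The staging steps can be mimicked in a few lines to show why only the file name matters. This is an illustrative simulation under simplifying assumptions (no container, no volume mounts), not what cwltool actually executes; `run_and_stage_out` and the fake tool are hypothetical:

```python
import shutil
import tempfile
from pathlib import Path


def run_and_stage_out(tool, output_name, outdir):
    """Run `tool` in a temporary runtime dir, then copy its output to `outdir`.

    `tool` is any callable that writes a file at the path it is given,
    standing in for `xclim --output <runtime outdir>/<output_name>`.
    """
    outdir = Path(outdir)
    outdir.mkdir(parents=True, exist_ok=True)
    with tempfile.TemporaryDirectory() as runtime_outdir:
        runtime_outdir = Path(runtime_outdir)
        # The tool only ever sees the temporary location.
        tool(runtime_outdir / output_name)
        # Equivalent of the outputBinding glob on the file name.
        matches = sorted(runtime_outdir.glob(output_name))
        if len(matches) != 1:
            raise FileNotFoundError(f"expected one match for {output_name!r}")
        # Stage out to the directory requested by the caller.
        return Path(shutil.copy2(matches[0], outdir / output_name))


if __name__ == "__main__":
    def fake_tool(path):
        path.write_text("pretend this is NetCDF")

    with tempfile.TemporaryDirectory() as requested_outdir:
        staged = run_and_stage_out(fake_tool, "result.nc", requested_outdir)
        print(staged.name, staged.exists())
```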
Addressing a Problem?
CWL is a language to standardize function inputs and outputs, and is used for creating data workflows, particularly in other geospatial applications. It is a planned addition to pygeoapi and, more generally, OGC-Processes. Adding support for xclim to be used through this language would be very helpful for people who don't want to dig through python code and just want a plug-and-play solution to compute indices/bias correct/etc.
Potential Solution
I have a working prototype for individual indicators in CWL at the moment, see the additional context below for the code snippet. It creates a docker container for the command line tool, which means there is a lag in running any command, but this may be acceptable for some users?
I have the beginnings of a prototype for CWL for all commands together, but it is still non-functional. (I don't fully understand the language yet!)
In order to avoid the start-up latency, I see a few options:
Add support for other sections of xclim than just indicator calculation: bias correction, spatial analogues, unit standardization, etc... This could be done by augmenting the CLI for xclim.
Additional context
#### Example Code for the *CWL* indicator calculations
```yaml
cwlVersion: v1.2
class: CommandLineTool
id: xclim_tx_max
label: Maximum temperature
doc: |
  Maximum of daily maximum temperature.
requirements:
  EnvVarRequirement:
    envDef:
      PYTHONPATH: /app
  ResourceRequirement:
    coresMax: 1
    ramMax: 512
hints:
  DockerRequirement:
    dockerPull: localhost/xclim:latest
baseCommand: ["xclim"]
arguments: []
inputs:
  input:
    type: File
    inputBinding:
      position: 0
      prefix: --input
  output:
    type: string
    inputBinding:
      position: 1
      prefix: --output
  TX_MAX:
    type:
      type: record
      fields:
        - name: tasmax
          doc: |
            Maximum daily temperature.
            Default : tasmax.
          type: string?
          inputBinding:
            prefix: --tasmax
        - name: freq
          doc: |
            Resampling frequency.
            Default : YS.
          type: string?
          inputBinding:
            prefix: --freq
      name: tx_max
    inputBinding:
      position: 2
      prefix: tx_max
outputs:
  outdir:
    outputBinding:
      glob: "*.nc"
    type: File[]
```

#### Code for generating indicators CWL, and beginnings of a master CWL
```python
# Generate CWL files from xclim Indicators
import yaml
from pathlib import Path
from xclim.core.utils import InputKind
from loguru import logger

template = Path("cwl_template.yaml")
template_str = template.read_text()
master_template = Path("cwl_master.yaml")
master_str = master_template.read_text()
step_template = Path("cwl_step.yaml")
step_str = step_template.read_text()

# Indentation widths used in the replace()/join() calls below must match
# the placeholder positions in these templates and in the template files.
fields_template_str = """
- name: {param}
  doc: |
    {doc}
  type: string{optional_flag}
  inputBinding:
    prefix: --{param}
"""
fields_template_enum = """
- name: {param}
  doc: |
    {doc}
  type:
    {optional_flag}
    - type: enum
      symbols:
        {symbols}
  inputBinding:
    prefix: "--{param}"
"""
input_template = """
  {indicator_id}:
    type:
      type: record
      fields:
        {fields}
      name: {indicator}
    inputBinding:
      position: 2
      prefix: {indicator}
"""

docker_path = "/app"
docker_image = "localhost/xclim:latest"

import xclim as xc

param_str = "{indicator_id}.{param}: {indicator_id}.{param}"
# indicators = xc.core.indicator.registry
indicators = {'TX_MAX': xc.core.indicator.registry['TX_MAX']}
steps = []
param_fields = []

for name, ind in indicators.items():
    ind_instance = ind.get_instance()
    logger.info("Processing Indicator: " + ind_instance.identifier)
    field_arr = []
    param_list = []
    for param_name, param in ind_instance.parameters.items():
        # Skip the dataset argument and catch-all keyword arguments.
        if param_name in ["ds"] or param.kind == InputKind.KWARGS:
            continue
        param_list.append(param_str.format(param=param_name, indicator_id=name))
        optional_flag = ""
        doc = [param.description.replace("\n", "\n    ")]
        if param.default:
            doc.append(f"Default : {param.default}.")
        if "choices" in param:
            # Parameters with a fixed set of values become CWL enums.
            choices = f"\n    Choices: {param.choices}"
            doc.append(choices)
            doc = "\n    ".join(doc)
            if param.default:
                # An optional enum is a union of "null" and the enum type.
                optional_flag = '- "null"'
            field = fields_template_enum.format(
                param=param_name,
                symbols="\n        ".join([f'- "{c}"' for c in param.choices]),
                optional_flag=optional_flag,
                doc=doc,
            )
        else:
            if param.default:
                optional_flag = '?'
            doc = "\n    ".join(doc)
            field = fields_template_str.format(
                param=param_name,
                optional_flag=optional_flag,
                doc=doc,
            )
        field_arr.append(field)

    fields = "\n".join([field.replace("\n", "\n        ") for field in field_arr])
    # param_fields.append("\n".join([field.replace("\n", "\n        ") for field in field_arr]))
    inputs = input_template.format(
        indicator_id=name, indicator=ind_instance.identifier, fields=fields
    )
    param_fields.append(inputs)

    cwl = template_str.format(
        indicator_id=name,
        indicator=ind_instance.identifier,
        indicator_label=ind_instance.title,
        indicator_doc=ind_instance.abstract.replace("\n", "\n  "),
        docker_path=docker_path,
        docker_image=docker_image,
        indicator_inputs=inputs,
    )
    filename = Path(f"cwl/{name}.yml")
    with open(filename, "w") as f:
        f.write(cwl)

    # For each indicator, also generate a step and add it to the master CWL.
    param_list = '\n    '.join(param_list)
    step = step_str.format(
        indicator_id=name, indicator=name, file=filename.name, params=param_list
    )
    steps.append(step.replace("\n", "\n      "))
    break

master_cwl = master_str.format(
    steps="\n      ".join(steps),
    params="\n    ".join([p.replace("\n", "\n    ") for p in param_fields]),
)
logger.info("Writing master CWL")
with open("cwl/master.yml", "w") as f:
    f.write(master_cwl)
```

#### Creating a docker image for xclim:
```docker
FROM python:3.10-slim
WORKDIR /app
RUN pip install xclim loguru h5netcdf --no-cache-dir
USER root
COPY cwl.py .
COPY *.yaml .
RUN mkdir /app/cwl
#RUN python -m compileall `python -c "from distutils.sysconfig import get_python_lib; print(get_python_lib())"`
USER $USER
```

#### Templates for the CWL generator:
`cwl_template.yaml`:

```yaml
cwlVersion: v1.2
class: CommandLineTool
id: xclim_{indicator}
label: {indicator_label}
doc: |
  {indicator_doc}
requirements:
  EnvVarRequirement:
    envDef:
      PYTHONPATH: {docker_path}
  ResourceRequirement:
    coresMax: 1
    ramMax: 512
hints:
  DockerRequirement:
    dockerPull: {docker_image}
baseCommand: ["xclim"]
arguments: []
inputs:
  input:
    type: File
    inputBinding:
      position: 0
      prefix: --input
  output:
    type: string
    inputBinding:
      position: 1
      prefix: --output
{indicator_inputs}
outputs:
  outdir:
    outputBinding:
      glob: "*.nc"
    type: File[]
```

`cwl_step.yaml`:

```yaml
{indicator}:
  run: {file}
  when: $(inputs.indicator == "{indicator}")
  in:
    input: input
    output: output
    {params}
  out:
    outdir: outdir
```

`cwl_master.yaml`:

```yaml
cwlVersion: v1.2
$graph:
  - class: Workflow
    requirements:
      - MultipleInputFeatureRequirement
      - SubworkflowFeatureRequirement
      - InlineJavascriptRequirement
      - DockerRequirement
    inputs:
      input:
        type: string
      output:
        type: string
      indicator:
        type: string
      {params}
    steps:
      {steps}
    outputs:
      outdir:
        type: File
        outputSource:
          valueFrom: ${{ inputs.indicator + '/outdir' }}
```

#### Commands for docker/podman, running the CWL:
Build the image:
`podman build -t localhost/xclim:latest .`

Create the CWL files:
`podman run -v $(pwd)/cwl/:/app/cwl -v $(pwd)/cwl.py:/app/cwl.py localhost/xclim:latest python /app/cwl.py`

Run indicator calculations:
`cwltool --podman --outdir runs cwl/TX_MAX.yml --input data/daily_surface_cancities_1990-1993.nc --output out.nc --TX_MAX.freq ME`

(Not working) Run indicator calculations through the master CWL:
`cwltool --podman --outdir runs cwl/master.yml --input data/daily_surface_cancities_1990-1993.nc --output out.nc --indicator TX_MAX --TX_MAX.freq MS`

Related issues: #1949
This idea came up during the CLINT/OGC code sprint in Bonn, this October.