Ouranosinc / xclim

Library of derived climate variables, i.e. climate indicators, based on xarray.
https://xclim.readthedocs.io/en/stable/
Apache License 2.0

Add CWL support for xclim #1955

Open SarahG-579462 opened 1 month ago

SarahG-579462 commented 1 month ago

Addressing a Problem?

CWL (Common Workflow Language) is a language for standardizing function inputs and outputs. It is used to create data workflows, notably in other geospatial applications, and is a planned addition to pygeoapi and, more generally, OGC-Processes. Supporting the use of xclim through this language would be very helpful for people who don't want to dig through Python code and just want a plug-and-play solution to compute indices, apply bias correction, etc.

Potential Solution

Additional context

#### Example Code for the *CWL* indicator calculations

```yaml
cwlVersion: v1.2
class: CommandLineTool
id: xclim_tx_max
label: Maximum temperature
doc: |
  Maximum of daily maximum temperature.
requirements:
  EnvVarRequirement:
    envDef:
      PYTHONPATH: /app
  ResourceRequirement:
    coresMax: 1
    ramMax: 512
hints:
  DockerRequirement:
    dockerPull: localhost/xclim:latest
baseCommand: ["xclim"]
arguments: []
inputs:
  input:
    type: File
    inputBinding:
      position: 0
      prefix: --input
  output:
    type: string
    inputBinding:
      position: 1
      prefix: --output
  TX_MAX:
    type:
      type: record
      fields:
        - name: tasmax
          doc: |
            Maximum daily temperature.
            Default : tasmax.
          type: string?
          inputBinding:
            prefix: --tasmax
        - name: freq
          doc: |
            Resampling frequency.
            Default : YS.
          type: string?
          inputBinding:
            prefix: --freq
      name: tx_max
    inputBinding:
      position: 2
      prefix: tx_max
outputs:
  outdir:
    outputBinding:
      glob: "*.nc"
    type: File[]
```
#### Code for generating indicators CWL, and beginnings of a master CWL

```python
# Generate CWL files from xclim Indicators
import yaml
from pathlib import Path
from xclim.core.utils import InputKind
from loguru import logger

template = Path("cwl_template.yaml")
template_str = template.read_text()
master_template = Path("cwl_master.yaml")
master_str = master_template.read_text()
step_template = Path("cwl_step.yaml")
step_str = step_template.read_text()

fields_template_str = """
- name: {param}
  doc: |
    {doc}
  type: string{optional_flag}
  inputBinding:
    prefix: --{param}
"""

fields_template_enum = """
- name: {param}
  doc: |
    {doc}
  type:
    {optional_flag}
    - type: enum
      symbols:
        {symbols}
  inputBinding:
    prefix: "--{param}"
"""

input_template = """
{indicator_id}:
  type:
    type: record
    fields:
      {fields}
    name: {indicator}
  inputBinding:
    position: 2
    prefix: {indicator}
"""

docker_path = "/app"
docker_image = "localhost/xclim:latest"

import xclim as xc

param_str = "{indicator_id}.{param}: {indicator_id}.{param}"

# indicators = xc.core.indicator.registry
indicators = {'TX_MAX':xc.core.indicator.registry['TX_MAX']}

steps = []
param_fields = []
for name, ind in indicators.items():
    ind_instance = ind.get_instance()
    logger.info("Processing Indicator: " + ind_instance.identifier)
    field_arr = []
    param_list = []
    for param_name, param in ind_instance.parameters.items():
        if param_name in ["ds"] or param.kind == InputKind.KWARGS:
            continue
        param_list.append(param_str.format(param=param_name, indicator_id=name))
        optional_flag = ""
        doc = [param.description.replace("\n", "\n ")]
        if param.default:
            doc.append(f"Default : {param.default}.")
        if "choices" in param:
            choices = f"\n Choices: {param.choices}"
            doc.append(choices)
            doc = "\n ".join(doc)
            if param.default:
                optional_flag = '- type: "null"'
            field = fields_template_enum.format(
                param=param_name,
                symbols="\n ".join([f'- "{c}"' for c in param.choices]),
                optional_flag = optional_flag,
                doc = doc,
            )
        else:
            if param.default:
                optional_flag = '?'
            doc = "\n ".join(doc)
            field = fields_template_str.format(
                param=param_name,
                optional_flag=optional_flag,
                doc=doc,
            )
        field_arr.append(field)

    fields = "\n".join([field.replace("\n", "\n ") for field in field_arr])
    #param_fields.append("\n".join([field.replace("\n", "\n ") for field in field_arr]))
    inputs = input_template.format(
        indicator_id=name,
        indicator=ind_instance.identifier,
        fields=fields
    )
    param_fields.append(inputs)
    cwl = template_str.format(
        indicator_id=name,
        indicator=ind_instance.identifier,
        indicator_label=ind_instance.title,
        indicator_doc=ind_instance.abstract.replace("\n", "\n "),
        docker_path=docker_path,
        docker_image=docker_image,
        indicator_inputs=inputs,
    )
    filename = Path(f"cwl/{name}.yml")
    with open(filename, "w") as f:
        f.write(cwl)

    # for each indicator, also generate a step and add to the master CWL.abs
    param_list = '\n '.join(param_list)
    step = step_str.format(
        indicator_id=name,
        indicator=name,
        file = filename.name,
        params=param_list
    )
    steps.append(step.replace("\n", "\n "))
    break

master_cwl = master_str.format(
    steps="\n ".join(steps),
    params="\n ".join([p.replace("\n", "\n ") for p in param_fields]),
)
logger.info("Writing master CWL")
with open("cwl/master.yml", "w") as f:
    f.write(master_cwl)
```
#### Creating a docker image for xclim:

```docker
FROM python:3.10-slim
WORKDIR /app
RUN pip install xclim loguru h5netcdf --no-cache-dir
USER root
COPY cwl.py .
COPY *.yaml .
RUN mkdir /app/cwl
#RUN python -m compileall `python -c "from distutils.sysconfig import get_python_lib; print(get_python_lib())"`
USER $USER
```
#### Templates for the CWL generator:

`cwl_template.yml`:

```yaml
cwlVersion: v1.2
class: CommandLineTool
id: xclim_{indicator}
label: {indicator_label}
doc: |
  {indicator_doc}
requirements:
  EnvVarRequirement:
    envDef:
      PYTHONPATH: {docker_path}
  ResourceRequirement:
    coresMax: 1
    ramMax: 512
hints:
  DockerRequirement:
    dockerPull: {docker_image}
baseCommand: ["xclim"]
arguments: []
inputs:
  input:
    type: File
    inputBinding:
      position: 0
      prefix: --input
  output:
    type: string
    inputBinding:
      position: 1
      prefix: --output
  {indicator_inputs}
outputs:
  outdir:
    outputBinding:
      glob: "*.nc"
    type: File[]
```

`cwl_step.yml`:

```yaml
{indicator}:
  run: {file}
  when: $( (inputs.indicator == {indicator} )
  in:
    input: input
    output: output
    {params}
  out:
    outdir: outdir
```

`cwl_master.yml`

```yaml
cwlVersion: v1.2
$graph:
  - class: Workflow
    requirements:
      - MultipleInputFeatureRequirement
      - SubworkflowFeatureRequirement
      - InlineJavascriptRequirement
      - DockerRequirement
    inputs:
      input:
        type: string
      output:
        type: string
      indicator:
        type: string
      {params}
    steps:
      {steps}
    outputs:
      outdir:
        type: File
        outputSource:
          valueFrom: ${{ inputs.indicator + '/outdir' }}
```
#### Commands for docker/podman, running the CWL:

Build the image:
`podman build -t localhost/xclim:latest .`

Create the CWL files:
`podman run -v $(pwd)/cwl/:/app/cwl -v $(pwd)/cwl.py:/app/cwl.py localhost/xclim:latest python /app/cwl.py`

Run Indicator calculations:
`cwltool --podman --outdir runs cwl/TX_MAX.yml --input data/daily_surface_cancities_1990-1993.nc --output out.nc --TX_MAX.freq ME`

(not working) run indicator calculations thru master CWL:
`cwltool --podman --outdir runs cwl/master.yml --input data/daily_surface_cancities_1990-1993.nc --output out.nc --indicator TX_MAX --TX_MAX.freq MS`

Related issues: #1949

This idea came up during the CLINT/OGC code sprint in Bonn, this October.

Contribution

Code of Conduct

fmigneault commented 1 month ago

I think the TX_MAX could be simplified like so:

inputs:
  indice:
    type: 
      type: enum
      symbols:
      - tx_max
    inputBinding:
      position: 2

  tasmax:
    doc: Maximum daily temperature.
    type: string?
    default: tasmax
    inputBinding:
      # explicit position so the option is placed after the indicator name (position 2)
      position: 3
      prefix: --tasmax

  freq:
    doc: Resampling frequency.
    type: 
    - "null"
    - type: enum
      symbols:
      - YS
      - MS
      # ... ?
    default: YS
    inputBinding:
      position: 4
      prefix: --freq

Any input that only accepts a specific set of values should be declared with type: enum and list those values under symbols.
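As a rough sketch of that rule, the snippet below builds one CWL input definition as a plain Python dict and switches to an enum type whenever a list of allowed values is supplied. The helper `cwl_input` and its arguments are hypothetical (not part of xclim or cwltool), and the result is printed as JSON on the assumption that the CWL document is written in JSON rather than YAML.

```python
import json


def cwl_input(name, doc, choices=None, default=None):
    """Build one CWL input definition (hypothetical helper, not an xclim API)."""
    # Use an enum when the allowed values are known, a plain string otherwise.
    base_type = {"type": "enum", "symbols": list(choices)} if choices else "string"
    entry = {
        "doc": doc,
        # An input with a default is optional, so "null" joins the type union.
        "type": ["null", base_type] if default is not None else base_type,
        "inputBinding": {"prefix": f"--{name}"},
    }
    if default is not None:
        entry["default"] = default
    return entry


freq = cwl_input("freq", "Resampling frequency.", choices=["YS", "MS"], default="YS")
print(json.dumps({"freq": freq}, indent=2))
```

Building the structure as data and serializing it at the end avoids having to manage indentation by hand, which is where string templates tend to go wrong.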

Together with the other inputs, which I didn't repeat, this should result in a call like:

docker run \
  [...volume mounts, user-id map opts, etc. ...] \
  localhost/xclim:latest \
  xclim \
  --input /tmp/mounted/file.nc \
  --output /tmp/mounted/out.nc \
  tx_max \
  --tasmax tasmax \
  --freq YS 

Defining the inputs this way makes them look more natural and closer to what xclim expects (i.e., using cwltool ... --freq MS rather than cwltool ... --TX_MAX.freq MS).
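To make the mapping from input definitions to the final call more concrete, here is a simplified sketch of how bound inputs are ordered by `position` and joined with their `prefix`. It is an illustration only: `build_argv` is a hypothetical function, it breaks ties by declaration order, and it ignores most of the real CWL binding rules (arrays, `separate`, `valueFrom`, and so on). It also assumes the indicator options are given positions after the indicator name, since an input without an explicit position defaults to 0.

```python
def build_argv(base_command, bindings, job):
    """Assemble an argument list from simplified CWL-style input bindings.

    `bindings` maps an input name to a dict with optional "position" and
    "prefix" keys; `job` maps input names to values. Hypothetical helper that
    only mimics the ordering idea, not cwltool's actual implementation.
    """
    bound = []
    for order, (name, binding) in enumerate(bindings.items()):
        value = job.get(name)
        if value is None:  # unset optional inputs add nothing to the command
            continue
        bound.append((binding.get("position", 0), order, binding.get("prefix"), str(value)))

    argv = list(base_command)
    # Sort by position first, then by declaration order (a simplification).
    for _position, _order, prefix, value in sorted(bound, key=lambda b: b[:2]):
        if prefix:
            argv.append(prefix)
        argv.append(value)
    return argv


bindings = {
    "input": {"position": 0, "prefix": "--input"},
    "output": {"position": 1, "prefix": "--output"},
    "indice": {"position": 2},  # the indicator name has no prefix
    "tasmax": {"position": 3, "prefix": "--tasmax"},
    "freq": {"position": 4, "prefix": "--freq"},
}
job = {
    "input": "/tmp/mounted/file.nc",
    "output": "/tmp/mounted/out.nc",
    "indice": "tx_max",
    "tasmax": "tasmax",
    "freq": "YS",
}
argv = build_argv(["xclim"], bindings, job)
print(" ".join(argv))
```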

Similarly, using a job file would use names and values that are easier to define:

input:
  class: File
  path: "/path/to/file.nc"
freq: MS
tasmax: tasmax
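The same job parameters could also be written from Python. The sketch below assumes a JSON job file (cwltool accepts JSON as well as YAML job files); the helper `write_job` and the file name `tx_max_job.json` are made up for illustration.

```python
import json
import tempfile
from pathlib import Path


def write_job(path, input_file, **options):
    """Write a CWL job file as JSON (hypothetical helper)."""
    job = {"input": {"class": "File", "path": str(input_file)}}
    job.update(options)  # e.g. freq="MS", tasmax="tasmax"
    Path(path).write_text(json.dumps(job, indent=2))
    return job


job_path = Path(tempfile.gettempdir()) / "tx_max_job.json"
job = write_job(job_path, "/path/to/file.nc", freq="MS", tasmax="tasmax")
print(job_path.read_text())
```

The resulting file would then be passed to cwltool as the positional argument after the CWL document, in place of the individual command-line options.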

It seems odd to me that the container would add any start-up latency if it is prebuilt and nothing triggers a rebuild on each run (a modified file in the build context, for example).

I have noticed that calling xclim by itself has a noticeable start-up latency, so I am not sure the container is the cause at all.
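One way to check where the latency comes from would be to time the bare command outside any container and compare it with the containerized run. A minimal sketch using only the standard library follows; `time_command` is a hypothetical helper, and the xclim commands in the trailing comment are suggestions that are not executed here.

```python
import subprocess
import sys
import time


def time_command(argv, repeats=3):
    """Return the fastest wall-clock time (seconds) over several runs of argv."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        subprocess.run(
            argv,
            check=True,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
        timings.append(time.perf_counter() - start)
    return min(timings)


# Baseline: starting the Python interpreter alone.
baseline = time_command([sys.executable, "-c", "pass"])
print(f"bare interpreter: {baseline:.3f} s")

# To compare on a machine with xclim installed (not run here):
#   time_command([sys.executable, "-c", "import xclim"])
#   time_command(["xclim", "--help"])
```

If the import alone accounts for most of the delay, the container is unlikely to be the main cause.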


Another thing to consider when defining the CWL: xclim takes an --output file path as input. However, from the point of view of the CWL/container, the actual path will be inside mounted volumes backed by temporary directories used for the processing.

Therefore, the path doesn't really matter; only the file name does. The CWL output should use a glob that takes this into account. Something along the lines of:

inputs:
  output: 
    type: string
    inputBinding:
      position: 1
      prefix: --output
      valueFrom: "$(runtime.outdir)/$(self)"

outputs:
  output:
    type: File
    outputBinding:
      glob: "$(inputs.output)"

Then, calling CWL with:

cwltool --outdir /tmp xclim-tasmax.cwl --input /path/to/in.nc --output result.nc --freq YS

This should create the file /tmp/result.nc. Behind the scenes, CWL will have mounted the temporary directories it created, retrieved the output from the runtime directory, and staged it out to the requested --outdir.
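The staging behaviour can be imitated in plain Python. This is only a toy model of what a CWL runner does, not cwltool's code: a temporary directory stands in for the runtime output directory, a dummy file stands in for xclim's result, and `stage_out` is a hypothetical helper.

```python
import shutil
import tempfile
from pathlib import Path


def stage_out(runtime_outdir, pattern, final_outdir):
    """Copy files matching `pattern` from the runtime dir to the requested outdir."""
    final_outdir = Path(final_outdir)
    final_outdir.mkdir(parents=True, exist_ok=True)
    staged = []
    for found in sorted(Path(runtime_outdir).glob(pattern)):
        target = final_outdir / found.name
        shutil.copy2(found, target)
        staged.append(target)
    return staged


output_name = "result.nc"  # only the file name matters, not the directory

with tempfile.TemporaryDirectory() as runtime_outdir, tempfile.TemporaryDirectory() as outdir:
    # Stand-in for the tool writing its output inside the mounted runtime dir.
    (Path(runtime_outdir) / output_name).write_bytes(b"placeholder")

    staged = stage_out(runtime_outdir, output_name, outdir)
    staged_names = [p.name for p in staged]
    staged_exists = all(p.exists() for p in staged)
    print(staged_names)
```

The caller never sees the runtime directory; it only chooses the file name and the final output directory, which is why the glob is keyed on the name alone.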