C2SM / Sirocco

AiiDA based Weather and climate workflow tool

workflow input data #28

Closed leclairm closed 3 weeks ago

leclairm commented 1 month ago

Currently we distinguish between data available at the start and sockets in a very indirect way, by testing whether it belongs to the unrolled tasks' outputs. Such data should not be unrolled and should not get an unrolled_date attribute. We need a class for it, like WorkflowInputData or AbsData, and the corresponding list in the Workflow object.
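
A minimal sketch of what such a class split could look like (all class and attribute names here are hypothetical, not the actual Sirocco API):

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical sketch: represent workflow input data explicitly
# instead of inferring it from the unrolled task outputs.

@dataclass
class Data:
    name: str
    path: str

class WorkflowInputData(Data):
    """Available at workflow start; never unrolled, no unrolled_date."""

@dataclass
class GeneratedData(Data):
    """Produced by a task; unrolled with an unrolled_date per cycle point."""
    unrolled_date: Optional[str] = None

class Workflow:
    def __init__(self, data):
        # Dedicated lists, so no indirect membership test against
        # task outputs is needed anymore.
        self.input_data = [d for d in data if isinstance(d, WorkflowInputData)]
        self.generated_data = [d for d in data if isinstance(d, GeneratedData)]
```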

Potentially, we could also make this distinction even clearer, directly in the yaml file, with sub-sections of the data section, like

data:
    workflow input:
      - abs_data_1
      - abs_data_2
    sockets:
      - icon_output_stream_1
      - icon_output_stream_2

To be discussed

In the long run, having workflow input and output data could even help integrating it as an element in an AiiDA workflow.

agoscinski commented 1 month ago

In the long run, having workflow input and output data could even help integrating it as an element in an AiiDA workflow.

Because you also suggested that we could have outputs, I thought about inputs, intermediates, outputs, or, shorter, in, inter, out, which also distinguishes them a bit more from the inputs and outputs in the graph/cycle tasks.

data:
    in:
      - abs_data_1
    inter:
      - icon_output_stream_2
    out:
      - density_file

The naming is also not great, not intuitive, but I don't have a better idea. One could also make these keys of the data entries:

data:
    - abs_data_1:
        available_on_init: True
    - icon_output_stream_2
    - density_file:
        expose_as_output: True

but it does not enforce a nice sorting. Maybe we just highlight the inputs and outputs:

data:
    inputs:
      - abs_data_1
    - icon_output_stream_2
    outputs:
      - density_file

Also not great.

leclairm commented 1 month ago

I think the last suggestion is not a valid yaml file.

I thought about available vs generated. It has the advantage of being rather clear, but it diverges from the notion of input and output at the workflow level (a workflow output is necessarily generated). They are actually not mergeable ideas. So, for our current needs, I would go for clarity with something like this:

data: 
  available:
    - abs_data_1:
        path: /the/absolute/path_1
        type: folder
    - abs_data_2:
        path: /the/abs/path_2
        type: file
  generated:
    - output_task_1:
        path: relative/path_1
        type: folder
    - output_task_2:
        path: relative/path_2
        type: file
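
A small sketch of how a parser could consume this layout (the dict literal stands in for what a YAML loader would return; the helper name is made up):

```python
# Hypothetical sketch: validate and flatten a parsed `data` section
# of the available/generated form proposed above.
parsed = {
    "available": [
        {"abs_data_1": {"path": "/the/absolute/path_1", "type": "folder"}},
    ],
    "generated": [
        {"output_task_1": {"path": "relative/path_1", "type": "folder"}},
    ],
}

def collect(section):
    """Flatten the one-key mappings into (name, path, type) tuples."""
    items = []
    for entry in parsed.get(section, []):
        (name, spec), = entry.items()
        if spec["type"] not in ("file", "folder"):
            raise ValueError(f"unknown data type {spec['type']!r} for {name}")
        items.append((name, spec["path"], spec["type"]))
    return items

available = collect("available")
generated = collect("generated")
```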

Integrating a CW workgraph as an element of a larger AiiDA workflow is still a vague idea without a corresponding need. If it really becomes a thing, we can add inputs and outputs sections at the root level of the yaml file, with elements that must reference items in the data section (available or generated, then).

agoscinski commented 1 month ago

I think the last suggestion is not a valid yaml file.

I think you are right. I already had some doubts when writing it.

I thought about available vs generated.

I am fine with your suggestion. I think generated is very clear, and if available is not clear, one can infer from generated what exactly it means.

we can add inputs and outputs sections at the root level of the yaml file

I think that is also good; that way it becomes clear that it belongs to the workflow.

GeigerJ2 commented 1 month ago

Just chiming in here, as I think it's a suitable place for the discussion rather than opening another issue: for actually running an ICON simulation through aiida-icon with our tool here, if the given data (files or folders) are located on a remote machine, we should create AiiDA orm.RemoteData nodes. This is also specified in the IconCalculation spec, e.g., see here:

        spec.input(
            "dynamics_grid_file",
            valid_type=orm.RemoteData,
        )
        spec.input("ecrad_data", valid_type=orm.RemoteData)

I assume this will be the case for most of the files that are available at the beginning of the workflow (apart from the namelists; e.g., grid files will live on the HPC). However, to generate the orm.RemoteData instances, one has to specify the remote computer on which these files are located. So I'm wondering if we should add another type to the data section, e.g., remote (note that AiiDA's orm.RemoteData doesn't differentiate between files and folders); a computer input would also be required:

data:
  - icon_model_namelist:  # -> This lives locally
        src: /home/geiger_j/aiida_projects/aiida-icon-clm/git-repos/aiida-icon/tests/data/simple_icon_run/inputs/model.namelist
        type: file
  - grid_file:  # -> This lives on the HPC
        src: <remote-path-on-HPC>/icon_grid_simple.nc
        type: remote
        computer: todi

We should also allow specifying the remote computer, e.g., Todi, in the root task, which then gets chosen as the default remote for such data nodes. Open to any other proposals, though.
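
A sketch of the fallback logic this implies (function and key names are assumptions, not an existing API): a remote entry may name its own computer, otherwise it falls back to the workflow-level default.

```python
# Hypothetical sketch: pick the computer for a data entry, falling back
# to a workflow-level default (e.g. set on the root task) when the
# entry does not name one itself.
def resolve_computer(entry, default_computer):
    if entry.get("type") != "remote":
        return None  # local file/folder: no remote computer involved
    return entry.get("computer", default_computer)
```

The returned name could then be passed to something like orm.load_computer when building the orm.RemoteData node.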

leclairm commented 1 month ago

This is also true for the tasks themselves: they need to be told to run on the remote machine, and it is the same machine for all of them, so this should probably be a root property of the workflow, like:

start_date: 2026-01-01T00:00
end_date: 2027-01-01T00:00
host: santis.cscs.ch
cycles:
[...]

Everything that is considered configuration (ASCII files) should:

GeigerJ2 commented 1 month ago

In CW use cases, you don't expect a single workflow to utilize different machines, I suppose? If that's not the case, then I agree, it could even be set as a global property in the beginning of the YAML file.

leclairm commented 1 month ago

In CW use cases, you don't expect a single workflow to utilize different machines, I suppose? If that's not the case, then I agree, it could even be set as a global property in the beginning of the YAML file.

No, data is too large, so a single workflow runs on a single machine. Worst case, we could need a task that downloads data from somewhere.