> In the long run, having workflow input and output data could even help integrating it as an element in an AiiDA workflow.
Because you also suggested that we could have outputs, I thought about `inputs`, `intermediates`, `outputs`, or, shorter, `in`, `inter`, `out`, which also distinguishes it a bit more from the inputs and outputs in the graph/cycle-tasks.
```yaml
data:
  in:
    - abs_data_1
  inter:
    - icon_output_stream_2
  out:
    - density_file
```
It is also not great naming, not intuitive, but I don't have a better idea. One could also make this information keys of the individual data items:
```yaml
data:
  - abs_data_1:
      available_on_init: True
  - icon_output_stream_2
  - density_file:
      expose_as_output: True
```
but it does not enforce a nice sorting. Maybe we just highlight the inputs and outputs:
```yaml
data:
  inputs:
    - abs_data_1
  - icon_output_stream_2
  outputs:
    - density_file
```
Also not great.
I think the last suggestion is not a valid yaml file.
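Presumably because it mixes mapping keys (`inputs:`, `outputs:`) with plain sequence items at the same indentation level under `data`. A quick way to check (a minimal sketch using PyYAML; the exact error message depends on the parser):

```python
import yaml  # PyYAML

snippet = """
data:
  inputs:
    - abs_data_1
  - icon_output_stream_2
  outputs:
    - density_file
"""

try:
    yaml.safe_load(snippet)
except yaml.YAMLError as exc:
    # PyYAML rejects mixing mapping keys and sequence items at the same
    # level, roughly: "expected <block end>, but found '-'".
    print(exc)
```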
I thought about `available` vs `generated`. It has the advantage of being rather clear, but it diverges from the notion of input and output at the workflow level (a workflow output is necessarily generated). They are actually not mergeable ideas. So, for our current needs, I would go for clarity with something like this:
```yaml
data:
  available:
    - abs_data_1:
        path: /the/absolute/path_1
        type: folder
    - abs_data_2:
        path: /the/abs/path_2
        type: file
  generated:
    - output_task_1:
        path: relative/path_1
        type: folder
    - output_task_2:
        path: relative/path_2
        type: file
```
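As a side note, a minimal sketch of how this split could map onto the classes mentioned in the issue description (`WorkflowInputData` and `Workflow` are the names floated there; `GeneratedData` and the fields here are purely illustrative):

```python
from dataclasses import dataclass, field


@dataclass
class WorkflowInputData:
    """An item of the 'available' section: exists before the workflow starts."""
    name: str
    path: str   # absolute path, as in the example above
    type: str   # "file" or "folder"


@dataclass
class GeneratedData:
    """An item of the 'generated' section: produced by a task at run time."""
    name: str
    path: str   # relative path, as in the example above
    type: str


@dataclass
class Workflow:
    available: list[WorkflowInputData] = field(default_factory=list)
    generated: list[GeneratedData] = field(default_factory=list)
```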
Integrating a CW workgraph as an element of a larger AiiDA workflow is still a vague idea without corresponding needs. If it really becomes a thing, we can add `inputs` and `outputs` sections at the root level of the yaml file, with elements in there that must reference items in the `data` section (`available` or `generated` then).
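As a rough illustration of that idea (the root-level section names and the reference check below are hypothetical, not an implemented feature):

```python
import yaml  # PyYAML, only used here to illustrate the structure

# Hypothetical root-level 'inputs'/'outputs' sections whose elements must
# reference items declared in the 'data' section.
doc = yaml.safe_load("""
inputs:
  - abs_data_1
outputs:
  - output_task_2
data:
  available:
    - abs_data_1:
        path: /the/absolute/path_1
        type: folder
  generated:
    - output_task_2:
        path: relative/path_2
        type: file
""")

declared = {
    name
    for section in ("available", "generated")
    for item in doc["data"][section]
    for name in item
}
# Every workflow-level input/output has to be a known data item.
assert set(doc["inputs"]) <= declared
assert set(doc["outputs"]) <= declared
```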
> I think the last suggestion is not a valid yaml file.

I think you are right. I already had some doubts when writing it.

> I thought about `available` vs `generated`.

I am fine with your suggestion. I think `generated` is very clear, and if `available` is not clear, one can infer from `generated` what exactly it means.

> we can add inputs and outputs sections at the root level of the yaml file

I think that is also good; that way it becomes clear that it belongs to the workflow.
Just chiming in here, as I think it's a suitable place for the discussion, rather than opening another issue:
For actually running an ICON simulation through aiida-icon with our tool here, if given data (files or folders) are located on a remote machine, we should create AiiDA `orm.RemoteData` nodes. This is also given in the `IconCalculation` spec, e.g., see here:
```python
spec.input(
    "dynamics_grid_file",
    valid_type=orm.RemoteData,
)
spec.input("ecrad_data", valid_type=orm.RemoteData)
```
I assume this will be the case for most of the files that are available at the beginning of the workflow (apart from the namelists; e.g., grid files will live on the HPC). However, to generate the `orm.RemoteData` instances, one has to specify the remote computer on which these files are located. So I'm wondering if we should add another `type` to the `data` section, e.g., `remote` (note that AiiDA's `orm.RemoteData` doesn't differentiate between files and folders), in which case a `computer` entry would also be required:
```yaml
data:
  - icon_model_namelist:  # -> This lives locally
      src: /home/geiger_j/aiida_projects/aiida-icon-clm/git-repos/aiida-icon/tests/data/simple_icon_run/inputs/model.namelist
      type: file
  - grid_file:  # -> This lives on the HPC
      src: <remote-path-on-HPC>/icon_grid_simple.nc
      type: remote
      computer: todi
```
We should also allow specifying the remote computer, e.g., Todi, in the `root` task, which then gets chosen as the default remote for such data nodes. Open to any other proposals, though.
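For what it's worth, a rough sketch of what creating such a node could look like on the AiiDA side (assuming a computer labelled `todi` is already configured in the profile; the placeholder path is taken from the example above):

```python
from aiida import load_profile, orm

load_profile()  # requires an existing AiiDA profile

# Wrap a file that already sits on the HPC into a RemoteData node,
# pointing it at the configured computer.
computer = orm.load_computer("todi")
grid_file = orm.RemoteData(
    remote_path="<remote-path-on-HPC>/icon_grid_simple.nc",
    computer=computer,
)
grid_file.store()
```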
This is also true for the tasks themselves: they also need to be told that they have to run on the remote machine, and this is the same machine for all of them, so it should probably be a root property of the workflow, like:
```yaml
start_date: 2026-01-01T00:00
end_date: 2027-01-01T00:00
host: santis.cscs.ch
cycles:
  [...]
```
Everything that is considered configuration (ASCII files) should:
In CW use cases, you don't expect a single workflow to utilize different machines, I suppose? If that's not the case, then I agree, it could even be set as a global property in the beginning of the YAML file.
> In CW use cases, you don't expect a single workflow to utilize different machines, I suppose? If that's not the case, then I agree, it could even be set as a global property in the beginning of the YAML file.
No data is too large, so a single workflow runs on a single machine. Worst case, we could need a task that downloads data from somewhere.
Currently we distinguish between data available at the start and sockets in a very indirect way, by testing if it belongs to the unrolled tasks' outputs. Such data should not get unrolled nor get an `unrolled_date` attribute. We need a class for it, like `WorkflowInputData` or `AbsData`, and the corresponding list in the `Workflow` object.

Potentially, we could also make this distinction even clearer, directly in the yaml file with sub sections of the `data` section, like:

To be discussed:

In the long run, having workflow input and output data could even help integrating it as an element in an AiiDA workflow.