Intermediate representation containing concrete input / output dependencies of tasks

GeigerJ2 commented 5 months ago

EDIT: As of now (2024-06-06), the input / output file dependencies are correctly captured as AiiDA data nodes inside each WcTask, so that's fine. However, in aiida-shell, and when constructing a WorkGraph, we chain Processes together and explicitly pass input / output data between them. In the current version of the graph, edges are created between processes and data nodes. Thus, the direct connections between processes are only implicitly contained, whereas for constructing the AiiDA workflow, we also need the explicit connections between the processes. Will work on this now.
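To illustrate what "explicit connections between the processes" could mean, here is a minimal sketch that derives task-to-task links from task/data edges. It is not the existing implementation: the function name `task_links`, the plain-tuple edge format, and the task / data names are all assumptions made for illustration.

```python
from collections import defaultdict


def task_links(edges, tasks):
    """Derive (producer task, data, consumer task) triples.

    `edges` is an iterable of (source, target) name pairs that connect
    tasks to data or data to tasks; `tasks` is the set of task names.
    Both formats are hypothetical, chosen only for this sketch.
    """
    producer = {}                  # data name -> task that writes it
    consumers = defaultdict(list)  # data name -> tasks that read it
    for src, dst in edges:
        if src in tasks:
            producer[dst] = src        # edge: task -> data
        else:
            consumers[src].append(dst)  # edge: data -> task

    links = []
    for data, prod in producer.items():
        for cons in consumers.get(data, []):
            links.append((prod, data, cons))
    return links


# Hypothetical example graph: "initial" has no producing task,
# so it does not yield a task-to-task link.
edges = [
    ("preproc", "grid"),
    ("grid", "icon"),
    ("initial", "icon"),
    ("icon", "stream_1"),
    ("stream_1", "postproc"),
]
links = task_links(edges, tasks={"preproc", "icon", "postproc"})
```

Data without a producing task (such as `initial` here) would correspond to inputs that already exist before the workflow starts.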

Information on the individual data and tasks is stored under WcGraph.data and WcGraph.tasks, respectively. These instance attributes are set up as dictionaries with the names of the tasks / data entities as top-level keys, and the relevant datetime objects as second-level keys (see also issue #4). What is missing from this representation, however, are the actual input / output dependencies of the concrete tasks. These seem to be generated on the fly through _add_edges_from_cycle for each "cycle". What would be useful to have is for each task a list of actual concrete inputs and outputs that are supposed to be attached to it (concrete meaning here not just "stream 1", but "stream 1" with the respective datetime attached to it).
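As a rough sketch of what such a per-task list of concrete inputs and outputs could look like: the class names `ConcreteData` and `ConcreteTask`, their fields, and the dates below are hypothetical and not part of the current code.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from datetime import datetime


@dataclass(frozen=True)
class ConcreteData:
    """A data entity pinned to one datetime, e.g. "stream 1" at 2026-01-01."""
    name: str
    date: datetime


@dataclass
class ConcreteTask:
    """A task pinned to one datetime, with its concrete inputs and outputs."""
    name: str
    date: datetime
    inputs: list[ConcreteData] = field(default_factory=list)
    outputs: list[ConcreteData] = field(default_factory=list)


# Hypothetical example: the "icon" task of one cycle reads the restart
# file written by the previous cycle and writes "stream 1" for its own date.
icon = ConcreteTask(
    name="icon",
    date=datetime(2026, 2, 1),
    inputs=[ConcreteData("restart", datetime(2026, 1, 1))],
    outputs=[ConcreteData("stream 1", datetime(2026, 2, 1))],
)
```

Because `ConcreteData` is frozen, instances are hashable and compare by value, so the same (name, date) pair appearing as an output of one task and an input of another can be matched directly.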

I'm wondering if, for passing the relevant data to AiiDA WorkGraph, it might make our lives easier to generate such an intermediate representation that fully defines all concrete dependencies. Eventually, we'll need to pass concrete (AiiDA) nodes, so we have to resolve dates and dependencies (currently, dates are resolved in WcGraph.data and WcGraph.tasks, as mentioned above, but in the _add_edges_from_cycle method, I still see lags being passed around). So one could actually: 1) Go once through the full workflow definition, 2) Generate the concrete intermediate representation with resolved dates and dependencies. These could be either (string) references, such as inputs=[f"{WcData.name}-{WcData.start_date}", ...], or even already point to unstored AiiDA data nodes, if feasible, and finally 3) Generate the concrete WorkGraph from this intermediate representation.
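Steps 1) and 2) could be sketched roughly as follows, using string references. Everything here is an assumption for illustration: the `unroll` function, the shape of `spec`, and the choice to drop inputs whose lagged date falls outside the workflow's date range.

```python
from datetime import datetime, timedelta


def unroll(spec, start, end, period):
    """Resolve lags into concrete "name-date" references for each cycle date.

    `spec` maps a task name to {"inputs": [(data name, lag), ...],
    "outputs": [data name, ...]}; this format is hypothetical.
    """
    def ref(name, date):
        return f"{name}-{date:%Y-%m-%d}"

    resolved = {}
    date = start
    while date < end:
        for task, io in spec.items():
            resolved[ref(task, date)] = {
                # Assumption: inputs lagging outside [start, end) are dropped.
                "inputs": [
                    ref(name, date + lag)
                    for name, lag in io["inputs"]
                    if start <= date + lag < end
                ],
                "outputs": [ref(name, date) for name in io["outputs"]],
            }
        date += period
    return resolved


# Hypothetical workflow: "icon" reads the restart file of the previous day.
spec = {
    "icon": {
        "inputs": [("restart", timedelta(days=-1))],
        "outputs": ["restart", "stream_1"],
    }
}
resolved = unroll(
    spec,
    start=datetime(2026, 1, 1),
    end=datetime(2026, 1, 3),
    period=timedelta(days=1),
)
```

Step 3) would then walk over `resolved` and create one process per key, wiring each input reference to the output of the same name; no lags are left to interpret at that point.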

It might still be possible to build up the WorkGraph on the fly (I currently instantiate it as an instance attribute of WcGraph) by adding the relevant code to the _add_edges_from_cycle method. _add_nodes_from_cycle (or a similar method) will likely not be used when building up the WorkGraph, as we don't consider isolated nodes but processes with input / output data nodes attached to them. I think the current approach differs from how we would construct the WorkGraph: 1) Visualization PoC: Currently, when generating the graphical representation, individual nodes are added and edges are then drawn between them, separately for each cycle, with a subgraph added for each "cluster" (an individual task of a "cycle", e.g. icon; these seem to be necessary only for the shading, though, as all connections in the graph are present even without the subgraphs). 2) AiiDA WorkGraph: As mentioned above, for the WorkGraph, we might resolve the entire workflow / graph first and then build it up using concrete nodes. I'm not sure if / how this would already work with AiiDA nodes, as intermediate data nodes would have to be anticipated SinglefileData nodes that do not exist yet.

I haven't thought this through fully, so this is mainly to capture these ideas. I hope it makes some sense. Pinging @agoscinski for information.

leclairm commented 5 months ago

I think we'd rather need a call for this one.

agoscinski commented 1 month ago

With the core.py PR #19 and the unrolling of the cycles, we do this now. Do you think we can close this issue, @GeigerJ2, or is there still something you mentioned that is not implemented?

C2SM / Sirocco

Intermediate representation containing concrete input / output dependencies of tasks #6