Closed GeigerJ2 closed 1 month ago
I think we need a call for this one.
With the `core.py` PR #19 and the unrolling of the cycles, we do this now. Do you think we can close this issue, @GeigerJ2, or is something you mentioned still not implemented?
EDIT: As of now (2024-06-06), the input / output file dependencies are correctly captured as AiiDA data nodes inside each `WcTask`, so that's fine. However, in `aiida-shell` and to construct a `WorkGraph`, we chain Processes together, while explicitly passing data input / output between them. In the current version of the graph, edges are created between processes and data nodes. Thus, the direct connections between processes are implicitly contained, but for constructing the AiiDA workflow, we must also obtain the explicit connections between the processes. Will work on this now.

Information on the individual
`data` and `tasks` is stored under `WcGraph.data` and `WcGraph.tasks`, respectively. These instance attributes are set up as dictionaries with the names of the tasks / data entities as top-level keys, and the relevant `datetime` objects as second-level keys (see also issue #4). What is missing from this representation, however, are the actual input / output dependencies of the concrete tasks. These seem to be generated on the fly through `_add_edges_from_cycle` for each "cycle". What would be useful to have is, for each task, a list of the actual concrete inputs and outputs that are supposed to be attached to it (concrete meaning here not just "stream 1", but "stream 1" with the respective datetime attached to it).

I'm wondering if, for passing the relevant data to AiiDA WorkGraph, it might make our lives easier to generate such an intermediate representation that fully defines all concrete dependencies. Eventually, we'll need to pass concrete (AiiDA) nodes, so we have to resolve dates and dependencies (currently, dates are resolved in
`WcGraph.data` and `WcGraph.tasks`, as mentioned above, but in the `_add_edges_from_cycle` method, I still see `lag`s being passed around). So one could actually: 1) Go once through the full workflow definition, 2) Generate the concrete intermediate representation with resolved dates and dependencies. These could be either (string) references, such as `inputs=[f"{WcData.name}-{WcData.start_date}", ...]`, or even already point to unstored AiiDA data nodes, if feasible, and finally 3) Generate the concrete `WorkGraph` from this intermediate representation.

It might still be possible to build up the
`WorkGraph` on the fly (I currently instantiate it as an instance attribute of `WcGraph`), by adding the relevant code in the `_add_edges_from_cycle` method. `_add_nodes_from_cycle` (or a similar method) will likely not be used when building up the `WorkGraph`, as we don't consider isolated nodes, but instead processes with input / output data nodes attached to them. I think the current approach is different from how we would construct the WorkGraph: 1) Visualization PoC: Currently, when generating the graphical representation, individual nodes are added and then edges are drawn between them, for each cycle separately, and subgraphs are added for each "cluster" (individual task of a "cycle", e.g. `icon`; though these seem to be necessary only for the shading, as all connections in the graph are present even without adding the subgraphs). 2) AiiDA WorkGraph: As mentioned above, for the WorkGraph, we might resolve the entire workflow / graph, and then build it up using concrete nodes. I'm not sure if / how this would work with AiiDA nodes already, as intermediate data nodes would have to be non-existent, anticipated `SinglefileData` nodes.

I haven't thought this through fully, so this is mainly to capture these ideas. I hope it makes some sense. Pinging @agoscinski for information.
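
To make the idea of the intermediate representation a bit more tangible, here is a minimal sketch in plain Python, without any AiiDA imports. All names in it (`ConcreteData`, `ConcreteTask`, `unroll`, `task_edges`, and the `restart` data entity) are hypothetical and not part of the current code base. It assumes that a lag can be expressed as a `timedelta` relative to the cycle date, which may not match how `lag` is actually handled in `_add_edges_from_cycle`. The sketch resolves the dates for each task, and then derives the explicit task-to-task connections from the data entities they share.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta


@dataclass(frozen=True)
class ConcreteData:
    """Hypothetical: one data entity with its date already resolved."""
    name: str
    date: datetime

    @property
    def key(self) -> str:
        # String reference in the spirit of f"{name}-{date}"
        return f"{self.name}-{self.date:%Y-%m-%d}"


@dataclass
class ConcreteTask:
    """Hypothetical: one task instance with resolved inputs and outputs."""
    name: str
    date: datetime
    inputs: list = field(default_factory=list)
    outputs: list = field(default_factory=list)

    @property
    def key(self) -> str:
        return f"{self.name}-{self.date:%Y-%m-%d}"


def unroll(task_name, input_specs, output_names, dates):
    """Create one ConcreteTask per cycle date.

    input_specs is a list of (data name, lag) pairs, where lag is a
    timedelta relative to the cycle date (an assumption of this sketch).
    """
    return [
        ConcreteTask(
            name=task_name,
            date=date,
            inputs=[ConcreteData(n, date + lag) for n, lag in input_specs],
            outputs=[ConcreteData(n, date) for n in output_names],
        )
        for date in dates
    ]


def task_edges(tasks):
    """Derive explicit task-to-task edges from shared data entities.

    Returns (producer task key, consumer task key, data key) triples.
    Inputs without a producer are treated as external and skipped.
    """
    producer = {out.key: t.key for t in tasks for out in t.outputs}
    return [
        (producer[inp.key], t.key, inp.key)
        for t in tasks
        for inp in t.inputs
        if inp.key in producer
    ]


# Two cycles of an "icon" task that reads the previous cycle's restart file.
dates = [datetime(2024, 1, 1), datetime(2024, 2, 1)]
tasks = unroll(
    "icon",
    input_specs=[("restart", timedelta(days=-31))],
    output_names=["restart", "stream_1"],
    dates=dates,
)
edges = task_edges(tasks)
print(edges)
```

In this toy example, the second `icon` cycle is connected to the first one through the `restart` entity dated 2024-01-01, while the restart input of the first cycle has no producer and is skipped. The resulting triples would then be the information needed to link the outputs of one process to the inputs of the next when building the `WorkGraph`.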