Closed sphuber closed 6 years ago
Dear All, this new topic raises a question I also had in mind:
what is the best way to achieve a modular structure for workflows?
For example, if the same SCF cycle is called through a workflow outline,
there should be a way of abstracting its call.
Based on existing examples, two approaches come to mind: either procedurally
build `spec.outline`, or submit some workflow objects (classes?)
from within others. In practice, while developing a plugin for Siesta,
we (Alberto and I) found the implementations of both these
approaches unstable. Since we are dealing with a very high level of abstraction here, a clear
specification of how to do this is needed.
Note: This is a continuation of the discussion on PR #544.
@vdikan, can you explain your use case in a bit more detail? Are you trying to use the same SCF sub-workflow in different workflows, or different SCF sub-workflows in the same workflow?
The `expose` functionality should allow for easily wrapping sub-workflows, by writing `spec.expose_inputs(SubWorkflow)` in the outline, and then `submit(SubWorkflow, **self.exposed_inputs(SubWorkflow))` in the WF step.
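As a rough sketch of that pattern in plain Python (this is not the actual aiida-core API; the port names and the two helper functions here are invented for illustration), exposing amounts to copying the sub-workflow's input ports onto the parent's spec and later picking out the matching inputs at submit time:

```python
# Simplified, hypothetical model of expose_inputs / exposed_inputs.
# The real AiiDA spec works on Port objects; plain sets and dicts stand in here.

SUB_WORKFLOW_PORTS = {'structure', 'parameters', 'code'}  # invented example ports

def expose_inputs(parent_ports, sub_ports):
    """Declare the sub-workflow's input ports on the parent spec."""
    return parent_ports | sub_ports

def exposed_inputs(parent_inputs, sub_ports):
    """Select the parent's inputs that should be forwarded to the sub-workflow."""
    return {name: value for name, value in parent_inputs.items() if name in sub_ports}

# The parent spec now accepts its own inputs plus the sub-workflow's:
parent_ports = expose_inputs({'max_iterations'}, SUB_WORKFLOW_PORTS)

# At submit time, only the sub-workflow's share of the inputs is forwarded:
inputs = {'structure': 'Si', 'parameters': {'mesh': 100}, 'code': 'siesta', 'max_iterations': 5}
forwarded = exposed_inputs(inputs, SUB_WORKFLOW_PORTS)
```

The point is that the wrapping workflow never has to re-declare the sub-workflow's ports by hand, so it stays in sync when the sub-workflow changes.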
Another requirement for modular WFs is that you can easily change which sub-WF is actually called. For example, you might want to change which code calculates some value, but then the rest of the workflow remains the same. For this purpose I created #640, so that you can give a workflow class as input.
I'm planning to write up how modular WFs are done in my tight-binding extraction project. If / when all the features for that are in `develop`, we can also put that in the documentation.
@sphuber Another note on the "wrapping" (namespace) of input variables: I found some cases where it's quite desirable to have both "wrapped" and "unwrapped" inputs, from the same Sub-WF. The reason for this is that there are some inputs (e.g. the structure) which are shared among many WFs, and others which are specific to one.
In my "working" branch, I implemented it such that having multiple `expose` / `inherit_inputs` statements on the same Sub-WF works, and then `inherited_inputs` gives you all the inputs associated with that Sub-WF.
@greschd I guess all that you mention is needed, as elementary workflows tend to be "building blocks" for more sophisticated ones.
Consider such a straightforward example:
In the case of Siesta, the `siesta` executable and the STM-imaging utility are registered as two different codes in AiiDA. The `RelaxWorkflow` (which re-runs SCF cycles if needed, performs geometry relaxation and checks for basic errors) is a subprocess of the majority of workflows done with Siesta, and `STMWorkflow` is also a standard workflow that in all cases takes the resulting folder of a relaxation done with Siesta.
So, a few things come to mind:
1. `RelaxWorkflow` should be easily re-usable everywhere.
2. The `GenerateSTM` wf should look like little more than imports and calls to `RelaxWorkflow` and `STMWorkflow`, with as few extra lines as possible.
3. When generating the graph of the `GenerateSTM` wf (see 3150.pdf), it would be great to be able to output it somewhat as:
Inputs -> RelaxWorkflow -> STMWorkflow -> Outputs
Ok, I think these changes (together with #640) can help you with 1. and 2. I've never looked into the graph generation, so I wouldn't know how hard this would be to do.
@greschd Thanks a lot! Where can one see an example of this workflow design implemented? For now we use patterns from tutorial and docs.
@vdikan That code is currently in our group's gitlab instance. I'm planning on publishing it / moving it to github some time in September. In the meantime I can privately send you some part that should show the workflow design principles.
@sphuber Another feature that would be great is to not only have an `exclude` option (inputs which are not exposed), but also an `only` option (only take these inputs). A good use case for this is when you expose parts at the global level (structure, etc.) and the rest within the "namespace"; then you can just write
global_subwf_inputs = ['structure']
spec.expose_inputs(SubWF, only=global_subwf_inputs)
spec.expose_inputs(SubWF, namespace='sub_wf', exclude=global_subwf_inputs)
instead of
spec.expose_inputs(SubWF, exclude=['all', 'the', 'other', 'inputs'])
spec.expose_inputs(SubWF, namespace='sub_wf', exclude=['structure'])
which is an anti-pattern (that I've found myself using) because every time `SubWF` adds an input you have to go through all the Workflows that wrap it.
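The difference between the two options can be sketched in a few lines of plain Python (a hypothetical model of the filtering, not the real `spec.expose_inputs` signature; the port names are invented):

```python
# Hypothetical sketch of `only` vs `exclude` filtering over a set of port names.

def select_ports(ports, only=None, exclude=()):
    """Return the port names that an expose call with these options would take."""
    if only is not None:
        return {p for p in ports if p in only}
    return {p for p in ports if p not in exclude}

sub_wf_ports = {'structure', 'kpoints', 'parameters'}
global_subwf_inputs = ['structure']

# `only` pins down the top-level selection, so it stays correct
# even if SubWF later grows new ports:
top_level = select_ports(sub_wf_ports, only=global_subwf_inputs)

# everything else goes into the namespaced selection:
namespaced = select_ports(sub_wf_ports, exclude=global_subwf_inputs)
```

Because the same list drives both calls, the two selections always partition the sub-workflow's ports with no manual bookkeeping.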
@sphuber You wrote "expose inputs/outputs" in the title: I think we haven't talked about outputs before, but it seems like a great idea.
Is there a function to simultaneously do `self.out` for multiple output nodes? If so (or if we add it), the syntax for outputs could be something like
spec.expose_outputs(SubWF, ...) # same kwargs as for inputs
and in the step
self.out_many(**self.exposed_outputs(sub_wf_instance))
where `sub_wf_instance` is the `WorkCalculation` you get from the `ToContext`. The type (`SubWF`) can be automatically detected from the instance.
It's not quite clear to me whether the `namespace` keyword also makes sense in this context. Do you plan to also allow the "nested dict" structures in the output, or only for inputs?
No we didn't, but when writing the implementation I realized that when we had been discussing `spec.expose` we really meant `spec.expose_inputs`. That got me thinking whether the same concept could be abstracted to the output ports and whether that made sense. One could think of `spec.expose_outputs` as instructing the workchain to also automatically attach all non-dynamic output ports to itself. I think for high-level workchains this makes a lot of sense. It is the subworkchains that create the real outputs, but the end user does not necessarily want to have to go digging for those.
Originally I was thinking that when defining `spec.expose_outputs(SubWF)`, the workchain would automatically add the links at the end of execution. So there would not even be a need to explicitly add this in the step itself. I have not really thought this through yet, nor whether this is something that is possible or that we want.
Regarding the namespace, it might make sense to then use that namespace as a prefix when generating the output return links. For example
class SubWF(WorkChain):
    def define(cls, spec):
        spec.output('structure', valid_type=Structure)

class WF(WorkChain):
    def define(cls, spec):
        spec.expose_outputs(SubWF, namespace='relax')

# Outputs of the `WF` workchain would then automatically look like
{
    'relax_structure': Structure
}
What do you think? Do you see any problems here?
Yeah, I think that makes a lot of sense. I'm a bit concerned about automatically attaching the outputs. How do we determine which sub-workflow to actually take the outputs from? We could go through the CALL links, but what happens if there's more than one such sub-workflow?
Something like this could be problematic:
in define:
spec.expose_inputs(SubWF, exclude=['a'])
spec.expose_outputs(SubWF)
and in the step
return ToContext(
    sub_1=submit(SubWF, a=Str('something'), **self.exposed_inputs(SubWF)),
    sub_2=submit(SubWF, a=Str('something else'), **self.exposed_inputs(SubWF))
)
I think we would want something like
self.out_many(**self.exposed_outputs(self.ctx.sub_1, namespace='sub_1'))
self.out_many(**self.exposed_outputs(self.ctx.sub_2, namespace='sub_2'))
So, maybe we need the option to "override" the namespace when actually attaching the outputs?
Automatically adding the outputs (if there's only a single sub-wf of this type) could be an option given to the `spec.expose_outputs`? Or alternatively you could write in the spec which context item to take the output from... but that restricts us to cases where the number of sub-WFs is known at `define`-time.
Yeah, you are right about the automatic adding. If you run multiple instances of the same sub workchain it becomes unclear which outputs to automatically link. So in your example:
self.out_many(**self.exposed_outputs(self.ctx.sub_1, namespace='sub_1'))
the `exposed_outputs` method takes an actual `Process` instance and not the class, and the namespace will be used, if passed, to prefix the links. That seems to make sense, although then the concept of a namespace in the `spec.expose_outputs` makes no sense at all, right?
Actually no, it does:
spec.expose_outputs(SubWF, exclude=('structure',), namespace='bands')
spec.expose_outputs(SubWF, exclude=('bandstructure',), namespace='relax')
The output calls:
self.out_many(**self.exposed_outputs(self.ctx.subwf_bands, namespace='bands'))
self.out_many(**self.exposed_outputs(self.ctx.subwf_relax, namespace='relax'))
will yield the following outputs:
WF.out = {
'bands_parameters': ParameterData,
'bands_bandstructure': BandsData,
'relax_parameters': ParameterData,
'relax_structure': StructureData,
}
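The prefixing behaviour described above can be modelled in a few lines (an illustrative sketch that mirrors the example outputs, not the real AiiDA implementation; `prefixed_outputs` is an invented helper and the node types are stand-in strings):

```python
# Sketch of how a namespace could prefix the exposed output link names.

def prefixed_outputs(sub_outputs, namespace=None):
    """Return the sub-workflow's outputs, link names prefixed with the namespace."""
    if namespace is None:
        return dict(sub_outputs)
    return {'{}_{}'.format(namespace, name): node for name, node in sub_outputs.items()}

# Outputs produced by the two sub-workchain runs (stand-in values):
bands_outputs = {'parameters': 'ParameterData', 'bandstructure': 'BandsData'}
relax_outputs = {'parameters': 'ParameterData', 'structure': 'StructureData'}

# The wrapping workchain attaches both sets under distinct namespaces,
# so the identically named 'parameters' outputs do not collide:
wf_out = {}
wf_out.update(prefixed_outputs(bands_outputs, namespace='bands'))
wf_out.update(prefixed_outputs(relax_outputs, namespace='relax'))
```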
Well, my initial idea was to still have it in `spec.expose_outputs`, but just allowing it to also be set (or overridden) after the fact. I'm a bit torn on this: on one hand I think it's better to have only one way of doing things, but then again it might actually be nicer for the common case to put it in `define`.
Yeah, that makes sense.. but this raises another question: what should we do if the names of the outputs are determined only at runtime? Say you want to run N instances of some workchain, and put the output for each in a separate "namespace". In the spec you can't really have all outputs because it's not known at `define`-time, but there should still be an easy way to forward the outputs.
Maybe there should be a separate keyword for this kind of thing.. or a special placeholder/dynamic namespace.
spec.expose_outputs(SubWF, namespace=DynamicNamespace)
and then
self.out_many(**self.exposed_outputs(self.ctx.subwf_1, namespace='sub_1'))
...
self.out_many(**self.exposed_outputs(self.ctx.subwf_n, namespace='sub_n'))
which gives
WF.out = {
'sub_1_parameters': ParameterData,
'sub_1_bandstructure': BandsData,
'sub_1_structure': StructureData,
....
'sub_n_parameters': ParameterData,
'sub_n_bandstructure': BandsData,
'sub_n_structure': StructureData,
}
So there are sort of two ways in which we want to use "namespace" in the `exposed_outputs`: one is to restrict the outputs that are taken to the ones which are in that namespace, and the other is more like a `prefix`, where you create the "namespace" at runtime.
So maybe this:
spec.expose_outputs(SubWF, with_prefix=True)
and then
self.out_many(**self.exposed_outputs(self.ctx.subwf_1, prefix='sub_1'))
...
self.out_many(**self.exposed_outputs(self.ctx.subwf_n, prefix='sub_n'))
I don't really know what should happen to the `spec` in this case, but it's desirable to have this kind of thing just for simplifying the `self.out` calls.
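The runtime-prefix idea can be sketched as a loop over results whose count is only known at runtime (illustrative plain Python, not the real API; `prefixed_outputs`, the prefixes and the stand-in node values are invented):

```python
# Sketch: N sub-workflow results, unknown at define-time, each forwarded
# under its own runtime-generated prefix.

def prefixed_outputs(sub_outputs, prefix):
    """Rename a sub-workflow's outputs by prepending the given prefix."""
    return {'{}_{}'.format(prefix, name): node for name, node in sub_outputs.items()}

# Outputs of two sub-workchain runs (stand-in values; could be any N):
sub_results = [
    {'parameters': 'ParameterData_1', 'structure': 'StructureData_1'},
    {'parameters': 'ParameterData_2', 'structure': 'StructureData_2'},
]

# The wrapping workchain forwards every instance under 'sub_<i>_...':
wf_out = {}
for i, outputs in enumerate(sub_results, start=1):
    wf_out.update(prefixed_outputs(outputs, prefix='sub_{}'.format(i)))
```

Since the prefixes are generated in the loop, none of these link names need to exist on the spec, which is why the dynamic outputs would have to carry them.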
Hmm it is becoming kind of complex now. I went through several iterations of the namespaced ports for the spec and am finishing up on the current one. I will push it soon to my fork so you can already have a look at the way it is setup now.
Yeah I agree... but maybe we don't need to do anything about the ports here? For this `prefix` we could just rely on the `dynamic` outputs.
But yeah, this particular use case is also not the most common / important one right now. And doing it in the way you described above doesn't prohibit adding this feature later on.
The basic functionality has now been implemented and merged in PR #1170. If improvements or new features are required, new tickets should be opened.
A common design pattern that occurs when developing `WorkChains` is that a certain `WorkChain` is wrapped by another `WorkChain`. As a result, the wrapping `WorkChain` also needs to add the exact same input ports as the `WorkChain` that it will be calling. For a `WorkChain` that wraps many other `WorkChains` this can become quite messy. We need a way for a wrapping `WorkChain` to "expose" the inputs of the sub-`WorkChain` that it calls under the hood.