Closed sphuber closed 6 years ago
Dear All, this new topic raises a question I also had in mind:
what is the best way to achieve a modular structure for workflows?
For example, if the same SCF cycle is called through a workflow outline,
there should be a way of abstracting its call.
Based on existing examples, two approaches come to mind: either procedurally
build `spec.outline`, or submit some workflow objects (classes?)
from within others. In practice, while developing a plugin for Siesta,
we (Alberto and I) found the implementations of both these
approaches unstable. Since we are dealing with a very high level of abstraction here, a clear
specification of how to do this is needed.
Note: This is a continuation of the discussion on PR #544.
@vdikan, can you explain your use case in a bit more detail? Are you trying to use the same SCF sub-workflow in different workflows, or different SCF sub-workflows in the same workflow?
The `expose` functionality should allow for easily wrapping sub-workflows, by writing `spec.expose_inputs(SubWorkflow)` in the outline, and then `submit(SubWorkflow, **self.exposed_inputs(SubWorkflow))` in the WF step.
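As a rough sketch of that pattern in plain Python (this is not the actual aiida-core API; the port names and the two helper functions here are invented for illustration), exposing amounts to copying the sub-workflow's input ports onto the parent's spec and later picking out the matching inputs at submit time:

```python
# Simplified, hypothetical model of expose_inputs / exposed_inputs.
# The real AiiDA spec works on Port objects; plain sets and dicts stand in here.

SUB_WORKFLOW_PORTS = {'structure', 'parameters', 'code'}  # invented example ports

def expose_inputs(parent_ports, sub_ports):
    """Declare the sub-workflow's input ports on the parent spec."""
    return parent_ports | sub_ports

def exposed_inputs(parent_inputs, sub_ports):
    """Select the parent's inputs that should be forwarded to the sub-workflow."""
    return {name: value for name, value in parent_inputs.items() if name in sub_ports}

# The parent spec now accepts its own inputs plus the sub-workflow's:
parent_ports = expose_inputs({'max_iterations'}, SUB_WORKFLOW_PORTS)

# At submit time, only the sub-workflow's share of the inputs is forwarded:
inputs = {'structure': 'Si', 'parameters': {'mesh': 100}, 'code': 'siesta', 'max_iterations': 5}
forwarded = exposed_inputs(inputs, SUB_WORKFLOW_PORTS)
```

The point is that the wrapping workflow never has to re-declare the sub-workflow's ports by hand, so it stays in sync when the sub-workflow changes.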
Another requirement for modular WFs is that you can easily change which sub-WF is actually called. For example, you might want to change which code calculates some value, but then the rest of the workflow remains the same. For this purpose I created #640, so that you can give a workflow class as input.
I'm planning to write up how modular WFs are done in my tight-binding extraction project. If / when all the features for that are in `develop`, we can also put that in the documentation.
@sphuber Another note on the "wrapping" (namespace) of input variables: I found some cases where it's quite desirable to have both "wrapped" and "unwrapped" inputs, from the same Sub-WF. The reason for this is that there are some inputs (e.g. the structure) which are shared among many WFs, and others which are specific to one.
In my "working" branch, I implemented it such that having multiple `expose` / `inherit_inputs` statements on the same Sub-WF works, and then `inherited_inputs` gives you all the inputs associated with that Sub-WF.
@greschd I guess all that you mention is needed, as elementary workflows tend to be "building blocks" for more sophisticated ones.
Consider such a straightforward example:
In the case of Siesta, the `siesta` executable and the STM-imaging utility are registered as two different codes in AiiDA. The `RelaxWorkflow` (which re-runs SCF cycles if needed, performs geometry relaxation and checks for basic errors) is a subprocess of the majority of workflows done with Siesta, and `STMWorkflow` is also a standard workflow that in all cases takes the resulting folder of a relaxation done with Siesta.
So, a few things come to mind:
1. `RelaxWorkflow` should be easily re-usable everywhere.
2. The `GenerateSTM` wf should look like little more than imports and calls to `RelaxWorkflow` and `STMWorkflow`, with as few extra lines as possible.
3. When generating the graph of the `GenerateSTM` wf (see 3150.pdf), it would be great to be able to output it somewhat as:
Inputs -> RelaxWorkflow -> STMWorkflow -> Outputs
Ok, I think these changes (together with #640) can help you with 1. and 2. I've never looked into the graph generation, so I wouldn't know how hard this would be to do.
@greschd Thanks a lot! Where can one see an example of this workflow design implemented? For now we use patterns from tutorial and docs.
@vdikan That code is currently in our group's gitlab instance. I'm planning on publishing it / moving it to github some time in September. In the meantime I can privately send you some part that should show the workflow design principles.
@sphuber Another feature that would be great is to not only have an `exclude` option (inputs which are not exposed), but also an `only` option (only take these inputs). A good use case for this is when you expose parts at the global level (structure, etc.) and the rest within the "namespace"; then you can just write
global_subwf_inputs = ['structure']
spec.expose_inputs(SubWF, only=global_subwf_inputs)
spec.expose_inputs(SubWF, namespace='sub_wf', exclude=global_subwf_inputs)
instead of
spec.expose_inputs(SubWF, exclude=['all', 'the', 'other', 'inputs'])
spec.expose_inputs(SubWF, namespace='sub_wf', exclude=['structure'])
which is an anti-pattern (that I've found myself using) because every time `SubWF` adds an input you have to go through all the Workflows that wrap it.
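The difference between the two options can be sketched in a few lines of plain Python (a hypothetical model of the filtering, not the real `spec.expose_inputs` signature; the port names are invented):

```python
# Hypothetical sketch of `only` vs `exclude` filtering over a set of port names.

def select_ports(ports, only=None, exclude=()):
    """Return the port names that an expose call with these options would take."""
    if only is not None:
        return {p for p in ports if p in only}
    return {p for p in ports if p not in exclude}

sub_wf_ports = {'structure', 'kpoints', 'parameters'}
global_subwf_inputs = ['structure']

# `only` pins down the top-level selection, so it stays correct
# even if SubWF later grows new ports:
top_level = select_ports(sub_wf_ports, only=global_subwf_inputs)

# everything else goes into the namespaced selection:
namespaced = select_ports(sub_wf_ports, exclude=global_subwf_inputs)
```

Because the same list drives both calls, the two selections always partition the sub-workflow's ports with no manual bookkeeping.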
@sphuber You wrote "expose inputs/outputs" in the title: I think we haven't talked about outputs before, but it seems like a great idea.
Is there a function to simultaneously do `self.out` for multiple output nodes? If so (or if we add it), the syntax for outputs could be something like
spec.expose_outputs(SubWF, ...) # same kwargs as for inputs
and in the step
self.out_many(**self.exposed_outputs(sub_wf_instance))
where `sub_wf_instance` is the `WorkCalculation` you get from the `ToContext`. The type (`SubWF`) can be automatically detected from the instance.
It's not quite clear to me whether the `namespace` keyword also makes sense in this context. Do you plan to also allow the "nested dict" structures in the output, or only for inputs?
No we didn't, but when writing the implementation I realized that when we had been discussing `spec.expose` we really meant `spec.expose_inputs`. That got me thinking whether the same concept could be abstracted to the output ports and whether that made sense. One could think of `spec.expose_outputs` as instructing the workchain to also automatically attach all non-dynamic output ports to itself. I think for high-level workchains this makes a lot of sense. It is the subworkchains that create the real outputs, but the end user does not necessarily want to have to go digging for those.
Originally I was thinking that when defining `spec.expose_outputs(SubWF)`, the workchain would automatically add the links at the end of execution. So there would not even be a need to explicitly add this in the step itself. I have not really thought this through yet, nor whether this is something that is possible or that we want.
Regarding the namespace, it might make sense to then use that namespace as a prefix when generating the output return links. For example
class SubWF(WorkChain):
    def define(cls, spec):
        spec.output('structure', valid_type=Structure)

class WF(WorkChain):
    def define(cls, spec):
        spec.expose_outputs(SubWF, namespace='relax')

# Outputs of the `WF` workchain would then automatically look like
{
    'relax_structure': Structure
}
What do you think? Do you see any problems here?
Yeah, I think that makes a lot of sense. I'm a bit concerned about automatically attaching the outputs. How do we determine which sub-workflow to actually take the outputs from? We could go through the CALL links, but what happens if there's more than one such sub-workflow?
Something like this could be problematic:
in define:
spec.expose_inputs(SubWF, exclude=['a'])
spec.expose_outputs(SubWF)
and in the step
return ToContext(
    sub_1=submit(SubWF, a=Str('something'), **self.exposed_inputs(SubWF)),
    sub_2=submit(SubWF, a=Str('something else'), **self.exposed_inputs(SubWF))
)
I think we would want something like
self.out_many(**self.exposed_outputs(self.ctx.sub_1, namespace='sub_1'))
self.out_many(**self.exposed_outputs(self.ctx.sub_2, namespace='sub_2'))
So, maybe we need the option to "override" the namespace when actually attaching the outputs?
Automatically adding the outputs (if there's only a single sub-wf of this type) could be an option given to the `spec.expose_outputs`? Or alternatively you could write in the spec which context item to take the output from... but that restricts us to cases where the number of sub-WFs is known at `define`-time.
Yeah, you are right about the automatic adding. If you run multiple instances of the same sub workchain it becomes unclear which outputs to automatically link. So in your example:
self.out_many(**self.exposed_outputs(self.ctx.sub_1, namespace='sub_1'))
the `exposed_outputs` method takes an actual `Process` instance and not the class, and the namespace will be used, if passed, to prefix the links. That seems to make sense, although then the concept of a namespace in the `spec.expose_outputs` makes no sense at all, right?
Actually no, it does:
spec.expose_outputs(SubWF, exclude=('structure',), namespace='bands')
spec.expose_outputs(SubWF, exclude=('bandstructure',), namespace='relax')
The output calls:
self.out_many(**self.exposed_outputs(self.ctx.subwf_bands, namespace='bands'))
self.out_many(**self.exposed_outputs(self.ctx.subwf_relax, namespace='relax'))
will yield the following outputs:
WF.out = {
'bands_parameters': ParameterData,
'bands_bandstructure': BandsData,
'relax_parameters': ParameterData,
'relax_structure': StructureData,
}
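The prefixing behaviour described above can be modelled in a few lines (an illustrative sketch that mirrors the example outputs, not the real AiiDA implementation; `prefixed_outputs` is an invented helper and the node types are stand-in strings):

```python
# Sketch of how a namespace could prefix the exposed output link names.

def prefixed_outputs(sub_outputs, namespace=None):
    """Return the sub-workflow's outputs, link names prefixed with the namespace."""
    if namespace is None:
        return dict(sub_outputs)
    return {'{}_{}'.format(namespace, name): node for name, node in sub_outputs.items()}

# Outputs produced by the two sub-workchain runs (stand-in values):
bands_outputs = {'parameters': 'ParameterData', 'bandstructure': 'BandsData'}
relax_outputs = {'parameters': 'ParameterData', 'structure': 'StructureData'}

# The wrapping workchain attaches both sets under distinct namespaces,
# so the identically named 'parameters' outputs do not collide:
wf_out = {}
wf_out.update(prefixed_outputs(bands_outputs, namespace='bands'))
wf_out.update(prefixed_outputs(relax_outputs, namespace='relax'))
```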
Well, my initial idea was to still have it in `spec.expose_outputs`, but just allowing it to also be set (or overridden) after the fact. I'm a bit torn on this: on one hand I think it's better to have only one way of doing things, but then again it might actually be nicer for the common case to put it in `define`.
Yeah, that makes sense.. but this raises another question: what should we do if the names of the outputs are determined only at runtime? Say you want to run N instances of some workchain, and put the output for each in a separate "namespace". In the spec you can't really have all outputs because it's not known at `define`-time, but there should still be an easy way to forward the outputs.
Maybe there should be a separate keyword for this kind of thing.. or a special placeholder/dynamic namespace.
spec.expose_outputs(SubWF, namespace=DynamicNamespace)
and then
self.out_many(**self.exposed_outputs(self.ctx.subwf_1, namespace='sub_1'))
...
self.out_many(**self.exposed_outputs(self.ctx.subwf_n, namespace='sub_n'))
which gives
WF.out = {
'sub_1_parameters': ParameterData,
'sub_1_bandstructure': BandsData,
'sub_1_structure': StructureData,
....
'sub_n_parameters': ParameterData,
'sub_n_bandstructure': BandsData,
'sub_n_structure': StructureData,
}
So there are sort of two ways in which we want to use "namespace" in the `exposed_outputs`: one is to restrict the outputs that are taken to the ones which are in that namespace, and the other is more like a `prefix`, where you create the "namespace" at runtime.
So maybe this:
spec.expose_outputs(SubWF, with_prefix=True)
and then
self.out_many(**self.exposed_outputs(self.ctx.subwf_1, prefix='sub_1'))
...
self.out_many(**self.exposed_outputs(self.ctx.subwf_n, prefix='sub_n'))
I don't really know what should happen to the `spec` in this case, but it's desirable to have this kind of thing just for simplifying the `self.out` calls.
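The runtime-prefix idea can be sketched as a loop over results whose count is only known at runtime (illustrative plain Python, not the real API; `prefixed_outputs`, the prefixes and the stand-in node values are invented):

```python
# Sketch: N sub-workflow results, unknown at define-time, each forwarded
# under its own runtime-generated prefix.

def prefixed_outputs(sub_outputs, prefix):
    """Rename a sub-workflow's outputs by prepending the given prefix."""
    return {'{}_{}'.format(prefix, name): node for name, node in sub_outputs.items()}

# Outputs of two sub-workchain runs (stand-in values; could be any N):
sub_results = [
    {'parameters': 'ParameterData_1', 'structure': 'StructureData_1'},
    {'parameters': 'ParameterData_2', 'structure': 'StructureData_2'},
]

# The wrapping workchain forwards every instance under 'sub_<i>_...':
wf_out = {}
for i, outputs in enumerate(sub_results, start=1):
    wf_out.update(prefixed_outputs(outputs, prefix='sub_{}'.format(i)))
```

Since the prefixes are generated in the loop, none of these link names need to exist on the spec, which is why the dynamic outputs would have to carry them.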
Hmm it is becoming kind of complex now. I went through several iterations of the namespaced ports for the spec and am finishing up on the current one. I will push it soon to my fork so you can already have a look at the way it is setup now.
Yeah I agree... but maybe we don't need to do anything about the ports here? For this `prefix` we could just rely on the `dynamic` outputs.
But yeah, this particular use case is also not the most common / important one right now. And doing it in the way you described above doesn't prohibit adding this feature later on.
The basic functionality has now been implemented and merged in PR #1170. If improvements or new features are required, new tickets should be opened.
A common design pattern that occurs when developing `WorkChains` is that a certain `WorkChain` is wrapped by another `WorkChain`. As a result, the wrapping `WorkChain` also needs to add the exact same input ports as the `WorkChain` that it will be calling. For a `WorkChain` that wraps many other `WorkChains` this can become quite messy. We need a way for a wrapping `WorkChain` to "expose" the inputs of the sub-`WorkChain` that it calls under the hood.