Closed danielhollas closed 1 year ago
Thanks for the detailed report @danielhollas . I am a bit confused as well. I am pretty sure I know where the error comes from, however, I don't understand how 0.21.6 would have broken that. Is it possible that it didn't show up with 0.21.5 simply because the other bug that was fixed in 0.21.6 hit before the issue you report here?
The error message you see comes from the validator that is set on the top-level input namespace of the base CalcJob
https://github.com/aiidateam/aiida-core/blob/9de3f90a157ec7c9f99c1f17f3ca3548f6d96bba/aiida/engine/processes/calcjobs/calcjob.py#L42
The code
input is not required (on the port declaration itself), but the validator will check that it has been defined if there is also no remote_folder
input, which would be a calcjob import run and is the only case where a code
input is not necessary. In your workchain, you exclude the code
input, but the validator on the CalcJob
top-level inputs remains. This is a problem that has cropped up before and we don't have a better solution for this now.
The design problem boils down to this:
The problem is that it is impossible (practically) to automatically correct validators, or even detect if they would be affected by its namespace being exposed with only part of its ports, which would allow us maybe to disable it.
For now, the pragmatic solution has been for validators to account for ports being disabled. The CalcJob.validator
does this in fact, where it checks whether the code
and metadata.computer
ports are actually still there:https://github.com/aiidateam/aiida-core/blob/main/aiida/engine/processes/calcjobs/calcjob.py#L54-L59
If not, it will skip the validation.
I think the conclusion then, is that there simply is a bug in the CalcJob
validator:https://github.com/aiidateam/aiida-core/blob/9de3f90a157ec7c9f99c1f17f3ca3548f6d96bba/aiida/engine/processes/calcjobs/calcjob.py#L69
It makes no sense to try and get the code
input and catch the exception if it doesn't exist and set it to None
, to only proceed and attempt to get the computer
, which will always result in the AttributeError
that you hit. I think in this case, there should also be a check that the namespace still has a code
input, and only then try to get it from the values. Otherwise, skip validation.
I will transfer the issue and open a PR in aiida-core
to see if that fixes it
Is it possible that it didn't show up with 0.21.5 simply because the other bug that was fixed in 0.21.6 hit before the issue you report here?
I am sure because I have patched my workflow against the boolean issue when testing this. Also there really were not many other changes between 0.21.3 (which I was using until now) and 0.21.6.
The problem is that it is impossible (practically) to automatically correct validators, or even detect if they would be affected by its namespace being exposed with only part of its ports, which would allow us maybe to disable it.
Yeah, that's tough.
I should maybe clarify since it is not clear from the traceback, that in my particular example, the error actually is triggered when I am submitting the workchain, not the CalcJob itself. So it looks like the input ports are validated recursively when the workchain is submitted, which is where the problem is because at that point not all the CalcJob ports are present. But I assume the same validator is then run again when the actual CalcJob is submitted.
I should maybe clarify since it is not clear from the traceback, that in my particular example, the error actually is triggered when I am submitting the workchain, not the CalcJob itself. So it looks like the input ports are validated recursively when the workchain is submitted, which is where the problem is because at that point not all the CalcJob ports are present.
This is by design though and therefore expected. That is the reason the exclude
key was added to expose_outputs
to remove ports that won't be defined at launch time of the wrapping workchain.
But I assume the same validator is then run again when the actual CalcJob is submitted.
Indeed, but then it will be validated independent of the context of the wrapping workchain. So if the wrapper excluded a certain port, when you launch the actual sub class in that workchain, the original spec is used including the port that was excluded.
I had a closer look and am not sure why your example is failing. I created this MWE that should reproduce your use-case:
#!/usr/bin/env python
from plumpy import Process
class SubProcess(Process):
@classmethod
def define(cls, spec):
super().define(spec)
spec.input('a')
spec.input('b')
spec.inputs.validator = cls.validate_inputs
@classmethod
def validate_inputs(cls, value, ctx):
try:
ctx.get_port('a')
except ValueError:
print('a not in ctx')
return None
if not isinstance('a', str):
return f'input `a` should be a str, got: {type(str)}'
class MasterProcess(Process):
@classmethod
def define(cls, spec):
super().define(spec)
spec.expose_inputs(SubProcess, 'sub', exclude=('a',))
spec.input('a')
print(MasterProcess.spec())
inputs = {
'a': 'test',
'sub': {
'b': 'alt'
}
}
print('VALIDATION:', MasterProcess.spec().inputs.validate(inputs))
It passes validation just fine and prints:
{
"inputs": {
"_attrs": {
"default": [],
"dynamic": false,
"help": null,
"required": "True",
"valid_type": "None"
},
"a": {
"name": "a",
"required": "True"
},
"sub": {
"_attrs": {
"default": [],
"dynamic": false,
"help": null,
"required": "True",
"valid_type": "None"
},
"b": {
"name": "b",
"required": "True"
}
}
},
"outputs": {
"_attrs": {
"default": [],
"dynamic": false,
"help": null,
"required": "True",
"valid_type": "None"
}
}
}
a not in ctx
VALIDATION: None
Now as you already noticed, it probably has to do that get_port
will now create the port on the fly if the namespace is dynamic. In my example, it clearly is not dynamic and so the code works as expected. Looking at your code that also seems to be the case, but maybe it is not? It would be very useful if you could run print(OrcaWignerSpectrumWorkChain.spec())
to make sure those exposed namespaces aren't dynamic after all. My suspicion is that they are, because otherwise I wouldn't be able to explain the behavior.
@danielhollas On another note, we had planned to release v2.3.0
today. Since this change was introduced in plumpy
and the latest aiida-core
release specified 0.21.3
that would also automatically install the latest plumpy, it is not v2.3 that is really breaking this. I will therefore go ahead and release anyway. We can then continue investigating this and release a patch for plumpy
or aiida-core
as well, as needed.
Here's the output of print(OrcaWignerSpectrumWorkChain.spec())
. Indeed the namespace is dynamic.
Hmm, the dynamic namespace seems to come out from the OrcaBaseWorkChain for some reason.
Output from w = OrcaWorkflowFactory('orca.base') ; print(w.spec().inputs)
@sphuber here's a modified MWE that triggers the error. The substantive change is that SubProcess
subclasses CalcJob
instead of Process
. This change for some mysterious reason changes the namespaces from expose_inputs
to dynamic. (note that I also had to change/fix the validator to see the problem)
#!/usr/bin/env python
from plumpy import Process
from aiida.engine import CalcJob
class SubProcess(CalcJob):
@classmethod
def define(cls, spec):
super().define(spec)
spec.input('a')
spec.input('b')
spec.inputs.validator = cls.validate_inputs
@classmethod
def validate_inputs(cls, inputs, ctx):
try:
ctx.get_port('a')
except ValueError:
print('a not in ctx')
return None
if 'a' not in inputs:
return f'input `a` is missing!'
if 'b' not in inputs:
return f'input `b` is missing!'
a = value['a']
if not isinstance(a, str):
return f'input `a` should be a str, got: {type(a)}'
b = value['b']
if not isinstance(b, str):
return f'input `b` should be a str, got: {type(b)}'
class MasterProcess(Process):
@classmethod
def define(cls, spec):
super().define(spec)
spec.expose_inputs(SubProcess, 'sub', exclude=('a',))
spec.input('a')
print(MasterProcess.spec().inputs)
inputs = {
'a': 'test',
'sub': {
'b': 'alt'
}
}
print('VALIDATION:', MasterProcess.spec().inputs.validate(inputs))
Here's the output
Regardless of why the input namespace is dynamic, this anyway should be fixed right?
From a design point of view, it seems a bit strange that a function called get_port
would have such a big sideeffect of creating new dynamic ports. Should that be handled somewhere more up the stack, or via another function like get_port_or_create_dynamic
.
I looked at other usage of get_port
in aiida-core and the only other one is in exposed_inputs
.
I wonder if other unintended side effects are lurking there?
Hi both, I am reviewing #268 but I am still confused about what actually happened in @danielhollas's case. The MWE is clearly shows the problem comes when the validator calls the get_port
and creates the excluded port as dynamic. But which is such a validator in @danielhollas's process, it calls get_port explicitly?
@unkcpz the particular validator that triggered this is this CalcJob validator in aiida-core.
The problem comes from the following "conflicting" design requirements:
So, since we know that people will likely exclude certain ports of a CalcJob
when exposing it, we can no longer assume in the CalcJob.validate_inputs
validator that all ports are still there when it is called. To workaround this, I changed validators to have the PortNamespace
itself be passed in as the ctx
argument, the last argument of the validator. This can then be used to see if the target port is still there. If it isn't, it means whoever exposed that namespace, removed that port, and so we can no longer validate it. This is why these ctx.get_port
calls are in the validator.
Now the problem was that the change I added to make dynamic output namespaces act recursively, it would also automatically recreate the input port that was excluded, and so now the validator would think it should validate it again, but of course no value was provided and so it failed.
Hope that helps clear things up a bit
@sphuber Thank you for the plumpy PR, I am just looking at it.
Unrelated to that, did you manage to figure out why the namespace exposed from a CalcJob class is dynamic?
Also, I think it would be nice to cleanup the CalcJob validator, since I believe there is a logic error at the beginning.
try:
ctx.get_port('code')
ctx.get_port('metadata.computer')
except ValueError:
# If the namespace no longer contains the `code` or `metadata.computer` ports we skip validation
return None
This looks like the validator is never run, unless I provide both 'code' and 'metadata.computer', which is often redundant. I believe the validator should only be skipped if the 'code' post has been excluded, am I missing something?
Unrelated to that, did you manage to figure out why the namespace exposed from a CalcJob class is dynamic?
No, that I haven't looked at yet, since, as you pointed out, that is not the real problem. Might still have a look later, but don't think it is that important for now.
This looks like the validator is never run, unless I provide both 'code' and 'metadata.computer', which is often redundant.
Not quite. The ctx
is the actual PortNamespace
, so that code is checking whether the ports are still there, not the values that are passed for it in the inputs. I think the check should still be there because the purpose is to make sure that if both the code
and metadata.computer
are passed in the inputs, then they should be commensurate. But if either of them are no longer part of the input namespace, then there really isn't anything to validate.
That being said, there is still a minor inconsistency (the line doing computer_from_code = code.computer
after we get the code knowing it may be None
), so I might still open a PR to address it.
Not quite. The ctx is the actual PortNamespace, so that code is checking whether the ports are still there, not the values that are passed for it in the inputs.
Right, fair enough but...
I think the check should still be there because the purpose is to make sure that if both the code and metadata.computer are passed in the inputs, then they should be commensurate. But if either of them are no longer part of the input namespace, then there really isn't anything to validate.
That's not quite true though, there are a bunch of other validations that should be run (e.g. is the computer stored, are the scheduler parameters correct) if code
and code.computer
are provided.
Thanks, both! clear to me now.
Unrelated to that, did you manage to figure out why the namespace exposed from a CalcJob class is dynamic?
I think I have some clue one this. I try to run @sphuber's MWE and got different output as below.
"sub": {
"_attrs": {
"default": [],
"dynamic": true,
"help": null,
"required": "True",
"valid_type": "<class 'aiida.orm.nodes.data.data.Data'>"
},
"b": {
"is_metadata": "False",
"name": "b",
"non_db": "False",
"required": "True"
},
"metadata": {
"_attrs": {
"default": [],
Then I realize I use the Process
class from aiida-core while in the in the case above it is the Process from plumpy. The valid_type
defined in https://github.com/aiidateam/aiida-core/blob/fce1cd6d7da2b0241830cab30a83eb5ab978a16d/aiida/engine/processes/process.py#L119 leads to the discrepancy. Afterward, the code in https://github.com/aiidateam/plumpy/blob/485b196455b848a68c5c8f8bdd2c720dcbc7bf4b/src/plumpy/ports.py#L396-L397 will make it a dynamic port.
So to me, this might be the direct root of the issue, the expose_inputs
has no doubt to be a namespace and therefore make no sense to have a valid_type of orm.Data
anymore.
Why the heck should setting a valid type automatically make the namespace dynamic??
This is asked by @danielhollas offline. @sphuber give me the reason for the initial design, can you put a quick comment here? It makes a lot of sense to me so I approve the PR for this issue.
The point is that if you set a valid_type
on a PortNamespace
, for example, you say "this namespace accepts str
objects", then it makes no sense to still refuse them because the namespace is not dynamic. In my mind, if you say "this namespace accepts Data
nodes" then that clearly signals that you accept any port there as long as its type is Data
. Otherwise, you would simply define explicit ports and set the valid_type
on those ports.
Fixed by #268
While testing my workchain with plumpy 0.21.6 for the recent boolean fix in aiidateam/plumpy#265, I noticed my workchain is broken in another way now (not present in version 0.21.5). I verified that the breakage indeed comes from the second feature introduced in 0.21.6 in aiidateam/plumpy#263.
Specifically, my workchain excepts right after submission, during the input port validation. Here's a detailed traceback.
Traceback
```python-traceback /opt/conda/lib/python3.9/site-packages/plumpy/ports.py in validate_ports(self, port_values, breadcrumbs) 722 """ 723 for name, port in self._ports.items(): --> 724 validation_error = port.validate(port_values.pop(name, UNSPECIFIED), breadcrumbs) 725 if validation_error: 726 return validation_error /opt/conda/lib/python3.9/site-packages/plumpy/ports.py in validate(self, port_values, breadcrumbs) 648 649 # In all other cases, validate all input ports explicitly specified in this port namespace --> 650 validation_error = self.validate_ports(port_values, breadcrumbs_local) 651 if validation_error: 652 return validation_error /opt/conda/lib/python3.9/site-packages/plumpy/ports.py in validate_ports(self, port_values, breadcrumbs) 722 """ 723 for name, port in self._ports.items(): --> 724 validation_error = port.validate(port_values.pop(name, UNSPECIFIED), breadcrumbs) 725 if validation_error: 726 return validation_error /opt/conda/lib/python3.9/site-packages/plumpy/ports.py in validate(self, port_values, breadcrumbs) 664 message = self.validator(port_values_clone) # type: ignore # pylint: disable=not-callable 665 else: --> 666 message = self.validator(port_values_clone, self) # pylint: disable=not-callable 667 if message is not None: 668 assert isinstance(message, str), \ /opt/conda/lib/python3.9/site-packages/aiida/engine/processes/calcjobs/calcjob.py in validate_calc_job(inputs, ctx) 68 69 code = inputs.get('code', None) ---> 70 computer_from_code = code.computer 71 computer_from_metadata = inputs.get('metadata', {}).get('computer', None) 72 AttributeError: 'NoneType' object has no attribute 'computer' ```The error is strange because it comes from calcjob input port validation for input port that should not exist. Specifically, in my workchain I am calling a
aiida-orca
calcjob twice in different contexts, so I am exposing the calcjob inputs into two different namespaces. However, I am excluding thecode
node to prevent duplication, and specify it manually on the workchain itself, and then manually passing it to the calcjob upon submission within the workchain.Here's the workflow
https://github.com/ispg-group/aiidalab-ispg/blob/7bbff87cb1576d43aca8921cfd34825963f3e159/workflows/aiidalab_atmospec_workchain/__init__.py#L93
The relevant snippets
I believe this is a valid approach but please correct me if I am wrong.
I was quite puzzled as to why this should be broken by aiidateam/plumpy#263 because I am not explicitly using dynamic namespaces here, but I guess they are used internally.
Here's what seems to happen. The
get_port
function that was modified in aiidateam/plumpy#263 originally returned ValueError for the code nodes that were not exposed as inputs, but that was fine because in that case the validation was skipped, seehttps://github.com/aiidateam/aiida-core/blob/9de3f90a157ec7c9f99c1f17f3ca3548f6d96bba/aiida/engine/processes/calcjobs/calcjob.py#L54
But now it seems that the missing port is somehow created dynamically, but with
None
value which trips the validation code.Note that as a somewhat separate issue, the uncought exception is particularly confusing, because the code is not guarded against the
None
value, even though it clearly permits it in just the previous line.https://github.com/aiidateam/aiida-core/blob/9de3f90a157ec7c9f99c1f17f3ca3548f6d96bba/aiida/engine/processes/calcjobs/calcjob.py#L69
Not sure if this is something to be fixed in
plumpy
oraiida-core
but I am creating the issue here since the root cause was aplumpy
change in aiidateam/plumpy#263.CC @sphuber
NOTE: In my actual application, I am not calling the abovementioned workchain directly, but from within another, master workchain. Not sure if that is pertinent to the issue.
https://github.com/ispg-group/aiidalab-ispg/blob/7bbff87cb1576d43aca8921cfd34825963f3e159/workflows/aiidalab_atmospec_workchain/__init__.py#L341