Closed mbercx closed 1 year ago
Finally, we switch to using 2 spaces for the indentation of the protocol YAML files to make them easier to read.
I realize that this is always going to be opinion-based, but I actually feel the opposite is true. I find the 2-space indentation very confusing and makes it difficult to see what level of nesting I am looking at. Why would we decide that 4 spaces for code is correct, but not for the configuration? Anyway, since this distracts from this PR and also makes the actual changes to the configs more difficult to spot, could you please remove this for now? Happy to discuss this in a separate issue/PR and if others users also really prefer 2-space, then I won't stop it.
Once the tests are then also running, I will review this. Thanks!
Anyway, since this distracts from this PR and also makes the actual changes to the configs more difficult to spot, could you please remove this for now?
Sure! My preference for 2-space indentations may also be due to the fact that VScode adds vertical lines for the indentation in YAML files. The more compact 2-space indentations make it easier to see settings at a glance for me. But it's not critical.
Note that this PR is built on top of https://github.com/aiidateam/aiida-quantumespresso/pull/903, so is still blocked until we merge that.
Edit: Nevermind, that was kind of unnecessary anyways!
Sure! My preference for 2-space indentations may also be due to the fact that VScode adds vertical lines for the indentation in YAML files. The more compact 2-space indentations make it easier to see settings at a glance for me. But it's not critical.
My editor also shows the tab-lines, but with 2-spaces, it just all merges into one big blob of lines 😅
My editor also shows the tab-lines, but with 2-spaces, it just all merges into one big blob of lines 😅
This may be related to the fact that you use such ridiculously small fonts that only your Dutch-elvish eyes can read 👀
Alright, I also added the restart inputs explicitly, and switched to using paro
for the diagonalization
. She's ready for review now, @sphuber.
Before going into the review, two points that came to mind:
get_builder_from_protocol
. When building the inputs manually and submitting directly, all of this is bypassed. We can still decide that users should simply (always) go through the get_builder_from_protocol
but we should make this super duper clear.PwBandsWorkChain
, but as I understand there are still "problems" downstream. I don't think this PR addresses #877 (not that you marked it as such, but just mentioning this). The warning is issued in the PwCalculation.validate_inputs_base
and is actually incorrect. There are valid cases where you want restart_mode = 'restart'
for an NSCF calculation. We should remove that warning. The PwBaseWorkChain.setup
method is actually still overriding various input parameters. Is that logic still correct? Should that also be done in the protocol/get_builder_from_protocol
? At least for the initialization maybe, the set_restart_type
is still needed for actual restarts within the workchain. Anyway, what I am saying is that we might need to take a look at the whole stack and make sure it is consistent, no?Thanks for your thoughts, @sphuber!
It cannot use logic, and so "dynamic" defaults will still have to be set in code somewhere, that can still end up overriding initial inputs as defined in the protocol.
That's definitely true. We should probably try to report such dynamically set inputs so the user can more easily find them.
The defaults and checks will now only be considered if the user goes through get_builder_from_protocol.
Another good point! I would argue beginners are now immediately introduced to the get_builder_from_protocol()
method when starting to run work chains, and only advanced users will be able to set up the inputs from scratch. I would definitely put default inputs in the protocol, to avoid the issue I mention here. We could consider also setting defaults (i.e. never fully override) inputs in the work chain logic, but then there is some duplicated information...
I'm all for making it super duper clear that the using the get_builder_from_protocol()
is the way to go.
There are valid cases where you want restart_mode = 'restart' for an NSCF calculation.
Are there? In the QE docs on the restart_mode
input:
https://www.quantum-espresso.org/Doc/INPUT_PW.html#idm65
The restart
option clearly states: "Do not use to start a new one, or to perform a non-scf calculations."
The PwBaseWorkChain.setup method is actually still overriding various input parameters. Is that logic still correct?
Haven't looked at these in the context of this PR, but happy to make it more "holistic", i.e. try to implement this principle downstream. Will give a crack at this tomorrow and report back!
One additional thing to note here is that we should really try and come at this from the perspective of the user, and perhaps it would be good to write down as many use cases as we can think of.
I'm all for making it super duper clear that the using the get_builder_from_protocol() is the way to go.
Ok, I'm on board. Let's go in that direction then that protocols should be as complete as possible and get_builder_from_protocol
should be the main recommended entry point for users.
Are there? In the QE docs on the restart_mode input: https://www.quantum-espresso.org/Doc/INPUT_PW.html#idm65 The restart option clearly states: "Do not use to start a new one, or to perform a non-scf calculations."
As I understood it, this is to restart an interrupted NSCF calculation. But reading the docs again, it is a bit ambiguous, and thinking about it now, I am not sure if an NSCF calculation could be performed partially and shutdown cleanly and then restarted to perform the rest. Maybe it can actually do part of all the k-points and then write to disk, and you can restart to do the final set of k-points? Would be interesting to test.
One additional thing to note here is that we should really try and come at this from the perspective of the user, and perhaps it would be good to write down as many use cases as we can think of.
Fully agree, but that is the tricky point here. For novice users, you want to do as much as possible in an automatic way, but also correct inputs that are most likely incorrect. This is where it becomes difficult though, because sometimes there are expert users who do something on purpose and there is nothing more frustrating than the code overriding your desired behavior and not allowing to change it. But in that sense, maybe the get_builder_from_protocol
is the best solution as we can make it as automated as possible for novice users, while power users can set anything on the builder afterwards without the workchain then interfering and double-guessing them.
Maybe it can actually do part of all the k-points and then write to disk, and you can restart to do the final set of k-points? Would be interesting to test.
Well, I stand corrected, you actually can ^^. And this will indeed only work when restart_mode
is set to restart
. So setting this input is a legitimate use case for nscf
and bands
calculations.
I'm still conflicted on what to do with this code though:
Let me go over some use cases to discuss this change and if we want to keep it.
Let me first repeat here the use case you mentioned here that was the reason for adding this code:
This resulted from a case from a user. They launched a calculation and it didn't finish. I pointed them to
get_builder_restart
as an easy way to get a builder that can be launched to restart and finish the calculation. However, as described in the commit message, this would still keepfrom_scratch
and essentially the calculation would be redone and not finish.
A couple of comments here:
clean_workdir
to False
for the initial run. The default for all protocols is True
.remote_folder
from the last PwCalculation
in the failed PwBaseWorkChain
. By testing this, I actually found out the outputs or the wrapped PwCalculation
are not added in case the PwBaseWorkChain
fails. Perhaps this is something we should fix?/home/mbercx/envs/aiida-dev/code/aiida-quantumespresso/src/aiida_quantumespresso/calculations/pw.py:178:
UserWarning: `parent_folder` input was provided for the `PwCalculation`, but no input parameters are set to restart from
these files.
Which I still believe to be a fair warning to have.
Without the code that sets the restart_mode
when a parent_folder
is present, the user would have to add the following code:
parameters = restart_builder.pw.parameters.get_dict()
parameters['CONTROL']['restart_mode'] = 'restart'
restart_builder.pw.parameters = orm.Dict(parameters)
A bit tedious perhaps, but not insurmountable. And the validation warning implemented on the PwCalculation
inputs will point them in this direction (but should perhaps be made more clear).
The user has successfully completed a PwBaseWorkChain
, and they want to perform a calculation on top of this one starting from the charge density. Since from_scratch
is the default setting for restart_mode
, they will simply add the parent_folder
input and set the ELECTRONS.starting_pot
parameter to file
.
By adding the code above, restart_mode
will now be set to restart
and the startingpot
input will be removed. Now, in many cases the QE behaviour will still be as desired. It seems QE will ignore the wave function files in case some parts of the setup have changed (e.g. number of bands/kpoints) or the wave function files are missing (restart from stashed charge density). However, there may be use cases where the restart behaviour is not as the user desired:
The restart calculation only changes occupations == fixed
. In this case it seems QE will restart from both the charge density and wave functions:
The initial density is read from file :
./out/aiida.save/charge-density
starting charge 8.0001, renormalised to 8.0000
Starting wfcs from file
Which may be undesirable, also see this comment.
The restart calculation wants to use a different geometry but start from the charge density (there may be use cases for this). With restart_mode == restart
, the geometry will be read from the previous output, not the new one provided in the inputs:
Atomic positions and unit cell read from directory:
./out/aiida.save/
I can probably think of others. The user could of course explicitly set restart_mode
to from_scratch
to avoid this, but I don't think we can expect a user to have to consider this, since this is the default of restart_mode
in QE.
Many work chains rely on restarts from previous steps in the work chain. The example that we try to fix in this PR is the one raised by @unkcpz (https://github.com/aiidateam/aiida-quantumespresso/issues/877), where the restart for the band structure calculation suddenly needs to explicitly set restart_mode
to from_scratch
. We can fix this by being explicit in the protocols, as we do here, but there are other higher-level work chains out there that might not have considered this.
Although the addition of the code above in https://github.com/aiidateam/aiida-quantumespresso/commit/0aba276f7042d51b91d5699530de7336cd62e2c2 makes it slightly more straightforward to restart from interrupted runs, it complicates many other use cases in a manner that can be confusing and frustrating for the user. We could think of ways to restrict when the restart_mode
input is set to restart
, but there are so many different use cases of restarting I think it's difficult if not impossible to consider all of them.
I think we should:
parent_folder
. E.g. it seems restarting an nscf
or bands
calculation with restart_mode == restart
is a valid use case for interrupted runs.PwCalculation
to those of the PwBaseWorkChain
even if the calculation didn't complete successfully.aiida-core
) Fix an issue related to setting nested Dict
content silently failing, see below.Side tangent: trying to set the restart_mode
directly on the parameters
node will of course not work:
restart_builder.pw.parameters['CONTROL']['restart_mode'] = 'restart'
since the node is the same one as the already run PwBaseWorkChain
and hence is stored and immutable. However, the above line of code will raise no warning or error! @sphuber I think this was already raised at some point, but not sure if there is an open issue for this. In short, trying to change a top-level key of the Dict
node raises:
In [1]: d = Dict({'a': 1})
...: d['b'] = 2
...: print(d.get_dict())
{'a': 1, 'b': 2}
In [2]: d.store()
...: d['c'] = 3
...: print(d.get_dict())
---------------------------------------------------------------------------
ModificationNotAllowed Traceback (most recent call last)
Cell In[2], line 2
1 d.store()
----> 2 d['c'] = 3
3 print(d.get_dict())
File ~/envs/aiida-dev/code/aiida-core/aiida/orm/nodes/data/dict.py:71, in Dict.__setitem__(self, key, value)
70 def __setitem__(self, key, value):
---> 71 self.base.attributes.set(key, value)
File ~/envs/aiida-dev/code/aiida-core/aiida/orm/nodes/attributes.py:118, in NodeAttributes.set(self, key, value)
110 def set(self, key: str, value: Any) -> None:
111 """Set an attribute to the given value.
112
113 :param key: name of the attribute
(...)
116 :raise aiida.common.ModificationNotAllowed: if the entity is stored
117 """
--> 118 self._node._check_mutability_attributes([key]) # pylint: disable=protected-access
119 self._backend_node.set_attribute(key, value)
File ~/envs/aiida-dev/code/aiida-core/aiida/orm/nodes/node.py:209, in Node._check_mutability_attributes(self, keys)
202 """Check if the entity is mutable and raise an exception if not.
203
204 This is called from `NodeAttributes` methods that modify the attributes.
205
206 :param keys: the keys that will be mutated, or all if None
207 """
208 if self.is_stored:
--> 209 raise exceptions.ModificationNotAllowed('the attributes of a stored entity are immutable')
ModificationNotAllowed: the attributes of a stored entity are immutable
But trying to change nested content of the Dict
node fails silently:
In [3]: d = Dict({'a': {'b': None, 'c': None}})
...: d['a']['b'] = 2
...: print(d.get_dict())
{'a': {'b': 2, 'c': None}}
In [4]: d.store()
...: d['a']['c'] = 3
...: print(d.get_dict())
{'a': {'b': 2, 'c': None}}
Thanks for the detailed writeup @mbercx , this is super important and useful. Let me address your points, starting with the last:
However, the above line of code will raise no warning or error! @sphuber I think this was already raised at some point, but not sure if there is an open issue for this.
I definitely remember writing about this in an issue and basically coming to the conclusion that it will be very difficult to come up with a solution for this. Essentially what is happening with the following code:
restart_builder.pw.parameters['CONTROL']['restart_mode'] = 'restart'
First restart_builder.pw.parameters['CONTROL']
calls _getitem__
with CONTROL
as key. As the implementation shows:
def __getitem__(self, key):
try:
return self.base.attributes.get(key)
except AttributeError as exc:
raise KeyError from exc
this will get the attribute from the database with that name and return it. The return type is a plain Python dictionary, let's call it d
. So next, the code calls d['restart_mode'] = 'restart
, which will indeed set the restart_mode
key to restart
, but that is of the Python dictionary d
and not of the node attributes.
The only "solution" I see here is to have Dict.__getitem__
not return a plain dict
but some custom class that implements all dict
methods, but overrides all setters and delete attributes (and potentially others) and either raises or emits a warning. Now while this may be possible, I see many potential pitfalls that will be difficult to foresee.
Then, as for your other points. I see how helping making one use-case easier, may hamstring another user. Maybe it is true that in the end, it is quite complicated and we cannot warn/prevent for all use-cases and will have to leave some up to documentation.
Add the outputs of the last PwCalculation to those of the PwBaseWorkChain even if the calculation didn't complete successfully.
Don't see a problem with that, fine to open a PR for this.
(https://github.com/aiidateam/aiida-quantumespresso/pull/906) Remove some of the warnings related to restarts from a parent_folder. E.g. it seems restarting an nscf or bands calculation with restart_mode == restart is a valid use case for interrupted runs.
:+1:
(This PR) Remove the code above, and add documentation on how to restart from interrupted runs to help with use case A.
Fine, but I will be sending all complaints of people, trying to restart and the calculation failing again because it is restarting from scratch, to you 😄
Fine, but I will be sending all complaints of people, trying to restart and the calculation failing again because it is restarting from scratch, to you 😄
Hehe, agreed! I think there is more risk of someone using the PwBaseWorkChain
running into a "cleaned working directory" is bigger though, see https://github.com/aiidateam/aiida-quantumespresso/issues/915.
Regarding the documentation, I'm always sort of paralysed when I start trying to add something to the docs, because there are so many things I would change I feel I would go down a documentation rabbit hole that would cause me sleep deprivation. So I've opened an issue for this instead:
https://github.com/aiidateam/aiida-quantumespresso/issues/916
I think we should have a meeting to discuss the documentation again, also in light of this decision:
Ok, I'm on board. Let's go in that direction then that protocols should be as complete as possible and get_builder_from_protocol should be the main recommended entry point for users.
EDIT: on the following:
The only "solution" I see here is to have Dict.getitem not return a plain dict but some custom class that implements all dict methods, but overrides all setters and delete attributes (and potentially others) and either raises or emits a warning.
I see, quite tricky indeed... I understand a solution is not simple, but I think we should thinking about doing so for stored nodes. It's quite an insidious bug that this passes without warning/error. Let me try and dig up some issues on this.
EDIT: Couldn't find an issue for this, so opened https://github.com/aiidateam/aiida-core/issues/5970
@sphuber OP message updated for new commit message, tests adapted. Ready for review!
That being said, I do think it is very important we add as fast as possible some minimal documentation that at the very least states that the workchains do not automatically set required defaults when launched directly and they can fail, and that therefore they are recommended to use the get_builder_from_protocol. I am pretty sure that there will be quite some users that do not use the protocols and will not set all the proper defaults because the workchains used to do that.
Agreed. I would love to give the documentation some love soon. Will try to find the time later this week!
Fixes #830
Some "default" inputs are set inside one of the steps of several work chains, which:
Here we move the specification of all static default inputs to the corresponding protocol file. Additionally:
bands
, thevalidate_inputs
validator of thePwCalculation
will raise a warning because aparent_folder
has not been initially provided. Hence, we set thevalidate_inputs_base
validator for thepw
port of thebands
namespace, as is also done for e.g. thenscf
of thePdosWorkchain
.base_final_scf
part of thePwRelaxWorkChain
protocol, that erroneously specifiedCELL.press_conv_thr
.PwBaseWorkChain.set_max_seconds
toPwBaseWorkChain.prepare_process
. Having a separate method for these few lines of code seemed unnecessary, especially since it's the only step that is executed inprepare_process
.RestartType.FULL
in case theparent_folder
input is provided for thePwBaseWorkChain
. There are simply too many different use cases of restarting I think it's difficult if not impossible to consider all of them, and in several known use cases this code addition would cause more harm than good.