aiidateam / aiida-quantumespresso

The official AiiDA plugin for Quantum ESPRESSO
https://aiida-quantumespresso.readthedocs.io
Other
52 stars 77 forks source link

Protocols: Move all static work chain inputs to protocol #902

Closed mbercx closed 1 year ago

mbercx commented 1 year ago

Fixes #830

Some "default" inputs are set inside one of the steps of several work chains, which:

  1. Can be confusing for the user, since the expected inputs are not there in the input files.
  2. Can be frustrating for the user when a different value is desirable, for a use case that may not immediately be obvious.
  3. Means default values are specified both in the work chain logic and protocol, making it more difficult to get a clear overview of the input parameters.

Here we move the specification of all static default inputs to the corresponding protocol file. Additionally:

sphuber commented 1 year ago

Finally, we switch to using 2 spaces for the indentation of the protocol YAML files to make them easier to read.

I realize that this is always going to be opinion-based, but I actually feel the opposite is true. I find the 2-space indentation very confusing and makes it difficult to see what level of nesting I am looking at. Why would we decide that 4 spaces for code is correct, but not for the configuration? Anyway, since this distracts from this PR and also makes the actual changes to the configs more difficult to spot, could you please remove this for now? Happy to discuss this in a separate issue/PR and if others users also really prefer 2-space, then I won't stop it.

Once the tests are then also running, I will review this. Thanks!

mbercx commented 1 year ago

Anyway, since this distracts from this PR and also makes the actual changes to the configs more difficult to spot, could you please remove this for now?

Sure! My preference for 2-space indentations may also be due to the fact that VScode adds vertical lines for the indentation in YAML files. The more compact 2-space indentations make it easier to see settings at a glance for me. But it's not critical.

Note that this PR is built on top of https://github.com/aiidateam/aiida-quantumespresso/pull/903, so is still blocked until we merge that.

Edit: Nevermind, that was kind of unnecessary anyways!

sphuber commented 1 year ago

Sure! My preference for 2-space indentations may also be due to the fact that VScode adds vertical lines for the indentation in YAML files. The more compact 2-space indentations make it easier to see settings at a glance for me. But it's not critical.

My editor also shows the tab-lines, but with 2-spaces, it just all merges into one big blob of lines 😅

mbercx commented 1 year ago

My editor also shows the tab-lines, but with 2-spaces, it just all merges into one big blob of lines 😅

This may be related to the fact that you use such ridiculously small fonts that only your Dutch-elvish eyes can read 👀

mbercx commented 1 year ago

Alright, I also added the restart inputs explicitly, and switched to using paro for the diagonalization. She's ready for review now, @sphuber.

sphuber commented 1 year ago

Before going into the review, two points that came to mind:

  1. Although I agree that it might make sense to "concentrate" the logic surrounding defaults, moving everything to the protocol has two downsides that I can think of:
    • It cannot use logic, and so "dynamic" defaults will still have to be set in code somewhere, that can still end up overriding initial inputs as defined in the protocol. So I am not sure if we can ever fully get around this. But maybe having slightly more concentrated logic is better than nothing.
    • The defaults and checks will now only be considered if the user goes through get_builder_from_protocol. When building the inputs manually and submitting directly, all of this is bypassed. We can still decide that users should simply (always) go through the get_builder_from_protocol but we should make this super duper clear.
  2. This only addresses the PwBandsWorkChain, but as I understand there are still "problems" downstream. I don't think this PR addresses #877 (not that you marked it as such, but just mentioning this). The warning is issued in the PwCalculation.validate_inputs_base and is actually incorrect. There are valid cases where you want restart_mode = 'restart' for an NSCF calculation. We should remove that warning. The PwBaseWorkChain.setup method is actually still overriding various input parameters. Is that logic still correct? Should that also be done in the protocol/get_builder_from_protocol? At least for the initialization maybe, the set_restart_type is still needed for actual restarts within the workchain. Anyway, what I am saying is that we might need to take a look at the whole stack and make sure it is consistent, no?
mbercx commented 1 year ago

Thanks for your thoughts, @sphuber!

It cannot use logic, and so "dynamic" defaults will still have to be set in code somewhere, that can still end up overriding initial inputs as defined in the protocol.

That's definitely true. We should probably try to report such dynamically set inputs so the user can more easily find them.

The defaults and checks will now only be considered if the user goes through get_builder_from_protocol.

Another good point! I would argue beginners are now immediately introduced to the get_builder_from_protocol() method when starting to run work chains, and only advanced users will be able to set up the inputs from scratch. I would definitely put default inputs in the protocol, to avoid the issue I mention here. We could consider also setting defaults (i.e. never fully override) inputs in the work chain logic, but then there is some duplicated information...

I'm all for making it super duper clear that the using the get_builder_from_protocol() is the way to go.

There are valid cases where you want restart_mode = 'restart' for an NSCF calculation.

Are there? In the QE docs on the restart_mode input:

https://www.quantum-espresso.org/Doc/INPUT_PW.html#idm65

The restart option clearly states: "Do not use to start a new one, or to perform a non-scf calculations."

The PwBaseWorkChain.setup method is actually still overriding various input parameters. Is that logic still correct?

Haven't looked at these in the context of this PR, but happy to make it more "holistic", i.e. try to implement this principle downstream. Will give a crack at this tomorrow and report back!

mbercx commented 1 year ago

One additional thing to note here is that we should really try and come at this from the perspective of the user, and perhaps it would be good to write down as many use cases as we can think of.

sphuber commented 1 year ago

I'm all for making it super duper clear that the using the get_builder_from_protocol() is the way to go.

Ok, I'm on board. Let's go in that direction then that protocols should be as complete as possible and get_builder_from_protocol should be the main recommended entry point for users.

Are there? In the QE docs on the restart_mode input: https://www.quantum-espresso.org/Doc/INPUT_PW.html#idm65 The restart option clearly states: "Do not use to start a new one, or to perform a non-scf calculations."

As I understood it, this is to restart an interrupted NSCF calculation. But reading the docs again, it is a bit ambiguous, and thinking about it now, I am not sure if an NSCF calculation could be performed partially and shutdown cleanly and then restarted to perform the rest. Maybe it can actually do part of all the k-points and then write to disk, and you can restart to do the final set of k-points? Would be interesting to test.

One additional thing to note here is that we should really try and come at this from the perspective of the user, and perhaps it would be good to write down as many use cases as we can think of.

Fully agree, but that is the tricky point here. For novice users, you want to do as much as possible in an automatic way, but also correct inputs that are most likely incorrect. This is where it becomes difficult though, because sometimes there are expert users who do something on purpose and there is nothing more frustrating than the code overriding your desired behavior and not allowing to change it. But in that sense, maybe the get_builder_from_protocol is the best solution as we can make it as automated as possible for novice users, while power users can set anything on the builder afterwards without the workchain then interfering and double-guessing them.

mbercx commented 1 year ago

Maybe it can actually do part of all the k-points and then write to disk, and you can restart to do the final set of k-points? Would be interesting to test.

Well, I stand corrected, you actually can ^^. And this will indeed only work when restart_mode is set to restart. So setting this input is a legitimate use case for nscf and bands calculations.

I'm still conflicted on what to do with this code though:

https://github.com/aiidateam/aiida-quantumespresso/blob/5cae75ff671268dc1e5684e1c84f801a10839e0b/src/aiida_quantumespresso/workflows/pw/base.py#L247-L252

Let me go over some use cases to discuss this change and if we want to keep it.

Use case A

Let me first repeat here the use case you mentioned here that was the reason for adding this code:

This resulted from a case from a user. They launched a calculation and it didn't finish. I pointed them to get_builder_restart as an easy way to get a builder that can be launched to restart and finish the calculation. However, as described in the commit message, this would still keep from_scratch and essentially the calculation would be redone and not finish.

Example run of use case. ```python builder = PwBaseWorkChain.get_builder_from_protocol( code=pw_code, structure=structure, ) builder.clean_workdir = orm.Bool(False) builder.max_iterations = orm.Int(1) builder.pw.parameters['CONTROL']['max_seconds'] = 20 wc_node = submit(builder) while not wc_node.is_finished: time.sleep(5) parent_folder = wc_node.called_descendants[-1].outputs.remote_folder restart_builder = wc_node.get_builder_restart() restart_builder.pw.parent_folder = parent_folder submit(restart_builder) ```

A couple of comments here:

  1. This will only work in case the user set clean_workdir to False for the initial run. The default for all protocols is True.
  2. The user will still have to get the remote_folder from the last PwCalculation in the failed PwBaseWorkChain. By testing this, I actually found out the outputs or the wrapped PwCalculation are not added in case the PwBaseWorkChain fails. Perhaps this is something we should fix?
  3. Restarting the calculation like this will raise a warning:
    /home/mbercx/envs/aiida-dev/code/aiida-quantumespresso/src/aiida_quantumespresso/calculations/pw.py:178: 
    UserWarning: `parent_folder` input was provided for the `PwCalculation`, but no input parameters are set to restart from 
    these files.

    Which I still believe to be a fair warning to have.

Without the code that sets the restart_mode when a parent_folder is present, the user would have to add the following code:

parameters = restart_builder.pw.parameters.get_dict()
parameters['CONTROL']['restart_mode'] = 'restart'
restart_builder.pw.parameters = orm.Dict(parameters)

A bit tedious perhaps, but not insurmountable. And the validation warning implemented on the PwCalculation inputs will point them in this direction (but should perhaps be made more clear).

Use case B

The user has successfully completed a PwBaseWorkChain, and they want to perform a calculation on top of this one starting from the charge density. Since from_scratch is the default setting for restart_mode, they will simply add the parent_folder input and set the ELECTRONS.starting_pot parameter to file.

By adding the code above, restart_mode will now be set to restart and the startingpot input will be removed. Now, in many cases the QE behaviour will still be as desired. It seems QE will ignore the wave function files in case some parts of the setup have changed (e.g. number of bands/kpoints) or the wave function files are missing (restart from stashed charge density). However, there may be use cases where the restart behaviour is not as the user desired:

  1. The restart calculation only changes occupations == fixed. In this case it seems QE will restart from both the charge density and wave functions:

        The initial density is read from file :
        ./out/aiida.save/charge-density
        starting charge       8.0001, renormalised to       8.0000
        Starting wfcs from file

    Which may be undesirable, also see this comment.

  2. The restart calculation wants to use a different geometry but start from the charge density (there may be use cases for this). With restart_mode == restart, the geometry will be read from the previous output, not the new one provided in the inputs:

         Atomic positions and unit cell read from directory:
        ./out/aiida.save/

I can probably think of others. The user could of course explicitly set restart_mode to from_scratch to avoid this, but I don't think we can expect a user to have to consider this, since this is the default of restart_mode in QE.

Use case C

Many work chains rely on restarts from previous steps in the work chain. The example that we try to fix in this PR is the one raised by @unkcpz (https://github.com/aiidateam/aiida-quantumespresso/issues/877), where the restart for the band structure calculation suddenly needs to explicitly set restart_mode to from_scratch. We can fix this by being explicit in the protocols, as we do here, but there are other higher-level work chains out there that might not have considered this.

Conclusion

Although the addition of the code above in https://github.com/aiidateam/aiida-quantumespresso/commit/0aba276f7042d51b91d5699530de7336cd62e2c2 makes it slightly more straightforward to restart from interrupted runs, it complicates many other use cases in a manner that can be confusing and frustrating for the user. We could think of ways to restrict when the restart_mode input is set to restart, but there are so many different use cases of restarting I think it's difficult if not impossible to consider all of them.

I think we should:

  1. (This PR) Remove the code above, and add documentation on how to restart from interrupted runs to help with use case A.
  2. (https://github.com/aiidateam/aiida-quantumespresso/pull/906) Remove some of the warnings related to restarts from a parent_folder. E.g. it seems restarting an nscf or bands calculation with restart_mode == restart is a valid use case for interrupted runs.
  3. (#923) Add the outputs of the last PwCalculation to those of the PwBaseWorkChain even if the calculation didn't complete successfully.
  4. (aiida-core) Fix an issue related to setting nested Dict content silently failing, see below.

Side tangent: trying to set the restart_mode directly on the parameters node will of course not work:

restart_builder.pw.parameters['CONTROL']['restart_mode'] = 'restart'

since the node is the same one as the already run PwBaseWorkChain and hence is stored and immutable. However, the above line of code will raise no warning or error! @sphuber I think this was already raised at some point, but not sure if there is an open issue for this. In short, trying to change a top-level key of the Dict node raises:

In [1]: d = Dict({'a': 1})
   ...: d['b'] = 2
   ...: print(d.get_dict())
{'a': 1, 'b': 2}

In [2]: d.store()
   ...: d['c'] = 3
   ...: print(d.get_dict())
---------------------------------------------------------------------------
ModificationNotAllowed                    Traceback (most recent call last)
Cell In[2], line 2
      1 d.store()
----> 2 d['c'] = 3
      3 print(d.get_dict())

File ~/envs/aiida-dev/code/aiida-core/aiida/orm/nodes/data/dict.py:71, in Dict.__setitem__(self, key, value)
     70 def __setitem__(self, key, value):
---> 71     self.base.attributes.set(key, value)

File ~/envs/aiida-dev/code/aiida-core/aiida/orm/nodes/attributes.py:118, in NodeAttributes.set(self, key, value)
    110 def set(self, key: str, value: Any) -> None:
    111     """Set an attribute to the given value.
    112 
    113     :param key: name of the attribute
   (...)
    116     :raise aiida.common.ModificationNotAllowed: if the entity is stored
    117     """
--> 118     self._node._check_mutability_attributes([key])  # pylint: disable=protected-access
    119     self._backend_node.set_attribute(key, value)

File ~/envs/aiida-dev/code/aiida-core/aiida/orm/nodes/node.py:209, in Node._check_mutability_attributes(self, keys)
    202 """Check if the entity is mutable and raise an exception if not.
    203 
    204 This is called from `NodeAttributes` methods that modify the attributes.
    205 
    206 :param keys: the keys that will be mutated, or all if None
    207 """
    208 if self.is_stored:
--> 209     raise exceptions.ModificationNotAllowed('the attributes of a stored entity are immutable')

ModificationNotAllowed: the attributes of a stored entity are immutable

But trying to change nested content of the Dict node fails silently:

In [3]: d = Dict({'a': {'b': None, 'c': None}})
   ...: d['a']['b'] = 2
   ...: print(d.get_dict())
{'a': {'b': 2, 'c': None}}

In [4]: d.store()
   ...: d['a']['c'] = 3
   ...: print(d.get_dict())
{'a': {'b': 2, 'c': None}}
sphuber commented 1 year ago

Thanks for the detailed writeup @mbercx , this is super important and useful. Let me address your points, starting with the last:

However, the above line of code will raise no warning or error! @sphuber I think this was already raised at some point, but not sure if there is an open issue for this.

I definitely remember writing about this in an issue and basically coming to the conclusion that it will be very difficult to come up with a solution for this. Essentially what is happening with the following code:

restart_builder.pw.parameters['CONTROL']['restart_mode'] = 'restart'

First restart_builder.pw.parameters['CONTROL'] calls _getitem__ with CONTROL as key. As the implementation shows:

    def __getitem__(self, key):
        try:
            return self.base.attributes.get(key)
        except AttributeError as exc:
            raise KeyError from exc

this will get the attribute from the database with that name and return it. The return type is a plain Python dictionary, let's call it d. So next, the code calls d['restart_mode'] = 'restart, which will indeed set the restart_mode key to restart, but that is of the Python dictionary d and not of the node attributes.

The only "solution" I see here is to have Dict.__getitem__ not return a plain dict but some custom class that implements all dict methods, but overrides all setters and delete attributes (and potentially others) and either raises or emits a warning. Now while this may be possible, I see many potential pitfalls that will be difficult to foresee.

Then, as for your other points. I see how helping making one use-case easier, may hamstring another user. Maybe it is true that in the end, it is quite complicated and we cannot warn/prevent for all use-cases and will have to leave some up to documentation.

Add the outputs of the last PwCalculation to those of the PwBaseWorkChain even if the calculation didn't complete successfully.

Don't see a problem with that, fine to open a PR for this.

(https://github.com/aiidateam/aiida-quantumespresso/pull/906) Remove some of the warnings related to restarts from a parent_folder. E.g. it seems restarting an nscf or bands calculation with restart_mode == restart is a valid use case for interrupted runs.

:+1:

(This PR) Remove the code above, and add documentation on how to restart from interrupted runs to help with use case A.

Fine, but I will be sending all complaints of people, trying to restart and the calculation failing again because it is restarting from scratch, to you 😄

mbercx commented 1 year ago

Fine, but I will be sending all complaints of people, trying to restart and the calculation failing again because it is restarting from scratch, to you 😄

Hehe, agreed! I think there is more risk of someone using the PwBaseWorkChain running into a "cleaned working directory" is bigger though, see https://github.com/aiidateam/aiida-quantumespresso/issues/915.

Regarding the documentation, I'm always sort of paralysed when I start trying to add something to the docs, because there are so many things I would change I feel I would go down a documentation rabbit hole that would cause me sleep deprivation. So I've opened an issue for this instead:

https://github.com/aiidateam/aiida-quantumespresso/issues/916

I think we should have a meeting to discuss the documentation again, also in light of this decision:

Ok, I'm on board. Let's go in that direction then that protocols should be as complete as possible and get_builder_from_protocol should be the main recommended entry point for users.

EDIT: on the following:

The only "solution" I see here is to have Dict.getitem not return a plain dict but some custom class that implements all dict methods, but overrides all setters and delete attributes (and potentially others) and either raises or emits a warning.

I see, quite tricky indeed... I understand a solution is not simple, but I think we should thinking about doing so for stored nodes. It's quite an insidious bug that this passes without warning/error. Let me try and dig up some issues on this.

EDIT: Couldn't find an issue for this, so opened https://github.com/aiidateam/aiida-core/issues/5970

mbercx commented 1 year ago

@sphuber OP message updated for new commit message, tests adapted. Ready for review!

mbercx commented 1 year ago

That being said, I do think it is very important we add as fast as possible some minimal documentation that at the very least states that the workchains do not automatically set required defaults when launched directly and they can fail, and that therefore they are recommended to use the get_builder_from_protocol. I am pretty sure that there will be quite some users that do not use the protocols and will not set all the proper defaults because the workchains used to do that.

Agreed. I would love to give the documentation some love soon. Will try to find the time later this week!