aiidateam / aiida-quantumespresso

The official AiiDA plugin for Quantum ESPRESSO
https://aiida-quantumespresso.readthedocs.io
Other
52 stars 77 forks source link

PwBandsWorkchain.get_builder_from_protocol() not working on SGE machines #1010

Open stevenbos123 opened 4 months ago

stevenbos123 commented 4 months ago

The PwBandsWorkchain.get_builder_from_protocol() does not support SGE scheduler. This makes it cumbersome to do calculations SGE based machines. See error message below. This happens when providing a code that is run on a SGE computer.


ValueError                                Traceback (most recent call last)
Cell In[20], [line 4](vscode-notebook-cell:?execution_count=20&line=4)
      [1](vscode-notebook-cell:?execution_count=20&line=1) for pk in scf_pk_calculated:
      [2](vscode-notebook-cell:?execution_count=20&line=2)     node = load_node(pk)
----> [4](vscode-notebook-cell:?execution_count=20&line=4)     PwBandsWorkChain.get_builder_from_protocol(code=code,structure=node.inputs.structure)

File [~/aiida-venv/lib/python3.10/site-packages/aiida_quantumespresso/workflows/pw/bands.py:139](https://vscode-remote+wsl-002bubuntu.vscode-resource.vscode-cdn.net/home/steven/~/aiida-venv/lib/python3.10/site-packages/aiida_quantumespresso/workflows/pw/bands.py:139), in PwBandsWorkChain.get_builder_from_protocol(cls, code, structure, protocol, overrides, options, **kwargs)
    [136](https://vscode-remote+wsl-002bubuntu.vscode-resource.vscode-cdn.net/home/steven/~/aiida-venv/lib/python3.10/site-packages/aiida_quantumespresso/workflows/pw/bands.py:136) inputs = cls.get_protocol_inputs(protocol, overrides)
    [138](https://vscode-remote+wsl-002bubuntu.vscode-resource.vscode-cdn.net/home/steven/~/aiida-venv/lib/python3.10/site-packages/aiida_quantumespresso/workflows/pw/bands.py:138) args = (code, structure, protocol)
--> [139](https://vscode-remote+wsl-002bubuntu.vscode-resource.vscode-cdn.net/home/steven/~/aiida-venv/lib/python3.10/site-packages/aiida_quantumespresso/workflows/pw/bands.py:139) relax = PwRelaxWorkChain.get_builder_from_protocol(
    [140](https://vscode-remote+wsl-002bubuntu.vscode-resource.vscode-cdn.net/home/steven/~/aiida-venv/lib/python3.10/site-packages/aiida_quantumespresso/workflows/pw/bands.py:140)     *args, overrides=inputs.get('relax', None), options=options, **kwargs
    [141](https://vscode-remote+wsl-002bubuntu.vscode-resource.vscode-cdn.net/home/steven/~/aiida-venv/lib/python3.10/site-packages/aiida_quantumespresso/workflows/pw/bands.py:141) )
    [142](https://vscode-remote+wsl-002bubuntu.vscode-resource.vscode-cdn.net/home/steven/~/aiida-venv/lib/python3.10/site-packages/aiida_quantumespresso/workflows/pw/bands.py:142) scf = PwBaseWorkChain.get_builder_from_protocol(
    [143](https://vscode-remote+wsl-002bubuntu.vscode-resource.vscode-cdn.net/home/steven/~/aiida-venv/lib/python3.10/site-packages/aiida_quantumespresso/workflows/pw/bands.py:143)     *args, overrides=inputs.get('scf', None), options=options, **kwargs
    [144](https://vscode-remote+wsl-002bubuntu.vscode-resource.vscode-cdn.net/home/steven/~/aiida-venv/lib/python3.10/site-packages/aiida_quantumespresso/workflows/pw/bands.py:144) )
    [145](https://vscode-remote+wsl-002bubuntu.vscode-resource.vscode-cdn.net/home/steven/~/aiida-venv/lib/python3.10/site-packages/aiida_quantumespresso/workflows/pw/bands.py:145) bands = PwBaseWorkChain.get_builder_from_protocol(
    [146](https://vscode-remote+wsl-002bubuntu.vscode-resource.vscode-cdn.net/home/steven/~/aiida-venv/lib/python3.10/site-packages/aiida_quantumespresso/workflows/pw/bands.py:146)     *args, overrides=inputs.get('bands', None), options=options, **kwargs
    [147](https://vscode-remote+wsl-002bubuntu.vscode-resource.vscode-cdn.net/home/steven/~/aiida-venv/lib/python3.10/site-packages/aiida_quantumespresso/workflows/pw/bands.py:147) )

File [~/aiida-venv/lib/python3.10/site-packages/aiida_quantumespresso/workflows/pw/relax.py:160](https://vscode-remote+wsl-002bubuntu.vscode-resource.vscode-cdn.net/home/steven/~/aiida-venv/lib/python3.10/site-packages/aiida_quantumespresso/workflows/pw/relax.py:160), in PwRelaxWorkChain.get_builder_from_protocol(cls, code, structure, protocol, overrides, relax_type, options, **kwargs)
    [157](https://vscode-remote+wsl-002bubuntu.vscode-resource.vscode-cdn.net/home/steven/~/aiida-venv/lib/python3.10/site-packages/aiida_quantumespresso/workflows/pw/relax.py:157)         raise ValueError(f'Structures with periodic boundary conditions `{structure.pbc}` are not supported.')
    [159](https://vscode-remote+wsl-002bubuntu.vscode-resource.vscode-cdn.net/home/steven/~/aiida-venv/lib/python3.10/site-packages/aiida_quantumespresso/workflows/pw/relax.py:159) builder = cls.get_builder()
--> [160](https://vscode-remote+wsl-002bubuntu.vscode-resource.vscode-cdn.net/home/steven/~/aiida-venv/lib/python3.10/site-packages/aiida_quantumespresso/workflows/pw/relax.py:160) builder.base = base
    [161](https://vscode-remote+wsl-002bubuntu.vscode-resource.vscode-cdn.net/home/steven/~/aiida-venv/lib/python3.10/site-packages/aiida_quantumespresso/workflows/pw/relax.py:161) builder.base_final_scf = base_final_scf
...
    [126](https://vscode-remote+wsl-002bubuntu.vscode-resource.vscode-cdn.net/home/steven/~/aiida-venv/lib/python3.10/site-packages/aiida/engine/processes/builder.py:126) # of a ``ProcessBuilderNamespace`` for the port of the attribute that is being set and than iteratively set
    [127](https://vscode-remote+wsl-002bubuntu.vscode-resource.vscode-cdn.net/home/steven/~/aiida-venv/lib/python3.10/site-packages/aiida/engine/processes/builder.py:127) # all the values within the mapping that is being assigned to the attribute.
    [128](https://vscode-remote+wsl-002bubuntu.vscode-resource.vscode-cdn.net/home/steven/~/aiida-venv/lib/python3.10/site-packages/aiida/engine/processes/builder.py:128) if isinstance(port, PortNamespace):

ValueError: invalid attribute value input `metadata.options.resources` is not valid for the `SgeScheduler` scheduler: `parallel_env` must be specified and must be a string```
t-reents commented 4 months ago

Hi @stevenbos123 AiiDA does support SGE schedulers: https://aiida.readthedocs.io/projects/aiida-core/en/latest/topics/schedulers.html#sge

I've never used SGE myself but only Slurm. However, based on your error message and the documentation (https://aiida.readthedocs.io/projects/aiida-core/en/latest/topics/schedulers.html#parenvjobresource-sge-like), you would need to specify the attributes parallel_env and tot_num_mpiprocs in the metadata.options.resources of your CalcJob. The following should do the trick:

builder = PwBandsWorkChain.get_builder_from_protocol(
    code=code,
    structure=node.inputs.structure,
    options={'resources': {'parallel_env': 'mpi', 'tot_num_mpiprocs': 64}}
)

The provided values are only used to demonstrate the idea, you would need to adapt them accordingly, based on your system.

Hope that helps!

stevenbos123 commented 4 months ago

Hi @t-reents,

Thank you for the reply. It seems I did not provide enough information on my problem. Just like you suggested I specified the variables parallel_env and tot_num_mpiprocs. However this poses another problem, namely the variable num_machines is now automatically used as it is part of the default settings. It can be found here: src/aiida_quantumespresso/workflows/protocols/pw/base.yaml.

This poses a problem for SGE based systems as you do not provide the number of machines for them. How do I proceed from here? I provided the full stack trace below again for the updated problem.

Kind regards,

Steven


ValueError                                Traceback (most recent call last)
Cell In[58], [line 16](vscode-notebook-cell:?execution_count=58&line=16)
     [14](vscode-notebook-cell:?execution_count=58&line=14)     scf_parms = node.inputs.parameters
     [15](vscode-notebook-cell:?execution_count=58&line=15)     nscf_parms = parameters_builder_relaxed_nscf()
---> [16](vscode-notebook-cell:?execution_count=58&line=16)     builder = PwBandsWorkChain.get_builder_from_protocol(
     [17](vscode-notebook-cell:?execution_count=58&line=17)         code=code,
     [18](vscode-notebook-cell:?execution_count=58&line=18)         structure=node.inputs.structure,
     [19](vscode-notebook-cell:?execution_count=58&line=19)         options={'resources': {'parallel_env': 'mpi', 'tot_num_mpiprocs': 32}}
     [20](vscode-notebook-cell:?execution_count=58&line=20)     )
     [22](vscode-notebook-cell:?execution_count=58&line=22) #     inputs = {
     [23](vscode-notebook-cell:?execution_count=58&line=23) #     'code': code,
     [24](vscode-notebook-cell:?execution_count=58&line=24) #     'structure': structure,
   (...)
     [40](vscode-notebook-cell:?execution_count=58&line=40) #     bands_processes.append(process)
     [41](vscode-notebook-cell:?execution_count=58&line=41) #     bands_pks.append(process.pk)

File [~/aiida-venv/lib/python3.10/site-packages/aiida_quantumespresso/workflows/pw/bands.py:139](https://vscode-remote+wsl-002bubuntu.vscode-resource.vscode-cdn.net/home/steven/~/aiida-venv/lib/python3.10/site-packages/aiida_quantumespresso/workflows/pw/bands.py:139), in PwBandsWorkChain.get_builder_from_protocol(cls, code, structure, protocol, overrides, options, **kwargs)
    [136](https://vscode-remote+wsl-002bubuntu.vscode-resource.vscode-cdn.net/home/steven/~/aiida-venv/lib/python3.10/site-packages/aiida_quantumespresso/workflows/pw/bands.py:136) inputs = cls.get_protocol_inputs(protocol, overrides)
    [138](https://vscode-remote+wsl-002bubuntu.vscode-resource.vscode-cdn.net/home/steven/~/aiida-venv/lib/python3.10/site-packages/aiida_quantumespresso/workflows/pw/bands.py:138) args = (code, structure, protocol)
--> [139](https://vscode-remote+wsl-002bubuntu.vscode-resource.vscode-cdn.net/home/steven/~/aiida-venv/lib/python3.10/site-packages/aiida_quantumespresso/workflows/pw/bands.py:139) relax = PwRelaxWorkChain.get_builder_from_protocol(
    [140](https://vscode-remote+wsl-002bubuntu.vscode-resource.vscode-cdn.net/home/steven/~/aiida-venv/lib/python3.10/site-packages/aiida_quantumespresso/workflows/pw/bands.py:140)     *args, overrides=inputs.get('relax', None), options=options, **kwargs
    [141](https://vscode-remote+wsl-002bubuntu.vscode-resource.vscode-cdn.net/home/steven/~/aiida-venv/lib/python3.10/site-packages/aiida_quantumespresso/workflows/pw/bands.py:141) )
    [142](https://vscode-remote+wsl-002bubuntu.vscode-resource.vscode-cdn.net/home/steven/~/aiida-venv/lib/python3.10/site-packages/aiida_quantumespresso/workflows/pw/bands.py:142) scf = PwBaseWorkChain.get_builder_from_protocol(
    [143](https://vscode-remote+wsl-002bubuntu.vscode-resource.vscode-cdn.net/home/steven/~/aiida-venv/lib/python3.10/site-packages/aiida_quantumespresso/workflows/pw/bands.py:143)     *args, overrides=inputs.get('scf', None), options=options, **kwargs
    [144](https://vscode-remote+wsl-002bubuntu.vscode-resource.vscode-cdn.net/home/steven/~/aiida-venv/lib/python3.10/site-packages/aiida_quantumespresso/workflows/pw/bands.py:144) )
    [145](https://vscode-remote+wsl-002bubuntu.vscode-resource.vscode-cdn.net/home/steven/~/aiida-venv/lib/python3.10/site-packages/aiida_quantumespresso/workflows/pw/bands.py:145) bands = PwBaseWorkChain.get_builder_from_protocol(
    [146](https://vscode-remote+wsl-002bubuntu.vscode-resource.vscode-cdn.net/home/steven/~/aiida-venv/lib/python3.10/site-packages/aiida_quantumespresso/workflows/pw/bands.py:146)     *args, overrides=inputs.get('bands', None), options=options, **kwargs
    [147](https://vscode-remote+wsl-002bubuntu.vscode-resource.vscode-cdn.net/home/steven/~/aiida-venv/lib/python3.10/site-packages/aiida_quantumespresso/workflows/pw/bands.py:147) )

File [~/aiida-venv/lib/python3.10/site-packages/aiida_quantumespresso/workflows/pw/relax.py:160](https://vscode-remote+wsl-002bubuntu.vscode-resource.vscode-cdn.net/home/steven/~/aiida-venv/lib/python3.10/site-packages/aiida_quantumespresso/workflows/pw/relax.py:160), in PwRelaxWorkChain.get_builder_from_protocol(cls, code, structure, protocol, overrides, relax_type, options, **kwargs)
    [157](https://vscode-remote+wsl-002bubuntu.vscode-resource.vscode-cdn.net/home/steven/~/aiida-venv/lib/python3.10/site-packages/aiida_quantumespresso/workflows/pw/relax.py:157)         raise ValueError(f'Structures with periodic boundary conditions `{structure.pbc}` are not supported.')
    [159](https://vscode-remote+wsl-002bubuntu.vscode-resource.vscode-cdn.net/home/steven/~/aiida-venv/lib/python3.10/site-packages/aiida_quantumespresso/workflows/pw/relax.py:159) builder = cls.get_builder()
--> [160](https://vscode-remote+wsl-002bubuntu.vscode-resource.vscode-cdn.net/home/steven/~/aiida-venv/lib/python3.10/site-packages/aiida_quantumespresso/workflows/pw/relax.py:160) builder.base = base
    [161](https://vscode-remote+wsl-002bubuntu.vscode-resource.vscode-cdn.net/home/steven/~/aiida-venv/lib/python3.10/site-packages/aiida_quantumespresso/workflows/pw/relax.py:161) builder.base_final_scf = base_final_scf
    [162](https://vscode-remote+wsl-002bubuntu.vscode-resource.vscode-cdn.net/home/steven/~/aiida-venv/lib/python3.10/site-packages/aiida_quantumespresso/workflows/pw/relax.py:162) builder.structure = structure

File [~/aiida-venv/lib/python3.10/site-packages/aiida/engine/processes/builder.py:120](https://vscode-remote+wsl-002bubuntu.vscode-resource.vscode-cdn.net/home/steven/~/aiida-venv/lib/python3.10/site-packages/aiida/engine/processes/builder.py:120), in ProcessBuilderNamespace.__setattr__(self, attr, value)
    [118](https://vscode-remote+wsl-002bubuntu.vscode-resource.vscode-cdn.net/home/steven/~/aiida-venv/lib/python3.10/site-packages/aiida/engine/processes/builder.py:118)     validation_error = port.validate(value)  # type: ignore[union-attr]
    [119](https://vscode-remote+wsl-002bubuntu.vscode-resource.vscode-cdn.net/home/steven/~/aiida-venv/lib/python3.10/site-packages/aiida/engine/processes/builder.py:119)     if validation_error:
--> [120](https://vscode-remote+wsl-002bubuntu.vscode-resource.vscode-cdn.net/home/steven/~/aiida-venv/lib/python3.10/site-packages/aiida/engine/processes/builder.py:120)         raise ValueError(f'invalid attribute value {validation_error.message}')
    [122](https://vscode-remote+wsl-002bubuntu.vscode-resource.vscode-cdn.net/home/steven/~/aiida-venv/lib/python3.10/site-packages/aiida/engine/processes/builder.py:122) # If the attribute that is being set corresponds to a port that is a ``PortNamespace`` we need to make sure
    [123](https://vscode-remote+wsl-002bubuntu.vscode-resource.vscode-cdn.net/home/steven/~/aiida-venv/lib/python3.10/site-packages/aiida/engine/processes/builder.py:123) # that the nested value remains a ``ProcessBuilderNamespace``. Otherwise, the nested namespaces will become
    [124](https://vscode-remote+wsl-002bubuntu.vscode-resource.vscode-cdn.net/home/steven/~/aiida-venv/lib/python3.10/site-packages/aiida/engine/processes/builder.py:124) # plain dictionaries and no longer have the properties of the ``ProcessBuilderNamespace`` that provide all
    [125](https://vscode-remote+wsl-002bubuntu.vscode-resource.vscode-cdn.net/home/steven/~/aiida-venv/lib/python3.10/site-packages/aiida/engine/processes/builder.py:125) # the autocompletion and validation when values are being set. Therefore we first construct a new instance
    [126](https://vscode-remote+wsl-002bubuntu.vscode-resource.vscode-cdn.net/home/steven/~/aiida-venv/lib/python3.10/site-packages/aiida/engine/processes/builder.py:126) # of a ``ProcessBuilderNamespace`` for the port of the attribute that is being set and than iteratively set
    [127](https://vscode-remote+wsl-002bubuntu.vscode-resource.vscode-cdn.net/home/steven/~/aiida-venv/lib/python3.10/site-packages/aiida/engine/processes/builder.py:127) # all the values within the mapping that is being assigned to the attribute.
    [128](https://vscode-remote+wsl-002bubuntu.vscode-resource.vscode-cdn.net/home/steven/~/aiida-venv/lib/python3.10/site-packages/aiida/engine/processes/builder.py:128) if isinstance(port, PortNamespace):

ValueError: invalid attribute value input `metadata.options.resources` is not valid for the `SgeScheduler` scheduler: these parameters were not recognized: num_machines```
t-reents commented 4 months ago

Thanks for the additional details, @stevenbos123! You're right, the problem comes from the default protocols, that specify num_machines. As a first fix, to get your jobs running, you could simply comment out/remove the following two lines: https://github.com/aiidateam/aiida-quantumespresso/blob/d645069d1d6ac40d6a23e4bbf49b01b51f5bb33c/src/aiida_quantumespresso/workflows/protocols/pw/base.yaml#L13-L14

If you provide the options as discussed before, this error shouldn't occur anymore.

Important to note: Apparently, the current protocols are designed to work with Slurm out of the box, so you wouldn't necessarily need to provide the options. If you comment out the lines above, you would also always need to provide the corresponding slurm resources via the options (in case you want to run a slurm job).

Another option would be to define scheduler dependent protocols, e.g. default_sge and default_slurm. To add the new protocols, you would need to further adjust the protocol file:

# Additional lines
...
protocols:
    moderate:
        description: 'Protocol to perform the computation at normal precision at moderate computational cost.'
    default_slurm:
        description: 'Default moderate protocol for slurm.'
        pw:
            metadata:
                options:
                    resources:
                        num_machines: 1
    default_sge:
        description: 'Default moderate protocol for sge.'
        pw:
            metadata:
                options:
                    resources:
                        parallel_env: mpi
                        tot_num_mpiprocs: 64
    precise:
...
# Additional lines

You would also need to add the new protocols to the protocol files for relax and bands. In that case, it would be sufficient to add the protocol name and the description, as the resources are only relevant in the base file.

I couldn't test the SGE protocol, but this would be the idea. This way, you should be able to simply call

builder = PwBandsWorkChain.get_builder_from_protocol(
    code=code,
    structure=node.inputs.structure,
    protocol='default_sge'
)

even without specifying the options (as I moved them to the protocol in this example). Again, the first option should already work, by just removing the corresponding lines and providing the options to the get_builder_from_protocol method.

t-reents commented 4 months ago

@mbercx Without thinking too much about it, might it make sense to introduce some logic in the get_builder_from_protocol, that chooses the default resources based on the scheduler of the Code and corresponding Computer? Otherwise, one would need to have these kind of scheduler dependent protocols, if the user should be able to use these kind of protocols, no matter which scheduler is used.

Maybe I'm also overlooking something.

t-reents commented 4 months ago

@stevenbos123 I realized that I made a small mistake in the first version of my comment, I just edited it. To make it easier, I adjusted the protocols on a local branch, in case you would like to use the second option with the new protocols. You can check it out here:

https://github.com/t-reents/aiida-quantumespresso/tree/fix/sge_protocol

If you clone it, you can change to the directory (remember to git checkout the correct branch) and install it via pip install -e . in your virtual environment.

sphuber commented 4 months ago

Thanks for the analysis @t-reents . I would personally be hesitant to put scheduler based logic in the protocols or even make scheduler specific protocols. You would have to start to do this everywhere and what happens if there is yet another scheduler plugin that you didn't cover yet.

If the scheduler would implement some method to provide default resources that the workflow/protocol could just call blindly, that could work. But I don't think that exists yet in the scheduler interface.

It is true that the current implementation is already scheduler specific due to our own SLURM background, but I think we should rather remove that and make it completely agnostic than start adding more scheduler specific logic. The protocols automate almost everything already. I think it is more than fair to ask the user to think at least about the resources they want to use and specify those. As long as that is properly documented of course.

t-reents commented 4 months ago

Thank you for joining @sphuber.

I would personally be hesitant to put scheduler based logic in the protocols or even make scheduler specific protocols

Yes, this is definitely true. The examples were mainly intended to show how @stevenbos123 could adapt the protocols, in case he prefers that, and to get his setup running. In general, I would also not vote for creating scheduler-specific protocols as one would also need to duplicate them for the different "levels", e.g. precision and fast.

If the scheduler would implement some method to provide default resources that the workflow/protocol could just call blindly, that could work. But I don't think that exists yet in the scheduler interface.

This is what I was thinking about in my comment to Marnik, but I also don't think that it'd be worth to implement this/make things more complicated, for the single purpose of providing initial resources.

As long as that is properly documented of course.

I think this is the most important point of this issue. So far, I've also expected that the protocols work just out of the box, and I've never thought about any scheduler dependency (as I think it's not explicitly mentioned anywhere, especially for new users).

So to sum up, we could think of these two (simple) options, right?

  1. Keep the current protocols but clearly state and document that those are SLURM based and document what the user would need to adjust for other schedulers
  2. Make the protocols independent of the schedulers, so to think of the protocols as a set of tested and reasonable parameters for calculations, while the specification of the resources is a responsibility of the user (which makes sense in a way, as systems will vary anyway). In this case, we would need to remove the inputs for metadata.options.resource from the protocols and probably make the options argument in get_builder_from_protocol non-default, right?
sphuber commented 4 months ago

Fully agree with the two options. Since they are not really mutually exclusive, we could already start with the first and simply improve the docs. We can then add a deprecation warning id no explicit resources are.provided saying this will start to error in the future. And then at some point we can remove the resource defaults to make the protocols scheduler agnostic

t-reents commented 4 months ago

Great! If you don't mind, I can open a PR in the upcoming days.

JPchico commented 1 month ago

Hi! Any update on this? We are also having this problem, I commented out the offending line in the protocol, but I wanted to know if there is a more long term solution in the pipeline.

t-reents commented 1 month ago

Hi @JPchico! I just continued my work on the corresponding PR introducing the warnings, sorry for the delay! I guess we will wait for some time after merging it and then remove the Slurm specific line in a future release, as indicated by @sphuber above.