Open stevenbos123 opened 4 months ago
Hi @stevenbos123
AiiDA
does support SGE
schedulers: https://aiida.readthedocs.io/projects/aiida-core/en/latest/topics/schedulers.html#sge
I've never used SGE
myself but only Slurm
. However, based on your error message and the documentation (https://aiida.readthedocs.io/projects/aiida-core/en/latest/topics/schedulers.html#parenvjobresource-sge-like), you would need to specify the attributes parallel_env
and tot_num_mpiprocs
in the metadata.options.resources
of your CalcJob
.
The following should do the trick:
builder = PwBandsWorkChain.get_builder_from_protocol(
code=code,
structure=node.inputs.structure,
options={'resources': {'parallel_env': 'mpi', 'tot_num_mpiprocs': 64}}
)
The provided values are only used to demonstrate the idea, you would need to adapt them accordingly, based on your system.
Hope that helps!
Hi @t-reents,
Thank you for the reply. It seems I did not provide enough information on my problem. Just like you suggested I specified the variables parallel_env
and tot_num_mpiprocs
. However this poses another problem, namely the variable num_machines
is now automatically used as it is part of the default settings. It can be found here:
src/aiida_quantumespresso/workflows/protocols/pw/base.yaml.
This poses a problem for SGE based systems as you do not provide the number of machines for them. How do I proceed from here? I provided the full stack trace below again for the updated problem.
Kind regards,
Steven
ValueError Traceback (most recent call last)
Cell In[58], [line 16](vscode-notebook-cell:?execution_count=58&line=16)
[14](vscode-notebook-cell:?execution_count=58&line=14) scf_parms = node.inputs.parameters
[15](vscode-notebook-cell:?execution_count=58&line=15) nscf_parms = parameters_builder_relaxed_nscf()
---> [16](vscode-notebook-cell:?execution_count=58&line=16) builder = PwBandsWorkChain.get_builder_from_protocol(
[17](vscode-notebook-cell:?execution_count=58&line=17) code=code,
[18](vscode-notebook-cell:?execution_count=58&line=18) structure=node.inputs.structure,
[19](vscode-notebook-cell:?execution_count=58&line=19) options={'resources': {'parallel_env': 'mpi', 'tot_num_mpiprocs': 32}}
[20](vscode-notebook-cell:?execution_count=58&line=20) )
[22](vscode-notebook-cell:?execution_count=58&line=22) # inputs = {
[23](vscode-notebook-cell:?execution_count=58&line=23) # 'code': code,
[24](vscode-notebook-cell:?execution_count=58&line=24) # 'structure': structure,
(...)
[40](vscode-notebook-cell:?execution_count=58&line=40) # bands_processes.append(process)
[41](vscode-notebook-cell:?execution_count=58&line=41) # bands_pks.append(process.pk)
File [~/aiida-venv/lib/python3.10/site-packages/aiida_quantumespresso/workflows/pw/bands.py:139](https://vscode-remote+wsl-002bubuntu.vscode-resource.vscode-cdn.net/home/steven/~/aiida-venv/lib/python3.10/site-packages/aiida_quantumespresso/workflows/pw/bands.py:139), in PwBandsWorkChain.get_builder_from_protocol(cls, code, structure, protocol, overrides, options, **kwargs)
[136](https://vscode-remote+wsl-002bubuntu.vscode-resource.vscode-cdn.net/home/steven/~/aiida-venv/lib/python3.10/site-packages/aiida_quantumespresso/workflows/pw/bands.py:136) inputs = cls.get_protocol_inputs(protocol, overrides)
[138](https://vscode-remote+wsl-002bubuntu.vscode-resource.vscode-cdn.net/home/steven/~/aiida-venv/lib/python3.10/site-packages/aiida_quantumespresso/workflows/pw/bands.py:138) args = (code, structure, protocol)
--> [139](https://vscode-remote+wsl-002bubuntu.vscode-resource.vscode-cdn.net/home/steven/~/aiida-venv/lib/python3.10/site-packages/aiida_quantumespresso/workflows/pw/bands.py:139) relax = PwRelaxWorkChain.get_builder_from_protocol(
[140](https://vscode-remote+wsl-002bubuntu.vscode-resource.vscode-cdn.net/home/steven/~/aiida-venv/lib/python3.10/site-packages/aiida_quantumespresso/workflows/pw/bands.py:140) *args, overrides=inputs.get('relax', None), options=options, **kwargs
[141](https://vscode-remote+wsl-002bubuntu.vscode-resource.vscode-cdn.net/home/steven/~/aiida-venv/lib/python3.10/site-packages/aiida_quantumespresso/workflows/pw/bands.py:141) )
[142](https://vscode-remote+wsl-002bubuntu.vscode-resource.vscode-cdn.net/home/steven/~/aiida-venv/lib/python3.10/site-packages/aiida_quantumespresso/workflows/pw/bands.py:142) scf = PwBaseWorkChain.get_builder_from_protocol(
[143](https://vscode-remote+wsl-002bubuntu.vscode-resource.vscode-cdn.net/home/steven/~/aiida-venv/lib/python3.10/site-packages/aiida_quantumespresso/workflows/pw/bands.py:143) *args, overrides=inputs.get('scf', None), options=options, **kwargs
[144](https://vscode-remote+wsl-002bubuntu.vscode-resource.vscode-cdn.net/home/steven/~/aiida-venv/lib/python3.10/site-packages/aiida_quantumespresso/workflows/pw/bands.py:144) )
[145](https://vscode-remote+wsl-002bubuntu.vscode-resource.vscode-cdn.net/home/steven/~/aiida-venv/lib/python3.10/site-packages/aiida_quantumespresso/workflows/pw/bands.py:145) bands = PwBaseWorkChain.get_builder_from_protocol(
[146](https://vscode-remote+wsl-002bubuntu.vscode-resource.vscode-cdn.net/home/steven/~/aiida-venv/lib/python3.10/site-packages/aiida_quantumespresso/workflows/pw/bands.py:146) *args, overrides=inputs.get('bands', None), options=options, **kwargs
[147](https://vscode-remote+wsl-002bubuntu.vscode-resource.vscode-cdn.net/home/steven/~/aiida-venv/lib/python3.10/site-packages/aiida_quantumespresso/workflows/pw/bands.py:147) )
File [~/aiida-venv/lib/python3.10/site-packages/aiida_quantumespresso/workflows/pw/relax.py:160](https://vscode-remote+wsl-002bubuntu.vscode-resource.vscode-cdn.net/home/steven/~/aiida-venv/lib/python3.10/site-packages/aiida_quantumespresso/workflows/pw/relax.py:160), in PwRelaxWorkChain.get_builder_from_protocol(cls, code, structure, protocol, overrides, relax_type, options, **kwargs)
[157](https://vscode-remote+wsl-002bubuntu.vscode-resource.vscode-cdn.net/home/steven/~/aiida-venv/lib/python3.10/site-packages/aiida_quantumespresso/workflows/pw/relax.py:157) raise ValueError(f'Structures with periodic boundary conditions `{structure.pbc}` are not supported.')
[159](https://vscode-remote+wsl-002bubuntu.vscode-resource.vscode-cdn.net/home/steven/~/aiida-venv/lib/python3.10/site-packages/aiida_quantumespresso/workflows/pw/relax.py:159) builder = cls.get_builder()
--> [160](https://vscode-remote+wsl-002bubuntu.vscode-resource.vscode-cdn.net/home/steven/~/aiida-venv/lib/python3.10/site-packages/aiida_quantumespresso/workflows/pw/relax.py:160) builder.base = base
[161](https://vscode-remote+wsl-002bubuntu.vscode-resource.vscode-cdn.net/home/steven/~/aiida-venv/lib/python3.10/site-packages/aiida_quantumespresso/workflows/pw/relax.py:161) builder.base_final_scf = base_final_scf
[162](https://vscode-remote+wsl-002bubuntu.vscode-resource.vscode-cdn.net/home/steven/~/aiida-venv/lib/python3.10/site-packages/aiida_quantumespresso/workflows/pw/relax.py:162) builder.structure = structure
File [~/aiida-venv/lib/python3.10/site-packages/aiida/engine/processes/builder.py:120](https://vscode-remote+wsl-002bubuntu.vscode-resource.vscode-cdn.net/home/steven/~/aiida-venv/lib/python3.10/site-packages/aiida/engine/processes/builder.py:120), in ProcessBuilderNamespace.__setattr__(self, attr, value)
[118](https://vscode-remote+wsl-002bubuntu.vscode-resource.vscode-cdn.net/home/steven/~/aiida-venv/lib/python3.10/site-packages/aiida/engine/processes/builder.py:118) validation_error = port.validate(value) # type: ignore[union-attr]
[119](https://vscode-remote+wsl-002bubuntu.vscode-resource.vscode-cdn.net/home/steven/~/aiida-venv/lib/python3.10/site-packages/aiida/engine/processes/builder.py:119) if validation_error:
--> [120](https://vscode-remote+wsl-002bubuntu.vscode-resource.vscode-cdn.net/home/steven/~/aiida-venv/lib/python3.10/site-packages/aiida/engine/processes/builder.py:120) raise ValueError(f'invalid attribute value {validation_error.message}')
[122](https://vscode-remote+wsl-002bubuntu.vscode-resource.vscode-cdn.net/home/steven/~/aiida-venv/lib/python3.10/site-packages/aiida/engine/processes/builder.py:122) # If the attribute that is being set corresponds to a port that is a ``PortNamespace`` we need to make sure
[123](https://vscode-remote+wsl-002bubuntu.vscode-resource.vscode-cdn.net/home/steven/~/aiida-venv/lib/python3.10/site-packages/aiida/engine/processes/builder.py:123) # that the nested value remains a ``ProcessBuilderNamespace``. Otherwise, the nested namespaces will become
[124](https://vscode-remote+wsl-002bubuntu.vscode-resource.vscode-cdn.net/home/steven/~/aiida-venv/lib/python3.10/site-packages/aiida/engine/processes/builder.py:124) # plain dictionaries and no longer have the properties of the ``ProcessBuilderNamespace`` that provide all
[125](https://vscode-remote+wsl-002bubuntu.vscode-resource.vscode-cdn.net/home/steven/~/aiida-venv/lib/python3.10/site-packages/aiida/engine/processes/builder.py:125) # the autocompletion and validation when values are being set. Therefore we first construct a new instance
[126](https://vscode-remote+wsl-002bubuntu.vscode-resource.vscode-cdn.net/home/steven/~/aiida-venv/lib/python3.10/site-packages/aiida/engine/processes/builder.py:126) # of a ``ProcessBuilderNamespace`` for the port of the attribute that is being set and than iteratively set
[127](https://vscode-remote+wsl-002bubuntu.vscode-resource.vscode-cdn.net/home/steven/~/aiida-venv/lib/python3.10/site-packages/aiida/engine/processes/builder.py:127) # all the values within the mapping that is being assigned to the attribute.
[128](https://vscode-remote+wsl-002bubuntu.vscode-resource.vscode-cdn.net/home/steven/~/aiida-venv/lib/python3.10/site-packages/aiida/engine/processes/builder.py:128) if isinstance(port, PortNamespace):
ValueError: invalid attribute value input `metadata.options.resources` is not valid for the `SgeScheduler` scheduler: these parameters were not recognized: num_machines```
Thanks for the additional details, @stevenbos123! You're right, the problem comes from the default protocols, that specify num_machines
.
As a first fix, to get your jobs running, you could simply comment out/remove the following two lines:
https://github.com/aiidateam/aiida-quantumespresso/blob/d645069d1d6ac40d6a23e4bbf49b01b51f5bb33c/src/aiida_quantumespresso/workflows/protocols/pw/base.yaml#L13-L14
If you provide the options
as discussed before, this error shouldn't occur anymore.
Important to note: Apparently, the current protocols are designed to work with Slurm out of the box, so you wouldn't necessarily need to provide the options. If you comment out the lines above, you would also always need to provide the corresponding slurm resources via the options (in case you want to run a slurm job).
Another option would be to define scheduler dependent protocols, e.g. default_sge
and default_slurm
. To add the new protocols, you would need to further adjust the protocol file:
# Additional lines
...
protocols:
moderate:
description: 'Protocol to perform the computation at normal precision at moderate computational cost.'
default_slurm:
description: 'Default moderate protocol for slurm.'
pw:
metadata:
options:
resources:
num_machines: 1
default_sge:
description: 'Default moderate protocol for sge.'
pw:
metadata:
options:
resources:
parallel_env: mpi
tot_num_mpiprocs: 64
precise:
...
# Additional lines
You would also need to add the new protocols to the protocol files for relax
and bands
. In that case, it would be sufficient to add the protocol name and the description, as the resources are only relevant in the base
file.
I couldn't test the SGE
protocol, but this would be the idea. This way, you should be able to simply call
builder = PwBandsWorkChain.get_builder_from_protocol(
code=code,
structure=node.inputs.structure,
protocol='default_sge'
)
even without specifying the options (as I moved them to the protocol in this example).
Again, the first option should already work, by just removing the corresponding lines and providing the options
to the get_builder_from_protocol
method.
@mbercx Without thinking too much about it, might it make sense to introduce some logic in the get_builder_from_protocol
, that chooses the default resources based on the scheduler of the Code
and corresponding Computer
? Otherwise, one would need to have these kind of scheduler dependent protocols, if the user should be able to use these kind of protocols, no matter which scheduler is used.
Maybe I'm also overlooking something.
@stevenbos123 I realized that I made a small mistake in the first version of my comment, I just edited it. To make it easier, I adjusted the protocols on a local branch, in case you would like to use the second option with the new protocols. You can check it out here:
https://github.com/t-reents/aiida-quantumespresso/tree/fix/sge_protocol
If you clone it, you can change to the directory (remember to git checkout
the correct branch) and install it via pip install -e .
in your virtual environment.
Thanks for the analysis @t-reents . I would personally be hesitant to put scheduler based logic in the protocols or even make scheduler specific protocols. You would have to start to do this everywhere and what happens if there is yet another scheduler plugin that you didn't cover yet.
If the scheduler would implement some method to provide default resources that the workflow/protocol could just call blindly, that could work. But I don't think that exists yet in the scheduler interface.
It is true that the current implementation is already scheduler specific due to our own SLURM background, but I think we should rather remove that and make it completely agnostic than start adding more scheduler specific logic. The protocols automate almost everything already. I think it is more than fair to ask the user to think at least about the resources they want to use and specify those. As long as that is properly documented of course.
Thank you for joining @sphuber.
I would personally be hesitant to put scheduler based logic in the protocols or even make scheduler specific protocols
Yes, this is definitely true. The examples were mainly intended to show how @stevenbos123 could adapt the protocols, in case he prefers that, and to get his setup running. In general, I would also not vote for creating scheduler-specific protocols as one would also need to duplicate them for the different "levels", e.g. precision and fast.
If the scheduler would implement some method to provide default resources that the workflow/protocol could just call blindly, that could work. But I don't think that exists yet in the scheduler interface.
This is what I was thinking about in my comment to Marnik, but I also don't think that it'd be worth to implement this/make things more complicated, for the single purpose of providing initial resources.
As long as that is properly documented of course.
I think this is the most important point of this issue. So far, I've also expected that the protocols work just out of the box, and I've never thought about any scheduler dependency (as I think it's not explicitly mentioned anywhere, especially for new users).
So to sum up, we could think of these two (simple) options, right?
metadata.options.resource
from the protocols and probably make the options
argument in get_builder_from_protocol
non-default, right?Fully agree with the two options. Since they are not really mutually exclusive, we could already start with the first and simply improve the docs. We can then add a deprecation warning id no explicit resources are.provided saying this will start to error in the future. And then at some point we can remove the resource defaults to make the protocols scheduler agnostic
Great! If you don't mind, I can open a PR in the upcoming days.
Hi! Any update on this? We are also having this problem, I commented out the offending line in the protocol, but I wanted to know if there is a more long term solution in the pipeline.
Hi @JPchico! I just continued my work on the corresponding PR introducing the warnings, sorry for the delay! I guess we will wait for some time after merging it and then remove the Slurm specific line in a future release, as indicated by @sphuber above.
The PwBandsWorkchain.get_builder_from_protocol() does not support SGE scheduler. This makes it cumbersome to do calculations SGE based machines. See error message below. This happens when providing a code that is run on a SGE computer.