aiidateam / aiida-code-registry

Registry of simulation codes and computers for easy setup in AiiDA.

update mpiprocs_per_machine for merlin #84

Closed superstar54 closed 1 year ago

superstar54 commented 1 year ago

Fix https://github.com/aiidalab/aiidalab-qe/issues/413

On Merlin 6, the maximum number of CPUs per node on the CPU cluster is 44. On the GPU cluster, the maximum depends on the node; I set it to 20 here because about 80% of the nodes have a maximum of 20 CPUs.

unkcpz commented 1 year ago

I don't want to say "I told you so". You mentioned that jobs requesting more than 2 CPUs always end up waiting in the queue. Can you check whether that is still true, or whether the machine has been updated so that users can now request more without waiting in the queue?

superstar54 commented 1 year ago

> I don't want to say "I told you so". You mentioned that jobs requesting more than 2 CPUs always end up waiting in the queue. Can you check whether that is still true, or whether the machine has been updated so that users can now request more without waiting in the queue?

Whether a job waits in the queue always depends on the number of users and submitted jobs, so I can't answer this question unless I have a detailed report from merlin6. Note, however, that merlin6 is different from Eiger at CSCS. On Eiger, a node is not shared between jobs, so mpiprocs_per_machine should be set to the maximum number of CPUs. On Merlin, a node is shared by different jobs.

I just checked the queue; here are the latest jobs:

[Screenshot from 2023-06-08 14-18-09]

As you can see, six jobs are running on the same node, merlin-c-207, and I checked that each job is using 2 CPUs. Most jobs running on Merlin use a similarly small number of CPUs.

Back to the configuration file, it has two options:

  1. Set it to 1, which is a safe choice on merlin6.
  2. Set it to the maximum number of CPUs of the machine, which is 44 on the CPU cluster.

Normally, both options are fine, because users can always override the value in their inputs. QeApp, however, uses this value as the maximum number of CPUs the user can request. In this case, we are forced to use option 2.
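
To make the difference between the two cases concrete, here is a minimal Python sketch. It is not the actual AiiDA or QeApp code: the function `resolve_mpiprocs` and its `default_is_cap` flag are hypothetical names used only to illustrate the behavior described above.

```python
from typing import Optional


def resolve_mpiprocs(
    default_mpiprocs: int,
    requested: Optional[int] = None,
    default_is_cap: bool = False,
) -> int:
    """Return the MPI processes per machine a job would end up with.

    Hypothetical helper, for illustration only.
    default_mpiprocs: the configured mpiprocs_per_machine of the computer.
    requested: the value the user asks for, or None to use the default.
    default_is_cap: True if the application treats the configured value
        as an upper limit (the behavior attributed to QeApp above).
    """
    if requested is None:
        return default_mpiprocs
    if requested < 1:
        raise ValueError("requested must be a positive integer")
    if default_is_cap:
        # The configured value limits what the user can ask for.
        return min(requested, default_mpiprocs)
    # Otherwise the user input simply overrides the default.
    return requested


# Option 1 (default of 1): fine when the user can override it ...
print(resolve_mpiprocs(1, requested=8))
# ... but too restrictive when the default is used as a cap.
print(resolve_mpiprocs(1, requested=8, default_is_cap=True))
# Option 2 (default of 44) leaves room for the user's request.
print(resolve_mpiprocs(44, requested=8, default_is_cap=True))
```

With a default of 1 used as a cap, the user could never request more than one CPU, which is why option 2 is needed in that case.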

unkcpz commented 1 year ago

> In this case, we are forced to use option 2.

You mean this value tells the user the maximum number they can use, correct? I think this is what Giovanni proposed previously, and I agree with it. Approved.