ArjunaCluster / ArjunaUsers

Arjuna Public Documentation for Users
https://arjunacluster.github.io/ArjunaUsers/

Performance Issue: gpaw parallel running issue #59

Closed kianpu34593 closed 3 years ago

kianpu34593 commented 3 years ago

If you notice a performance issue on arjuna, please first search the existing issues and ensure that it has not been reported. If you notice a similar example, please comment on that issue.

Please provide the following information to help us help you:

Basic Info

Your Name: Jiankun Pu
Your Andrew ID: jiankunp

If you are not a frequent github user, please also provide us with a contact email here: Contact Email:

Where it happened

Job IDs: 1465
Node(s) on which the problem occurred: c004

What Happened

Expected Behavior: the job runs to completion.
Observed Behavior: the job hangs (stuck but not failed).

Log Files

Location of Log File Showing the Error: /home/jiankunp/projects/gpaw_install_benchmark/spack_2nd_install_openmpi/parrallel_test/gpu_multi_core_1465.err
Location of Script Showing Minimum Working Example: /home/jiankunp/projects/gpaw_install_benchmark/spack_2nd_install_openmpi/parrallel_test/gpu_multi_core.sh

Please also attach any logs and the submission script to this issue.

What I've Tried

Please List what you've tried to debug the issue. Please include commands and resulting output.

1) I searched and found https://github.com/open-mpi/ompi/issues/5798, and following it I added --mca shmem posix to mpiexec. The job now runs instead of hanging, but it is very slow.
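For context, the workaround above could be applied in a Slurm submission script roughly like this. This is a minimal sketch, not the user's actual gpu_multi_core.sh: the job name, core count, and time limit are placeholder assumptions, and script.py stands in for the real GPAW input script.

```shell
#!/bin/bash
#SBATCH --job-name=gpaw-parallel   # hypothetical job name
#SBATCH --nodes=1
#SBATCH --ntasks=16                # hypothetical core count
#SBATCH --time=01:00:00            # hypothetical time limit

# Workaround from open-mpi/ompi#5798: force Open MPI's POSIX
# shared-memory backend so mpiexec does not hang at startup.
mpiexec --mca shmem posix -n "$SLURM_NTASKS" gpaw python script.py
```

As the comment notes, this gets the job running but with poor performance, which motivated the reinstall described below.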

If you do not have any of the above, please explain why and submit the issue anyway; however, the more information you give us, the better we can help you.

kianpu34593 commented 3 years ago

The problem can now be solved by installing GPAW against a Slurm-aware OpenMPI:

spack install py-gpaw ^openmpi schedulers=slurm pmi=true

Instead of running with mpiexec/mpirun, the installed OpenMPI only supports launching through srun. Therefore, change the run command to:

srun -n [number of cores] gpaw python script.py
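Put together, a submission script for the srun-based approach might look like the sketch below. The job name, core count, and time limit are placeholder assumptions, and script.py stands in for the real GPAW input script.

```shell
#!/bin/bash
#SBATCH --job-name=gpaw-srun       # hypothetical job name
#SBATCH --nodes=1
#SBATCH --ntasks=16                # hypothetical core count
#SBATCH --time=01:00:00            # hypothetical time limit

# With OpenMPI built via "schedulers=slurm pmi=true", launch through
# srun rather than mpiexec/mpirun so ranks are wired up via Slurm's PMI.
srun -n "$SLURM_NTASKS" gpaw python script.py
```

Launching through srun lets Slurm handle process placement and PMI wire-up directly, avoiding the shared-memory startup hang that required the --mca shmem posix workaround.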