ArjunaCluster / ArjunaUsers

Arjuna Public Documentation for Users
https://arjunacluster.github.io/ArjunaUsers/

Performance Issue #60

Closed venkkris closed 3 years ago

venkkris commented 3 years ago

If you notice a performance issue on Arjuna, please first search the existing issues to ensure it has not already been reported. If you find a similar issue, please comment on that issue instead.

Please provide the following information to help us help you:

Basic Info

Your Name: Venkatesh Krishnamurthy
Your Andrew ID: venkatek

Where it happened

Job Ids: 1983, 1984 (directories: /home/venkatek/test/third, /home/venkatek/test/fourth)
Node(s) on which the problem occurred: c002 (venkvis_gpu), d001

What Happened

Observed Behavior: Output in job.sh.o1984:

```
Job started on d001 at Wed Sep 15 02:06:17 UTC 2021

It appears as if there is not enough space for
/tmp/ompi.d001.1221/pid.35478/1/shared_mem_cuda_pool.d001 (the shared-memory
backing file). It is likely that your MPI job will now either abort or
experience performance degradation.

  Local host:       d001
  Space Requested:  134217736 B
  Space Available:  0 B

[d001:35482] create_and_attach: unable to create shared memory BTL coordinating strucure :: size 134217728

A system call failed during sm BTL initialization that should not have. It is
likely that your MPI job will now either abort or experience performance
degradation.

  System call: open(2)
  Error:       No such file or directory (errno 2)

-3.8009621522252894

Job Ended at Wed Sep 15 02:06:28 UTC 2021
```

The job ran to completion in about 6 seconds, and the expected output (-3.8009621522252894) was printed.

Notes:

Environment: vkGPAW_21.6.0 (`/home/venkatek/.miniconda3/envs/vkGPAW_21.6.0`). No spack installed in /home/venkatek/.

Job submission script:

```bash
#!/bin/bash
#SBATCH -J test                  # Job name
#SBATCH -n 4                     # Number of total cores
#SBATCH -N 1                     # Number of nodes
#SBATCH --time=7-00:00           # Runtime in D-HH:MM
#SBATCH -A venkvis               # Partition to submit to #gpu/venkvis
#SBATCH -p cpu                   # gpu,cpu,highmem,debug
#SBATCH --mem-per-cpu=2000       # Memory pool for all cores in MB (see also --mem-per-cpu)
#SBATCH -o job.sh.o%j
#SBATCH --mail-type=END          # Type of email notification- BEGIN,END,ALL,FAIL
#SBATCH --mail-user=venkatek@andrew.cmu.edu

echo "Job started on $(hostname) at $(date)"

mpiexec -n 2 gpaw python single.py

echo " "
echo "Job Ended at $(date)"
```

What I've Tried

Please list what you've tried to debug the issue, including commands and the resulting output.

1) Googling didn't help

2) `gpaw info` yielded:

```
| python-3.9.6      /home/venkatek/.miniconda3/envs/vkGPAW_21.6.0/bin/python |
| gpaw-21.6.0       /home/venkatek/.miniconda3/envs/vkGPAW_21.6.0/lib/python3.9/site-packages/gpaw/ |
| ase-3.22.0        /home/venkatek/.miniconda3/envs/vkGPAW_21.6.0/lib/python3.9/site-packages/ase/ |
| numpy-1.20.3      /home/venkatek/.miniconda3/envs/vkGPAW_21.6.0/lib/python3.9/site-packages/numpy/ |
| scipy-1.6.3       /home/venkatek/.miniconda3/envs/vkGPAW_21.6.0/lib/python3.9/site-packages/scipy/ |
| libxc-4.3.4       yes |
| _gpaw             /home/venkatek/.miniconda3/envs/vkGPAW_21.6.0/lib/python3.9/site-packages/_gpaw.cpython-39-x86_64-linux-gnu.so |
| MPI enabled       yes |
| OpenMP enabled    no |
| scalapack         yes |
| Elpa              no |
| FFTW              yes |
| libvdwxc          no |
| PAW-datasets (1)  /home/venkatek/.miniconda3/envs/vkGPAW_21.6.0/share/gpaw |
```

If you do not have any of the above, please explain why and submit the issue anyway; however, the more information you give us, the better we can help you.

kianpu34593 commented 3 years ago

Please take a look at this issue: https://github.com/ArjunaCluster/ArjunaUsers/issues/59

aabills commented 3 years ago

Agree, this is likely a duplicate. I assume you are using the system MPI?

venkkris commented 3 years ago

Mine is conda-installed; I believe Kian's is the system MPI.
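
For reference, one quick way to check which MPI a given environment actually resolves to (a minimal sketch; the env name is the one from above and the expected paths are only illustrative):

```bash
# With the conda environment active, see which launcher is first on PATH
conda activate vkGPAW_21.6.0
which mpiexec        # a path under ~/.miniconda3/envs/... means the conda MPI is being used
mpiexec --version    # prints the MPI implementation and version
```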

awadell1 commented 3 years ago

You may need to request disk space in /tmp, either with `--tmp 10G` on the sbatch command line or with `#SBATCH --tmp=10G` in the submission script, based on:

> It appears as if there is not enough space for /tmp/ompi.d001.1221/pid.35478/1/shared_mem_cuda_pool.d001 (the shared-memory backing
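
Concretely, something like the following (a sketch; `job.sh` stands in for your submission script and 10G is just an example size):

```bash
# In the submission script: ask Slurm for at least 10 GB of node-local /tmp
#SBATCH --tmp=10G

# Or equivalently at submit time, on the command line
sbatch --tmp=10G job.sh

# Inside a running job, the space actually available can be checked with
df -h /tmp
```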

venkkris commented 3 years ago

Added that line to the job submission script:

```bash
#!/bin/bash
#SBATCH -J 1disl_0               # Job name
#SBATCH -n 32                    # Number of total cores
#SBATCH -N 1                     # Number of nodes
#SBATCH --time=7-00:00           # Runtime in D-HH:MM
#SBATCH -A venkvis_gpu           # Partition to submit to #gpu/venkvis
#SBATCH -p gpu                   # gpu,cpu,highmem,debug
#SBATCH --mem-per-cpu=2000       # Memory pool for all cores in MB (see also --mem-per-cpu)
#SBATCH -o job.sh.o%j
#SBATCH --mail-type=END          # Type of email notification- BEGIN,END,ALL,FAIL
#SBATCH --mail-user=venkatek@andrew.cmu.edu
#SBATCH --tmp=10G

echo "Job started on $(hostname) at $(date)"

mpiexec -n 32 gpaw python relax.py

echo " "
echo "Job Ended at $(date)"
```

Output (job.sh.o6375):

```
Job started on c015 at Sat Sep 18 02:30:05 UTC 2021

It appears as if there is not enough space for
/tmp/ompi.c015.1221/pid.2180/1/shared_mem_cuda_pool.c015 (the shared-memory
backing file). It is likely that your MPI job will now either abort or
experience performance degradation.

  Local host:       c015
  Space Requested:  134217736 B
  Space Available:  0 B

[c015:02189] create_and_attach: unable to create shared memory BTL coordinating strucure :: size 134217728

A system call failed during sm BTL initialization that should not have. It is
likely that your MPI job will now either abort or experience performance
degradation.

  System call: open(2)
  Error:       No such file or directory (errno 2)

[c015:02180] 30 more processes have sent help message help-mpi-btl-smcuda.txt / sys call fail
[c015:02180] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
```

Directory: /home/venkatek/dislocation/010/1_unit/5

aabills commented 3 years ago

Try the solution from #59, since this seems to be a duplicate.

venkkris commented 3 years ago

Switched from the conda-installed gpaw to the spack-installed gpaw: added `spack load py-gpaw` to the job submission script and ran the script using `srun` instead of `mpirun`. This lets me run scripts on the slave nodes using py-gpaw, while still running my own analysis and post-processing scripts on the head node using the conda install.
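
For anyone landing here with the same problem, a minimal sketch of what the submission script looks like after this change (the `#SBATCH` header is taken from the script earlier in this thread; the exact `srun` invocation is my reconstruction and simply inherits the task count from `#SBATCH -n`):

```bash
#!/bin/bash
#SBATCH -J 1disl_0
#SBATCH -n 32
#SBATCH -N 1
#SBATCH -A venkvis_gpu
#SBATCH -p gpu
#SBATCH --mem-per-cpu=2000
#SBATCH --tmp=10G
#SBATCH -o job.sh.o%j

# Use the spack-installed GPAW instead of the conda environment
spack load py-gpaw

echo "Job started on $(hostname) at $(date)"

# Launch through Slurm's srun rather than mpiexec; srun picks up -n 32 from the header
srun gpaw python relax.py

echo "Job Ended at $(date)"
```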