Closed · venkkris closed this 3 years ago
Please take a look at this issue: https://github.com/ArjunaCluster/ArjunaUsers/issues/59
Agree, this is likely a duplicate. I assume you are using the system MPI?
Mine is conda installed; I believe Kian's is system MPI.
You may need to request disk space in /tmp with `--tmp 10G` or `#SBATCH --tmp=10G`, based on:
It appears as if there is not enough space for /tmp/ompi.d001.1221/pid.35478/1/shared_mem_cuda_pool.d001 (the shared-memory backing file).
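For concreteness, a minimal sketch of how that `--tmp` request might look in a Slurm batch script (the 10G figure is the suggestion above; the rest of the script is illustrative, not the exact failing job):

```bash
#!/bin/bash
#SBATCH -n 32          # total MPI ranks
#SBATCH --tmp=10G      # ask Slurm for at least 10 GB of node-local /tmp

mpiexec -n 32 gpaw python relax.py
```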
```bash
#!/bin/bash
echo "Job started on `hostname` at `date`"
mpiexec -n 32 gpaw python relax.py
echo " "
echo "Job Ended at `date`"
```
Job started on c015 at Sat Sep 18 02:30:05 UTC 2021
It appears as if there is not enough space for /tmp/ompi.c015.1221/pid.2180/1/shared_mem_cuda_pool.c015 (the shared-memory backing file). It is likely that your MPI job will now either abort or experience performance degradation.
Local host: c015 Space Requested: 134217736 B Space Available: 0 B
[c015:02189] create_and_attach: unable to create shared memory BTL coordinating strucure :: size 134217728
A system call failed during sm BTL initialization that should not have. It is likely that your MPI job will now either abort or experience performance degradation.
System call: open(2) Error: No such file or directory (errno 2)
[c015:02180] 30 more processes have sent help message help-mpi-btl-smcuda.txt / sys call fail [c015:02180] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
Try the solution from #59 since this seems to be a duplicate.
Switched from conda-installed gpaw to spack-installed gpaw. Added `spack load py-gpaw` to the job submission script, and ran the script using `srun` instead of `mpirun`. This enables me to run scripts on the compute nodes using py-gpaw, as well as run my own analysis and post-processing scripts on the head node using the conda install.
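A minimal sketch of what that change might look like in the submission script (job name, core count, and script name here are illustrative, borrowed from the earlier example rather than copied from the actual working job):

```bash
#!/bin/bash
#SBATCH -J relax
#SBATCH -n 32

# Use the spack-provided py-gpaw instead of the conda environment
spack load py-gpaw

# Launch through Slurm's srun rather than mpirun/mpiexec;
# srun picks up the task count from the allocation (-n 32)
srun gpaw python relax.py
```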
If you notice a performance issue on arjuna, please first search the existing issues and ensure that it has not been reported. If you notice a similar example, please comment on that issue.
Please provide the following information to help us help you:
Basic Info
Your Name: Venkatesh Krishnamurthy
Your Andrew ID: venkatek
Where it happened
Job IDs: 1983, 1984 (directories: /home/venkatek/test/third, /home/venkatek/test/fourth)
Node(s) on which the problem occurred: c002 (venkvis_gpu), d001
What Happened
Observed Behavior: Output in job.sh.o1984:
Job started on d001 at Wed Sep 15 02:06:17 UTC 2021
It appears as if there is not enough space for /tmp/ompi.d001.1221/pid.35478/1/shared_mem_cuda_pool.d001 (the shared-memory backing file). It is likely that your MPI job will now either abort or experience performance degradation.
Local host: d001 Space Requested: 134217736 B Space Available: 0 B
[d001:35482] create_and_attach: unable to create shared memory BTL coordinating strucure :: size 134217728
A system call failed during sm BTL initialization that should not have. It is likely that your MPI job will now either abort or experience performance degradation.
System call: open(2) Error: No such file or directory (errno 2)
-3.8009621522252894
Job Ended at Wed Sep 15 02:06:28 UTC 2021
The job ran to completion in about 6 seconds, and the expected output (-3.8009621522252894) was printed.
Notes:
Environment: vkGPAW_21.6.0 (active at /home/venkatek/.miniconda3/envs/vkGPAW_21.6.0). No spack installed in /home/venkatek/.
Job submission script:
```bash
#!/bin/bash
#SBATCH -J test                # Job name
#SBATCH -n 4                   # Number of total cores
#SBATCH -N 1                   # Number of nodes
#SBATCH --time=7-00:00         # Runtime in D-HH:MM
#SBATCH -A venkvis             # Partition to submit to #gpu/venkvis
#SBATCH -p cpu                 # gpu,cpu,highmem,debug
#SBATCH --mem-per-cpu=2000     # Memory pool for all cores in MB (see also --mem-per-cpu)
#SBATCH -o job.sh.o%j
#SBATCH --mail-type=END        # Type of email notification- BEGIN,END,ALL,FAIL
#SBATCH --mail-user=venkatek@andrew.cmu.edu

echo "Job started on `hostname` at `date`"
mpiexec -n 2 gpaw python single.py
echo " "
echo "Job Ended at `date`"
```

What I've Tried
Please list what you've tried to debug the issue. Please include commands and the resulting output.
1) Googling didn't help.
2) `gpaw info` yielded:

| Item | Value |
| --- | --- |
| python-3.9.6 | /home/venkatek/.miniconda3/envs/vkGPAW_21.6.0/bin/python |
| gpaw-21.6.0 | /home/venkatek/.miniconda3/envs/vkGPAW_21.6.0/lib/python3.9/site-packages/gpaw/ |
| ase-3.22.0 | /home/venkatek/.miniconda3/envs/vkGPAW_21.6.0/lib/python3.9/site-packages/ase/ |
| numpy-1.20.3 | /home/venkatek/.miniconda3/envs/vkGPAW_21.6.0/lib/python3.9/site-packages/numpy/ |
| scipy-1.6.3 | /home/venkatek/.miniconda3/envs/vkGPAW_21.6.0/lib/python3.9/site-packages/scipy/ |
| libxc-4.3.4 | yes |
| _gpaw | /home/venkatek/.miniconda3/envs/vkGPAW_21.6.0/lib/python3.9/site-packages/_gpaw.cpython-39-x86_64-linux-gnu.so |
| MPI enabled | yes |
| OpenMP enabled | no |
| scalapack | yes |
| Elpa | no |
| FFTW | yes |
| libvdwxc | no |
| PAW-datasets (1) | /home/venkatek/.miniconda3/envs/vkGPAW_21.6.0/share/gpaw |