eth-cscs / pyfr

pyfr@cscs (https://github.com/vincentlab/PyFR)
Unrelated (?) other job failures #9

Open jgphpc opened 8 years ago

jgphpc commented 8 years ago

All logs are in /project/csstaff/inputs/pyfr/d/

job451657 (V14) / 24 Aug.

#SBATCH --nodes=1024
#SBATCH --ntasks=1024
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=16
#SBATCH --ntasks-per-core=2
===> *** glibc detected *** exe: double free or corruption (!prev): 0x00002aaaf2e05cb0 ***
cudaMemcpyHostToDevice Async error :: ret code: an illegal memory access was encountered
*** glibc detected *** exe: double free or corruption (!prev):

=> nodes (3471 and 5137) removed from queue.
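
As a quick sanity check on the job header above, the following minimal sketch parses the `#SBATCH` directives and confirms that the task count matches nodes times tasks per node. The helper names `parse_sbatch` and `tasks_consistent` are hypothetical and not part of PyFR or Slurm; only the directive values come from the log.

```python
import re

# Matches lines of the form "#SBATCH --key=value".
SBATCH_RE = re.compile(r"^#SBATCH\s+--([\w-]+)=(\S+)")


def parse_sbatch(script_text):
    """Return a dict of the long-form #SBATCH options found in the text."""
    opts = {}
    for line in script_text.splitlines():
        match = SBATCH_RE.match(line.strip())
        if match:
            opts[match.group(1)] = match.group(2)
    return opts


def tasks_consistent(opts):
    """Check that ntasks equals nodes * ntasks-per-node."""
    nodes = int(opts["nodes"])
    per_node = int(opts["ntasks-per-node"])
    return int(opts["ntasks"]) == nodes * per_node


# Header copied from job451657 in this issue.
HEADER = """\
#SBATCH --nodes=1024
#SBATCH --ntasks=1024
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=16
#SBATCH --ntasks-per-core=2
"""

opts = parse_sbatch(HEADER)
print(opts)
print("task layout consistent:", tasks_consistent(opts))
```

The header is internally consistent (one rank per node across 1024 nodes), so the request itself does not look like the source of the failure.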

job451689 (V3) / 24 Aug.

cudaMemcpyHostToDevice Async error :: ret code: an illegal memory access was encountered
*** glibc detected *** exe: double free or corruption (!prev):

=> nodes (3471 and 5137) removed from queue.

job451660 (V12) / 24 Aug.

cudaMemcpyHostToDevice Async error :: ret code: an illegal memory access was encountered
*** glibc detected *** exe: double free or corruption (!prev):

=> nodes (3471 and 5137) removed from queue.

job451671 (V16) / 24 Aug.

cudaMemcpyHostToDevice Async error :: ret code: an illegal memory access was encountered
*** glibc detected *** exe: double free or corruption (!prev): 0x00002aab5b217ee0

=> nodes (3471 and 5137) removed from queue.

iyer-arvind commented 8 years ago

Not sure. I have never seen this in my jobs.

vkarak commented 8 years ago

jobid 489921 (1 Sept 22h)

cudaMemcpyHostToDevice Async error :: ret code: an illegal memory access was encountered

jgphpc commented 8 years ago

job492166 (2 Sept)

cudaMemcpyHostToDevice Async error :: ret code: an illegal memory access was encountered

jgphpc commented 8 years ago

job489880 (2 Sept)

srun: error: task 937 launch failed: Error configuring interconnect
Fri Sep  2 08:31:17 2016: [PE_767]:inet_connect:inet_connect: connect failed after 301 attempts
Fri Sep  2 08:31:17 2016: [PE_767]:_pmi_inet_setup:inet_connect failed
Fri Sep  2 08:31:17 2016: [PE_767]:_pmi_init:_pmi_inet_setup (full) returned -1

jgphpc commented 8 years ago

job494937 (5 Sept)

cudaMemcpyHostToDevice Async error :: ret code: an illegal memory access was encountered
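
To tell the recurring failure modes in this thread apart across many job logs, a small classifier along these lines may help. It is only a sketch: the patterns are copied from the excerpts above, while the labels and the `classify` helper are hypothetical names introduced for illustration.

```python
import re
from collections import Counter

# Patterns taken from the log excerpts in this issue; the labels are made up.
SIGNATURES = {
    "cuda_illegal_access": re.compile(r"an illegal memory access was encountered"),
    "glibc_double_free": re.compile(r"double free or corruption"),
    "interconnect_setup": re.compile(r"Error configuring interconnect"),
}


def classify(log_text):
    """Count how many lines of a log match each known failure signature."""
    counts = Counter()
    for line in log_text.splitlines():
        for label, pattern in SIGNATURES.items():
            if pattern.search(line):
                counts[label] += 1
    return counts


# Sample lines copied from the jobs reported above.
SAMPLE = """\
cudaMemcpyHostToDevice Async error :: ret code: an illegal memory access was encountered
*** glibc detected *** exe: double free or corruption (!prev):
srun: error: task 937 launch failed: Error configuring interconnect
"""

for label, count in sorted(classify(SAMPLE).items()):
    print(label, count)
```

Running `classify` over each file in the log directory would separate the CUDA and glibc failures from the interconnect setup failure of job489880, which looks like a different problem. How the log files are named is not stated in this issue, so the file iteration is left out.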