cornelisnetworks / opa-psm2


traps: IMB-EXT[3803] trap invalid opcode ip:7f0b34d7d5f9 sp:7fffec3ce350 error:0 in libpsm2.so.2.1[7f0b34d5e000+62000] #40

Closed ghost closed 4 years ago

ghost commented 5 years ago

libpsm2.so.2.1 was compiled with AVX2 instructions enabled. When we run an MPI test on an older machine that lacks AVX2, the test program is terminated by signal 4 (SIGILL, illegal instruction).


[root@rdma05 ~]# /usr/lib64/openmpi3/bin/mpirun --allow-run-as-root -mca  mtl psm2 -np 2 -H rdma05,rdma06  /opt/mpi-benchmarks/IMB-EXT 
--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that process rank 1 with PID 33336 on node rdma06 exited on signal 4 (Illegal instruction).
--------------------------------------------------------------------------

[root@rdma05 ~]# gdb -q /opt/mpi-benchmarks/IMB-EXT core.25245
Reading symbols from /opt/mpi-benchmarks/IMB-EXT...(no debugging symbols found)...done.
[New LWP 25245]
[New LWP 25246]
[New LWP 25247]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
Core was generated by `/opt/mpi-benchmarks/IMB-EXT'.
Program terminated with signal 4, Illegal instruction.
#0  0x00007fbf559d55f9 in psmi_mq_req_init (mq=mq@entry=0xa97950) at /usr/src/debug/libpsm2-11.2.78/psm_mq_utils.c:142
142         struct psmi_rlimit_mpool rlim = MQ_SENDREQ_LIMITS;
Missing separate debuginfos, use: debuginfo-install hwloc-libs-1.11.8-4.el7.x86_64 libgcc-4.8.5-39.el7.x86_64 libibumad-22.1-3.el7.x86_64 libibverbs-22.1-3.el7.x86_64 libnl3-3.2.28-4.el7.x86_64 librdmacm-22.1-3.el7.x86_64 libstdc++-4.8.5-39.el7.x86_64 libtool-ltdl-2.4.2-22.el7_3.x86_64 openmpi3-3.1.3-2.el7.x86_64 opensm-libs-3.3.21-2.el7.x86_64 zlib-1.2.7-18.el7.x86_64
(gdb) bt
#0  0x00007fbf559d55f9 in psmi_mq_req_init (mq=mq@entry=0xa97950) at /usr/src/debug/libpsm2-11.2.78/psm_mq_utils.c:142
#1  0x00007fbf559d4fa9 in psmi_mq_malloc (mqo=mqo@entry=0x7ffcfd9172c0) at /usr/src/debug/libpsm2-11.2.78/psm_mq.c:1571
#2  0x00007fbf559c88e9 in __psm2_ep_open (unique_job_key=0x7ffcfd917550 "\037j\240\061\226\200\232\200}\025a\210\270\212\305\066\061", 
    opts_i=0x7ffcfd917480, epo=0x7ffcfd917468, epido=0x7ffcfd917478) at /usr/src/debug/libpsm2-11.2.78/psm_ep.c:1044
#3  0x00007fbf45e41fef in ompi_mtl_psm2_module_init () from /usr/lib64/openmpi3/lib/openmpi/mca_mtl_psm2.so
#4  0x00007fbf45e423cf in ompi_mtl_psm2_component_init () from /usr/lib64/openmpi3/lib/openmpi/mca_mtl_psm2.so
#5  0x00007fbf5ae369b3 in ompi_mtl_base_select () from /usr/lib64/openmpi3/lib/libmpi.so.40
#6  0x00007fbf469838dc in mca_pml_cm_component_init () from /usr/lib64/openmpi3/lib/openmpi/mca_pml_cm.so
#7  0x00007fbf5ae3f108 in mca_pml_base_select () from /usr/lib64/openmpi3/lib/libmpi.so.40
#8  0x00007fbf5add4fa9 in ompi_mpi_init () from /usr/lib64/openmpi3/lib/libmpi.so.40
#9  0x00007fbf5adff595 in PMPI_Init_thread () from /usr/lib64/openmpi3/lib/libmpi.so.40
#10 0x000000000040c095 in main ()
(gdb) disassemble 0x00007fbf559d55f9

   0x00007fbf559d55eb <+123>:   movl   $0x1,0x60(%rsp)
   0x00007fbf559d55f3 <+131>:   vmovdqu %xmm0,0x64(%rsp)
=> 0x00007fbf559d55f9 <+137>:   vextracti128 $0x1,%ymm0,0x74(%rsp)
   0x00007fbf559d5601 <+145>:   vzeroupper 
   0x00007fbf559d5604 <+148>:   callq  0x7fbf559ce140 <psmi_parse_mpool_env>

vextracti128 is an AVX2 instruction, but the old machine supports only the AVX instruction set (avx2 is absent from its CPU flags):

[root@rdma05 ~]# grep avx /proc/cpuinfo | head -n 1
flags       : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm epb intel_ppin ssbd ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid fsgsbase smep erms xsaveopt dtherm ida arat pln pts md_clear spec_ctrl intel_stibp flush_l1d
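The same check can be scripted so that it tests for one exact flag instead of printing the whole flags line. This is a minimal sketch: the helper name `has_cpu_flag` is hypothetical, and it assumes a Linux-style `/proc/cpuinfo` with a `flags` line like the one above.

```shell
# has_cpu_flag FLAG [CPUINFO_FILE]
# Hypothetical helper: exits 0 if FLAG appears as a whole word on the
# first "flags" line of CPUINFO_FILE (default: /proc/cpuinfo).
has_cpu_flag() {
    flag=$1
    cpuinfo=${2:-/proc/cpuinfo}
    # grep -w matches whole words only, so "avx" does not match "avx2".
    grep '^flags' "$cpuinfo" 2>/dev/null | head -n 1 | grep -qw -- "$flag"
}

if has_cpu_flag avx2; then
    echo "avx2: supported"
else
    echo "avx2: not supported; a libpsm2 build that uses AVX2 can trap on this CPU"
fi
```

On the machine above, the `flags` line contains `avx` but not `avx2`, so the second branch would be taken.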

The problem is that the libpsm2 library, as compiled, cannot be used on older machines that lack AVX2 support.

bmyates commented 5 years ago

You can compile the PSM2 library without AVX2 instructions by setting PSM_DISABLE_AVX2 when running make. This will reduce performance, but there's no way around that on older machines.
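
A rough sketch of what that build might look like. The wrapper name `build_psm2_without_avx2` and the checkout path are hypothetical, and the value `1` is an assumption (the comment above only says the variable must be set); check the Makefile in your source tree for how it reads PSM_DISABLE_AVX2.

```shell
# build_psm2_without_avx2 SRC_DIR [extra make args...]
# Hypothetical wrapper: runs make in SRC_DIR with PSM_DISABLE_AVX2 set.
# The value "1" is an assumption, not taken from the PSM2 documentation.
build_psm2_without_avx2() {
    src_dir=$1
    shift
    make -C "$src_dir" PSM_DISABLE_AVX2=1 "$@"
}

# Example (the path is a placeholder for a local opa-psm2 checkout):
#   build_psm2_without_avx2 /path/to/opa-psm2
```

After rebuilding, disassembling psmi_mq_req_init again in gdb, as in the report above, should show whether the vextracti128 instruction is gone.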