cornelisnetworks / opa-psm2

hfi_userinit: mmap of status page (dabbad0008030000) failed: Operation not permitted #29

Open adammoody opened 6 years ago

adammoody commented 6 years ago

This is an informational post for other PSM2 users. In Red Hat Enterprise Linux 7.5, we found that some of our MPI executables exit with the following error:

hfi_userinit: mmap of status page (dabbad0008030000) failed: Operation not permitted

This error is thrown from this line of code in the PSM2 library:

https://github.com/intel/opa-psm2/blob/0f9213e7af8d32c291d4657ff4a3279918de1e60/opa/opa_proto.c#L482-L484

We tracked this down to the execute bit being set in the GNU_STACK entry of a binary's ELF program headers. With that bit set, the kernel maps the memory region with both the read and execute bits enabled, rather than just the read bit that PSM2 requests. As described in this post:

https://stackoverflow.com/questions/32730643/why-in-mmap-prot-read-equals-prot-exec

"For what I understand, GNU_STACK program header is designed to tell the kernel that you want some specific properties for the stack, one those properties is a non-executable stack. It appears that if you don't explicitly ask for a non-executable stack, all the ELF sections marked as readable will be executable too. And also all the memory mapping with mmap while have the same behavior."
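
To see whether a given process has picked up the behavior described in the quote, one can look at its personality flags. The following is a minimal sketch, assuming a Linux system that exposes /proc/<pid>/personality as a hexadecimal string and the READ_IMPLIES_EXEC value 0x0400000 from linux/personality.h. The function names are illustrative and not part of PSM2.

```python
# READ_IMPLIES_EXEC bit as defined in <linux/personality.h>.
READ_IMPLIES_EXEC = 0x0400000


def has_read_implies_exec(personality_hex: str) -> bool:
    """Decode the hex string found in /proc/<pid>/personality."""
    return bool(int(personality_hex.strip(), 16) & READ_IMPLIES_EXEC)


def process_has_read_implies_exec(pid="self") -> bool:
    """Read the personality of a process (assumes Linux procfs and
    sufficient permission to read that process's entry)."""
    with open(f"/proc/{pid}/personality") as fh:
        return has_read_implies_exec(fh.read())
```

Calling process_has_read_implies_exec() with the PID of a running MPI rank should tell you whether PROT_READ mappings in that process are being upgraded to include PROT_EXEC.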

One can inspect a binary for this setting using readelf:

readelf --program-headers a.out

We could reproduce this by running a simple MPI program that was compiled with PGI.

For example, a binary built with PGI shows:

readelf --program-headers mpiBench_pgi

GNU_STACK 0x0000000000000000 0x0000000000000000 0x0000000000000000 0x0000000000000000 0x0000000000000000 RWE 10

Whereas a binary built with GNU:

readelf --program-headers mpiBench_gnu

GNU_STACK 0x0000000000000000 0x0000000000000000 0x0000000000000000 0x0000000000000000 0x0000000000000000 RW 10

We found that a workaround is to add "-Wl,-z,noexecstack" during the link step. Alternatively, one can clear this bit in an existing executable with execstack:

execstack -c a.out
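
The readelf check above can also be scripted, for example to scan a directory of binaries. This is a rough sketch that assumes 64-bit little-endian ELF files (as on x86_64) and uses the standard ELF constants PT_GNU_STACK = 0x6474e551 and PF_X/PF_W/PF_R = 1/2/4. The helper names are made up for illustration.

```python
import struct

PT_GNU_STACK = 0x6474E551            # program header type for the stack entry
PF_X, PF_W, PF_R = 0x1, 0x2, 0x4     # ELF segment permission bits


def gnu_stack_flags(image: bytes):
    """Return p_flags of the PT_GNU_STACK header, or None if there is none."""
    if image[:4] != b"\x7fELF":
        raise ValueError("not an ELF image")
    if image[4] != 2 or image[5] != 1:
        raise ValueError("this sketch only handles 64-bit little-endian ELF")
    # ELF64 header: e_phoff at byte 32, e_phentsize and e_phnum at byte 54.
    (e_phoff,) = struct.unpack_from("<Q", image, 32)
    e_phentsize, e_phnum = struct.unpack_from("<HH", image, 54)
    for i in range(e_phnum):
        # Each ELF64 program header starts with p_type, then p_flags.
        p_type, p_flags = struct.unpack_from("<II", image, e_phoff + i * e_phentsize)
        if p_type == PT_GNU_STACK:
            return p_flags
    return None


def flags_to_str(flags: int) -> str:
    """Render flags the way readelf does, e.g. RW or RWE."""
    return "".join(ch for bit, ch in ((PF_R, "R"), (PF_W, "W"), (PF_X, "E")) if flags & bit)


def stack_is_executable(path: str) -> bool:
    """True if the file at path requests an executable stack."""
    with open(path, "rb") as fh:
        flags = gnu_stack_flags(fh.read())
    return flags is not None and bool(flags & PF_X)
```

For a binary like the PGI-built one above, stack_is_executable() would be expected to return True. Note that a binary with no GNU_STACK header at all is reported as False here, even though the kernel may treat a missing header differently.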

adammoody commented 6 years ago

Related commit in IB/hfi driver: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=12220267645cb7d1f3f699218e0098629e932e1f

lee218llnl commented 6 years ago

Note that one can also use the execstack utility to query the executable stack flag of a binary:

bash-4.2$ mpicc test.c 
bash-4.2$ execstack -q a.out
X a.out
bash-4.2$ execstack -c a.out
bash-4.2$ execstack -q a.out
- a.out
bash-4.2$ mpicc -Wl,-z,noexecstack test.c
bash-4.2$ execstack -q a.out
- a.out

jtfrey commented 5 years ago

This does, indeed, mitigate the issue in some cases. However, in one particular case I've encountered:

Recompiling a simple MPI program with -Wl,-z,noexecstack addressed the mapping of the HFI capabilities pages, but the program died with a segmentation fault shortly after worker 0 started:

Program terminated with signal 11, Segmentation fault.
#0  0x00002b2f8d1419d8 in ompi_mtl_psm2_progress () at ./mtl_psm2.c:426
426         completed++;
Missing separate debuginfos, use: debuginfo-install bzip2-libs-1.0.6-13.el7.x86_64 elfutils-libelf-0.168-8.el7.x86_64 elfutils-libs-0.168-8.el7.x86_64 glibc-2.17-196.el7_4.2.x86_64 libattr-2.4.46-12.el7.x86_64 libcap-2.22-9.el7.x86_64 libibverbs-13-7.el7.x86_64 libnl3-3.2.28-4.el7.x86_64 libpsm2-10.3.35-1.x86_64 librdmacm-13-7.el7.x86_64 numactl-libs-2.0.9-6.el7_2.x86_64 systemd-libs-219-42.el7_4.10.x86_64 xz-libs-5.2.2-1.el7.x86_64 zlib-1.2.7-17.el7.x86_64
(gdb) bt
#0  0x00002b2f8d1419d8 in ompi_mtl_psm2_progress () at ./mtl_psm2.c:426
#1  0x00002b2f79ebab05 in opal_progress () at runtime/opal_progress.c:227
#2  0x00002b2f8c90220d in ompi_request_wait_completion () at ../../../../ompi/request/request.h:412
#3  0x00002b2f8c9008cc in mca_pml_cm_recv () at ./pml_cm.h:213
#4  0x00002b2f78356b2a in PMPI_Recv () at ./precv.c:79
#5  0x0000000000401583 in main (argc=2, argv=0x7ffc9a98f3b8) at /home/1001/sw/mpibounce_2/mpibounce.c:100

The segfault did not vary w.r.t. the version of Open MPI under the program: the segfault always occurred at the increment of completed which follows a call to psm2_mq_test2(). Since completed is a local variable (on the stack) the PSM2 library must be doing something to the stack that is in conflict with the treatment of the stack emitted by the PGI compiler (which compiled the code surrounding the call to psm2_mq_test2()).

This made me recall another issue we'd encountered: Gaussian Inc's use of -tp nehalem and the PGI 18 compiler produced code that was numerically unstable on Skylake processors for certain inputs. Switching to -tp haswell seemed to address the issue, indicating that certain Nehalem-era optimizations must no longer be 100% compatible with newer processors. On our Broadwell cluster PGI defaults to using -tp haswell when no explicit option is provided, which is how Open MPI was being built. With Broadwell being the tick (process shrink) that followed Haswell, Portland's expectation must have been that any Haswell optimizations would work on Broadwell: the -tp option skips from haswell to skylake. Perhaps this is NOT the case. To test that theory:

This combination of Portland compiler, Open MPI, and PSM2 does NOT fail to map the HFI capabilities AND does not segfault. This naturally calls into question what level of PGI processor optimization is 100% reliable on a Broadwell system.

LadaF commented 5 years ago

This is hardly a solution. Any program that passes around nested functions needs an executable stack. That's standard Fortran and a GNU extension in C. It is used in very handy techniques. I basically cannot run my program on a cluster that uses PSM2 now.

Note that it happens for GCC compiled programs as well.

weiny2 commented 5 years ago

This restriction on EXEC has been removed in the upstream kernel by the following commit. I'm not sure when specific distros will backport it, but it may be worth asking your distro to do so.

I suggest we close this issue, as it was not a PSM2 library restriction, just me being overly restrictive with security in the kernel. Or this can remain open until all the distros have had a chance to pull the patch.

commit 7709b0dc265f28695487712c45f02bbd1f98415d
Author: Michael J. Ruhl <michael.j.ruhl@intel.com>
Date: Thu Jan 17 12:42:04 2019 -0800

IB/hfi1: Remove overly conservative VM_EXEC flag check

Applications that use the stack for execution purposes cause userspace PSM
jobs to fail during mmap().

Both Fortran (non-standard format parsing) and C (callback functions
located in the stack) applications can be written such that stack
execution is required. The linker notes this via the gnu_stack ELF flag.

This causes READ_IMPLIES_EXEC to be set which forces all PROT_READ mmaps
to have PROT_EXEC for the process.

Checking for VM_EXEC bit and failing the request with EPERM is overly
conservative and will break any PSM application using executable stacks.

Cc: <stable@vger.kernel.org> #v4.14+
Fixes: 12220267645c ("IB/hfi: Protect against writable mmap")
Reviewed-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Reviewed-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Signed-off-by: Michael J. Ruhl <michael.j.ruhl@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
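
The interaction described in the commit message can be illustrated with a toy model. This is not the hfi1 driver's code: the function names and the reject_exec switch are invented for illustration, the PROT_* values are the usual Linux ones, and the rejection of writable mappings is assumed only from the title of the commit named in the Fixes: line.

```python
PROT_READ, PROT_WRITE, PROT_EXEC = 0x1, 0x2, 0x4   # usual Linux values
EPERM = 1


def effective_prot(requested: int, read_implies_exec: bool) -> int:
    """Model READ_IMPLIES_EXEC: a readable mapping also becomes executable."""
    if read_implies_exec and requested & PROT_READ:
        return requested | PROT_EXEC
    return requested


def status_page_mmap(requested: int, read_implies_exec: bool, reject_exec: bool) -> int:
    """Toy stand-in for the status page check (hypothetical, not driver code).

    reject_exec=True models the behavior before the commit above,
    reject_exec=False models the behavior after it.
    Returns 0 on success or -EPERM on refusal.
    """
    prot = effective_prot(requested, read_implies_exec)
    if prot & PROT_WRITE:
        return -EPERM      # writable mappings stay refused (assumption)
    if reject_exec and prot & PROT_EXEC:
        return -EPERM      # the overly conservative check being removed
    return 0
```

In this model a plain PROT_READ request from a process with an executable stack is refused while reject_exec is on, even though the caller never asked for PROT_EXEC, which matches the "Operation not permitted" failure reported at the top of this issue.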

LadaF commented 5 years ago

@weiny2 Is it a commit to the Linux kernel? I may try to persuade the admin to apply it.

weiny2 commented 5 years ago

"@weiny2 Is it a commit to the Linux kernel? I may try to persuade the admin to apply it."

This is a commit to the HFI1 driver. Our driver is upstream, so yes, that is the commit information for the Linux kernel. I only mention this to make sure you are not running an out-of-tree driver; if you are, you will need to apply the patch to that driver.