NOAA-OWP / ngen

Next Generation Water Modeling Engine and Framework Prototype
Other
84 stars 63 forks source link

Add MPI_Finalize "detouring" mechanism #765

Open program-- opened 8 months ago

program-- commented 8 months ago

This PR resolves #748 by implementing a shim library that overrides MPI_Finalize calls to become no-op. Additionally, calls to MPI_Finalize from NGen code are modified to PMPI_Finalize to call the MPI library's actual implementation.

Adding everyone as reviewers for visibility on this change.

Additions

Changes

TODO

Checklist

Target Environment support

PhilMiller commented 8 months ago

Something needs to say target_link_library(ngen MPIDetour), too

hellkite500 commented 8 months ago

Ubuntu linux tests failing due to a timeout running the tests. Downloading the logs and poking around, I see a LOT of these DEADLYSIGNAL messages comfing from running the test_bmi_c target

2024-03-15T22:38:01.4674964Z [100%] Linking CXX executable test_bmi_c
2024-03-15T22:38:02.1738113Z AddressSanitizer:DEADLYSIGNAL

and eventually the test times out

2024-03-15T22:38:17.3139827Z AddressSanitizer:DEADLYSIGNAL
2024-03-15T22:38:17.3140454Z CMake Error at /usr/local/share/cmake-3.28/Modules/GoogleTestAddTests.cmake:112 (message):
2024-03-15T22:38:17.3140582Z   Error running test executable.
2024-03-15T22:38:17.3140590Z 
2024-03-15T22:38:17.3140878Z     Path: '/home/runner/work/ngen/ngen/cmake_build/test/test_bmi_c'
2024-03-15T22:38:17.3141033Z     Result: Process terminated due to timeout
2024-03-15T22:38:17.3141126Z     Output:
2024-03-15T22:38:17.3141215Z       
2024-03-15T22:38:17.3141230Z 
2024-03-15T22:38:17.3141350Z Call Stack (most recent call first):
2024-03-15T22:38:17.3141791Z   /usr/local/share/cmake-3.28/Modules/GoogleTestAddTests.cmake:225 (gtest_discover_tests_impl)
2024-03-15T22:38:17.3141803Z 
2024-03-15T22:38:17.3141807Z 
2024-03-15T22:38:17.3142098Z gmake[3]: *** [test/CMakeFiles/test_bmi_c.dir/build.make:133: test/test_bmi_c] Error 1
2024-03-15T22:38:17.3142289Z gmake[3]: *** Deleting file 'test/test_bmi_c'
2024-03-15T22:38:17.3142567Z gmake[2]: *** [CMakeFiles/Makefile2:2163: test/CMakeFiles/test_bmi_c.dir/all] Error 2
2024-03-15T22:38:17.3142850Z gmake[1]: *** [CMakeFiles/Makefile2:2170: test/CMakeFiles/test_bmi_c.dir/rule] Error 2
2024-03-15T22:38:17.3142986Z gmake: *** [Makefile:921: test_bmi_c] Error 2
2024-03-15T22:38:17.3154227Z ##[error]Process completed with exit code 2.
2024-03-15T22:38:17.3379867Z Post job cleanup.
program-- commented 8 months ago

Ubuntu linux tests failing due to a timeout running the tests. Downloading the logs and poking around, I see a LOT of these DEDLYSIGNAL messages comfing from running the test_bmi_c target ...

@hellkite500 Yeah this is happening across PR's and the master branch, see: https://github.com/NOAA-OWP/ngen/pull/750#issuecomment-1998828125. Not sure why this is happening, but it's not reproducible locally for neither Phil nor I

hellkite500 commented 8 months ago

Ah, I'm still catching up on a few things. Just wasn't sure if this was related to these changes or not...I looked at a few recent PR's and didn't see the failures. Might try getting an Ubuntu image spun up to test on.

PhilMiller commented 7 months ago

Converting back to draft, since we may not need this for the time being if ESMF PR https://github.com/esmf-org/esmf/pull/234 gets merged and released quickly. It could be useful for other uncooperative libraries that insist on acting like they own MPI, though.

donaldwj commented 7 months ago

Where is the PMPI_Finalize coming from? I can not find documentation on It, I am guessing that it is directly called by the MPI library provided MPI_Finalize. Also is using this compatible across different MPI implementations? Ok this is the from section 15 of the MPI standard, documentation on this is not something easy to find. But It looks like this should be fine on all MPI implementations that meet the standard.