LLNL / Aluminum

High-performance, GPU-aware communication library
https://aluminum.readthedocs.io/en/latest/

[1.4.1] Tests crash #211

Closed: yurivict closed this issue 1 year ago

yurivict commented 1 year ago
```
===>  Testing for Aluminum-1.4.1
===>   Aluminum-1.4.1 depends on package: cxxopts>0 - found
-- Configuring done (4.9s)
-- Generating done (0.0s)
-- Build files have been written to: /usr/ports/net/aluminum/work/.build
ninja: no work to do.
[  0% 1/1] cd /usr/ports/net/aluminum/work/.build && /usr/local/bin/ctest --force-new-ctest-process
Test project /usr/ports/net/aluminum/work/.build
No tests were found!!!
[yv:33027] *** Process received signal ***
[yv:33027] Signal: Segmentation fault (11)
[yv:33027] Signal code: Address not mapped (1)
[yv:33027] Failing at address: 0x440000c8
[yv:33027] [ 0] 0x826d6762c <pthread_sigmask+0x54c> at /lib/libthr.so.3
[yv:33027] [ 1] 0x826d66bd9 <pthread_setschedparam+0x839> at /lib/libthr.so.3
[yv:33027] [ 2] 0x7ffffffff923 <_fini+0x7fffffdd3aa7> at ???
[yv:33027] [ 3] 0x824332fe8 <MPI_Comm_get_attr+0x58> at /usr/local/mpi/openmpi/lib/libmpi.so.40
[yv:33027] [ 4] 0x821b7a4b2 <_ZN2Al8internal3mpi4initERiRPPci+0x102> at /usr/ports/net/aluminum/work/.build/src/libAl.so.1.4.1
[yv:33027] [ 5] 0x821b7735a <_ZN2Al10InitializeERiRPPci+0x1a> at /usr/ports/net/aluminum/work/.build/src/libAl.so.1.4.1
[yv:33027] [ 6] 0x20d730 <main+0x40> at /usr/ports/net/aluminum/work/.build/test/test_exchange
[yv:33027] *** End of error message ***
*** Signal 11
```

Environment: FreeBSD 13.2, clang 15

ndryden commented 1 year ago

Unfortunately, Aluminum's tests are not actually integrated with ctest (for a variety of reasons, primarily that they all need MPI to run), which is why ctest reports "No tests were found". I suspect MPI failed to initialize properly without reporting an error, and the subsequent call to MPI_Comm_get_attr then segfaulted; see the sketch below.
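Since the tests are not wired into ctest, they have to be launched directly under an MPI launcher, e.g. something like `mpirun -np 2 /usr/ports/net/aluminum/work/.build/test/test_exchange` (path taken from the log above; the library in the backtrace is Open MPI). As a minimal standalone sketch of the suspected failure mode, the program below mimics the sequence in the backtrace (initialize MPI, then query an attribute of MPI_COMM_WORLD). It is illustrative only, not Aluminum's actual init path:

```cpp
// Hypothetical sketch of the suspected failure mode: if MPI_Init fails
// (or aborts internally without reporting an error), MPI_COMM_WORLD is
// not valid, and a later attribute query can fault exactly as in the
// posted backtrace (MPI_Comm_get_attr inside Al::internal::mpi::init).
#include <mpi.h>
#include <cstdio>
#include <cstdlib>

int main(int argc, char** argv) {
  // Check the return code: using MPI after a failed init is undefined
  // behavior, which typically surfaces as a segfault.
  if (MPI_Init(&argc, &argv) != MPI_SUCCESS) {
    std::fprintf(stderr, "MPI_Init failed\n");
    return EXIT_FAILURE;
  }

  // Defensive check before touching MPI_COMM_WORLD.
  int initialized = 0;
  MPI_Initialized(&initialized);
  if (!initialized) {
    std::fprintf(stderr, "MPI reports it is not initialized\n");
    return EXIT_FAILURE;
  }

  // Roughly where the posted backtrace faults.
  int* tag_ub = nullptr;
  int flag = 0;
  MPI_Comm_get_attr(MPI_COMM_WORLD, MPI_TAG_UB, &tag_ub, &flag);
  if (flag) {
    std::printf("MPI_TAG_UB = %d\n", *tag_ub);
  }

  MPI_Finalize();
  return EXIT_SUCCESS;
}
```

If this standalone program also crashes, the problem is in the Open MPI installation on FreeBSD rather than in Aluminum itself.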

It's a bit odd, but sadly not too surprising, that the segfault is occurring inside an MPI call, especially if MPI failed to initialize properly.

Still, Aluminum should probably do a better job of detecting whether MPI initialized successfully, so I will attempt to make this a bit more robust.
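One possible shape for that hardening, as a hedged sketch (the function name and error handling here are hypothetical, not Aluminum's actual API):

```cpp
// Hypothetical hardening sketch (not Aluminum's actual internals):
// verify initialization really succeeded, and make MPI report errors
// instead of aborting the process.
#include <mpi.h>
#include <stdexcept>

void init_checked(int& argc, char**& argv) {
  int provided = 0;
  if (MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided)
      != MPI_SUCCESS) {
    throw std::runtime_error("MPI_Init_thread failed");
  }

  // Belt-and-suspenders: some MPI builds return success paths that
  // still leave the library unusable, so ask MPI directly.
  int initialized = 0;
  MPI_Initialized(&initialized);
  if (!initialized) {
    throw std::runtime_error("MPI claims success but is not initialized");
  }

  // Have MPI return error codes rather than aborting, so callers can
  // surface a readable error instead of a segfault.
  MPI_Comm_set_errhandler(MPI_COMM_WORLD, MPI_ERRORS_RETURN);
}
```

Installing MPI_ERRORS_RETURN on MPI_COMM_WORLD means later MPI calls hand back an error code that can be turned into a diagnosable exception, which would convert this class of crash into an actionable message.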

If you suspect a deeper issue here, please re-open or make a new issue. Thanks!