LLNL / Aluminum

High-performance, GPU-aware communication library
https://aluminum.readthedocs.io/en/latest/

I'm a newbie. Can I ask you a question #205

Closed lxz12 closed 1 year ago

lxz12 commented 1 year ago

Following the README, I ran the cmake and make commands. Then I went to the examples directory, followed the README there, ran make, and got three executables: hello_world, allreduce, and pingpong. When I run ./hello_world, I get the following error, which I would like to fix:

shangda02@abc-Super-Server:~/LLNL_Aluminum/examples/build$ ./hello_world
terminate called after throwing an instance of 'Al::al_exception'
what(): /home/shangda02/LLNL_Aluminum/src/progress.cpp:88 - Tried to exchange infinite bitmap
[abc-Super-Server:29183] Process received signal
[abc-Super-Server:29183] Signal: Aborted (6)
[abc-Super-Server:29183] Signal code: (-6)
[abc-Super-Server:29183] [ 0] /lib/x86_64-linux-gnu/libc.so.6(+0x3ef10)[0x7fdb3bd6ef10]
[abc-Super-Server:29183] [ 1] /lib/x86_64-linux-gnu/libc.so.6(gsignal+0xc7)[0x7fdb3bd6ee87]
[abc-Super-Server:29183] [ 2] /lib/x86_64-linux-gnu/libc.so.6(abort+0x141)[0x7fdb3bd707f1]
[abc-Super-Server:29183] [ 3] /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0x8c957)[0x7fdb3c3c5957]
[abc-Super-Server:29183] [ 4] /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0x92ae6)[0x7fdb3c3cbae6]
[abc-Super-Server:29183] [ 5] /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0x92b21)[0x7fdb3c3cbb21]
[abc-Super-Server:29183] [ 6] /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0x92d54)[0x7fdb3c3cbd54]
[abc-Super-Server:29183] [ 7] /home/shangda02/LLNL_Aluminum/build/src/libAl.so.1.3.1(_ZN2Al8internal14ProgressEngine9bind_initEv+0xbc8)[0x7fdb3ca1e678]
[abc-Super-Server:29183] [ 8] /home/shangda02/LLNL_Aluminum/build/src/libAl.so.1.3.1(_ZN2Al8internal14ProgressEngineC1Ev+0x14b)[0x7fdb3ca1ec3b]
[abc-Super-Server:29183] [ 9] /home/shangda02/LLNL_Aluminum/build/src/libAl.so.1.3.1(_ZN2Al10InitializeERiRPPcP19ompi_communicator_t+0x39)[0x7fdb3ca19a89]
[abc-Super-Server:29183] [10] ./hello_world(+0xe30)[0x5598fd9ace30]
[abc-Super-Server:29183] [11] /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xe7)[0x7fdb3bd51c87]
[abc-Super-Server:29183] [12] ./hello_world(+0xfca)[0x5598fd9acfca]
[abc-Super-Server:29183] End of error message
Aborted (core dumped)

To be honest, I don't have a good understanding of the whole project.

ndryden commented 1 year ago

This looks like an issue with hwloc. What version of hwloc are you using? Did you build with CUDA or ROCm support?

I'm not sure of the best way to debug this, since it may be an environment issue on your side. It appears that hwloc is returning a cpuset with infinitely many set bits, which is strange.

ndryden commented 1 year ago

Closing; please respond here with new information, or open a new issue if there are other problems.