Status: Closed (lxz12 closed this issue 1 year ago)
This looks like an issue with hwloc. What version of hwloc are you using? Did you build with CUDA or ROCm support?
I'm not sure of the best way to debug this, since it may be an environment issue on your side. It appears that hwloc is returning a cpuset that is an infinite bitmap (one with an unbounded number of set bits), which is strange.
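If it helps narrow this down, here is a minimal sketch that prints the CPU affinity the Linux kernel reports for the current process. It is not part of Aluminum or hwloc, and the helper names `format_cpu_list` and `current_affinity` are made up for illustration. It assumes Linux, where Python's `os.sched_getaffinity` is available. The kernel's affinity set is always finite, so if it looks sane here, comparing it with what hwloc reports in the same environment (for example via `hwloc-bind --get`) may show whether the odd cpuset comes from hwloc or from the environment.

```python
import os


def format_cpu_list(cpus):
    """Compress CPU indices into a range string such as '0-3,8'."""
    ordered = sorted(cpus)
    if not ordered:
        return ""
    ranges = []
    start = prev = ordered[0]
    for cpu in ordered[1:]:
        if cpu == prev + 1:
            # Still inside a contiguous run; extend it.
            prev = cpu
            continue
        # The run ended; record it and start a new one.
        ranges.append((start, prev))
        start = prev = cpu
    ranges.append((start, prev))
    return ",".join(str(a) if a == b else f"{a}-{b}" for a, b in ranges)


def current_affinity():
    """Return the set of CPUs this process may run on, or None if unsupported."""
    if hasattr(os, "sched_getaffinity"):
        return os.sched_getaffinity(0)  # 0 means "the calling process"
    return None


if __name__ == "__main__":
    cpus = current_affinity()
    if cpus is None:
        print("os.sched_getaffinity is not available on this platform")
    else:
        print(f"{len(cpus)} CPUs allowed: {format_cpu_list(cpus)}")
```

Running this in the same shell (and under the same launcher, if any) as `./hello_world` gives the most comparable result, since affinity is inherited from the parent process.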
Closing. Please respond here with new information, or open a new issue if there are other problems.
Following the README, I ran the cmake and make steps. Then I went to the examples directory, followed the README there, ran make, and got three executables: hello_world, allreduce, and pingpong. When I run ./hello_world, I get the following error. I would like to fix this error.
shangda02@abc-Super-Server:~/LLNL_Aluminum/examples/build$ ./hello_world
terminate called after throwing an instance of 'Al::al_exception'
  what(): /home/shangda02/LLNL_Aluminum/src/progress.cpp:88 - Tried to exchange infinite bitmap
[abc-Super-Server:29183] Process received signal
[abc-Super-Server:29183] Signal: Aborted (6)
[abc-Super-Server:29183] Signal code: (-6)
[abc-Super-Server:29183] [ 0] /lib/x86_64-linux-gnu/libc.so.6(+0x3ef10)[0x7fdb3bd6ef10]
[abc-Super-Server:29183] [ 1] /lib/x86_64-linux-gnu/libc.so.6(gsignal+0xc7)[0x7fdb3bd6ee87]
[abc-Super-Server:29183] [ 2] /lib/x86_64-linux-gnu/libc.so.6(abort+0x141)[0x7fdb3bd707f1]
[abc-Super-Server:29183] [ 3] /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0x8c957)[0x7fdb3c3c5957]
[abc-Super-Server:29183] [ 4] /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0x92ae6)[0x7fdb3c3cbae6]
[abc-Super-Server:29183] [ 5] /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0x92b21)[0x7fdb3c3cbb21]
[abc-Super-Server:29183] [ 6] /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0x92d54)[0x7fdb3c3cbd54]
[abc-Super-Server:29183] [ 7] /home/shangda02/LLNL_Aluminum/build/src/libAl.so.1.3.1(_ZN2Al8internal14ProgressEngine9bind_initEv+0xbc8)[0x7fdb3ca1e678]
[abc-Super-Server:29183] [ 8] /home/shangda02/LLNL_Aluminum/build/src/libAl.so.1.3.1(_ZN2Al8internal14ProgressEngineC1Ev+0x14b)[0x7fdb3ca1ec3b]
[abc-Super-Server:29183] [ 9] /home/shangda02/LLNL_Aluminum/build/src/libAl.so.1.3.1(_ZN2Al10InitializeERiRPPcP19ompi_communicator_t+0x39)[0x7fdb3ca19a89]
[abc-Super-Server:29183] [10] ./hello_world(+0xe30)[0x5598fd9ace30]
[abc-Super-Server:29183] [11] /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xe7)[0x7fdb3bd51c87]
[abc-Super-Server:29183] [12] ./hello_world(+0xfca)[0x5598fd9acfca]
[abc-Super-Server:29183] End of error message
Aborted (core dumped)
To be honest, I don't have a good understanding of the project as a whole.