PASSIONLab / CombBLAS

The Combinatorial BLAS (CombBLAS) is an extensible distributed-memory parallel graph library offering a small but powerful set of linear algebra primitives specifically targeting graph analytics.

Illegal Operand from MTRand::initialize() #2

Closed JustinClough closed 3 years ago

JustinClough commented 3 years ago

I get the following error when using another library based on this CombBLAS code:

[e16-24:11747:0:11747] Caught signal 4 (Illegal instruction: illegal operand)
/project/aoberai_286/jlclough/trilinos_project/tpls_install/combblas/include/psort/MersenneTwister.h: [ MTRand::initialize() ]
      ...
      290       int i = 1;
      291       *s++ = seed & 0xffffffffUL;
      292       for( ; i < N; ++i )
==>   293       {
      294               *s++ = ( 1812433253UL * ( *r ^ (*r >> 30) ) + i ) & 0xffffffffUL;
      295               r++;
      296       }
==== backtrace (tid:  11747) ====
0 0x000000000042383c MTRand::initialize()  /project/aoberai_286/jlclough/trilinos_project/tpls_install/combblas/include/psort/MersenneTwister.h:293
1 0x000000000042383c MTRand::seed()  /project/aoberai_286/jlclough/trilinos_project/tpls_install/combblas/include/psort/MersenneTwister.h:218
2 0x000000000042383c MTRand::MTRand()  /project/aoberai_286/jlclough/trilinos_project/tpls_install/combblas/include/psort/MersenneTwister.h:138
3 0x000000000042383c __static_initialization_and_destruction_0()  /project/aoberai_286/jlclough/trilinos_project/tpls_src/CombBLAS/Applications/BipartiteMatchings/BPMaximalMatching.h:17
4 0x000000000042383c _GLOBAL__sub_I_c2cpp_GetAWPM.cpp()  /project/aoberai_286/jlclough/trilinos_project/tpls_src/superlu_dist-5.4.0/SRC/c2cpp_GetAWPM.cpp:63
5 0x000000000054c66d __libc_csu_init()  ???:0
6 0x00000000000224e5 __libc_start_main()  ???:0
7 0x0000000000423af4 _start()  ???:0
=================================

This has been a heisenbug for me. I have the same CombBLAS version installed on another machine but haven't seen this error there. Additionally, I only get this error some of the time; I haven't figured out a pattern or an exact way to reproduce it yet (sorry). My current workaround is to wait about 10 minutes and try again.

I was hoping there was a more permanent solution to this. Any ideas or advice?

Some other helpful info:

  • My CombBLAS version is from this commit: e6c55bd https://github.com/PASSIONLab/CombBLAS/commit/e6c55bd48a442b8fa95870fb5f18cd6b89cbffe9
  • The other library is SuperLU-Dist. I'm using version 5.4.0: https://github.com/xiaoyeli/superlu_dist
  • The machine I have not had this problem on is a desktop:
    • It's running Ubuntu 19.10
    • I'm using gcc v9.2.1 and openmpi v3.1.3
  • The machine I am having this problem on is USC's HPC https://carc.usc.edu/:
    • It's running CentOS 7.7.1908
    • I'm using gcc v9.2.0 and openmpi v4.0.2
  • I also get a similar "Illegal Operand" error when running ctest. These also come and go at seemingly random intervals; either they all pass or none of them do. When they do not pass, I get this output (example from ctest's test 1):

[e13-11:12671:0:12671] Caught signal 4 (Illegal instruction: illegal operand)
==== backtrace (tid: 12671) ====
0 0x0000000000411501 combblas::DistEdgeList::GenGraph500Data()  ???:0
1 0x000000000040d32b main()  ???:0
2 0x0000000000022555 __libc_start_main()  ???:0
3 0x000000000040d984 _start()  ???:0

Let me know if you need any other information from me. Thank you!

aydinbuluc commented 3 years ago

Justin,

I can’t imagine a use case where SuperLU calls the GenGraph500Data() function. Is this really within functional code, or just for benchmarking/installation/unit testing?

JustinClough commented 3 years ago

Hi Aydin,

You're right in that SuperLU isn't calling GenGraph500Data(). That output is from the first test that ctest does. The exact test name is GenMMWrite_Test. I included that to possibly help with debugging.

The core of my issue is when SuperLU calls MTRand::initialize() via BPMaximalMatching. That call does happen both for unit tests and end-use applications.
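To illustrate why that call happens without SuperLU ever invoking it explicitly: the backtrace shows MTRand::MTRand() running from __static_initialization_and_destruction_0(), which is what a namespace-scope object with a constructor produces. Below is a minimal sketch of that mechanism. ToyRand, global_rng and rng_ready_before_main are hypothetical names, the state size of 624 and the default seed 5489 are assumed from the standard MT19937 parameters, and the loop only mirrors the seeding line shown in the trace; it is not the CombBLAS code.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical stand-in for a generator whose constructor seeds its state,
// mirroring the loop shown at MersenneTwister.h:290-296 in the trace.
struct ToyRand {
    std::vector<std::uint32_t> state;

    explicit ToyRand(std::uint32_t seed = 5489u) : state(624) {
        state[0] = seed;
        for (std::uint32_t i = 1; i < state.size(); ++i) {
            std::uint32_t prev = state[i - 1];
            // 32-bit unsigned arithmetic wraps, so no explicit mask is needed here.
            state[i] = 1812433253u * (prev ^ (prev >> 30)) + i;
        }
    }
};

// A namespace-scope object: its constructor runs during static
// initialization, i.e. before main() is entered.
static ToyRand global_rng;

// True if the global generator was already seeded when this is called.
bool rng_ready_before_main() {
    return global_rng.state.size() == 624 && global_rng.state[0] == 5489u;
}
```

Any translation unit that includes a header defining such an object pays for its construction at program start, so an illegal instruction inside the constructor crashes the program before main() does any work.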

Thanks, -Justin

JustinClough commented 3 years ago

I found that the root cause is actually a niche compiler-CPU mismatch that surfaces when running CombBLAS, not a bug in CombBLAS itself.

There are more details on that issue here and here.

The CPU types I get this error on are xeon-2640v3 and v4. I can successfully run CombBLAS on xeon-6130.

I originally couldn't figure out how to reproduce this bug because Slurm would assign the test to whatever compute node was free at that moment. Depending on the CPUs in that node, the tests would pass or fail accordingly.
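One way to see which side of such a mismatch a given node falls on is to ask the CPU at run time which instruction-set extensions it supports. The sketch below uses the GCC/Clang builtins __builtin_cpu_init and __builtin_cpu_supports (x86 targets only). The function name supported_cpu_features and the list of features checked are illustrative assumptions; this thread does not say which instruction set was actually involved.

```cpp
#include <string>

// Returns a space-separated list of selected x86 extensions that the CPU
// this code is *running* on supports. Requires GCC or Clang on x86.
std::string supported_cpu_features() {
    // Needed if this is ever called before static constructors have run;
    // harmless otherwise.
    __builtin_cpu_init();

    std::string out;
    // __builtin_cpu_supports takes a string literal, so each feature
    // is queried explicitly rather than in a loop.
    if (__builtin_cpu_supports("sse4.2"))  out += "sse4.2 ";
    if (__builtin_cpu_supports("avx"))     out += "avx ";
    if (__builtin_cpu_supports("avx2"))    out += "avx2 ";
    if (__builtin_cpu_supports("avx512f")) out += "avx512f ";
    return out;
}
```

Printing this string at the start of a job on each node type, and comparing it with the node the library was compiled on, would show whether the build targets extensions that some compute nodes lack.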