ORNL-Fusion / PARVMEC

3D Equilibrium Solver
MIT License

Multiple processor crash #10

Open cianciosa opened 3 years ago

cianciosa commented 3 years ago

Joachim Geiger has reported a crash when running with multiple processors. The input files in Cases.zip show the behavior. `input.crashes` uses an extended number of modes and crashes with a heap-overflow error when run with more than a single processor. `input.works` is the same case with a reduced number of modes; this case does not exhibit the behavior. The crash was reported using the ifort compiler; however, I was able to reproduce it by turning on the address-sanitizer flag.

% mpirun -n 4 xvmec input.crashes_3    
 - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
  VMEC OUTPUT FILES ALREADY EXIST: OVERWRITING THEM ...
  SEQ =    1 TIME SLICE  0.0000E+00
  PROCESSING INPUT.crashes_3
  THIS IS PARVMEC (PARALLEL VMEC), VERSION 9.0
  Lambda: Full Radial Mesh. L-Force: hybrid full/half.

  COMPUTER: cianciosaimac   OS: Darwin   RELEASE: 19.6.0  DATE = Jan 21,2021  TIME = 12:52:34

  NS =    8 NO. FOURIER MODES =  185 FTOLV =  1.000E-06 NITER =  20000
  PROCESSOR COUNT - RADIAL:    4
 INITIAL JACOBIAN CHANGED SIGN!
 TRYING TO IMPROVE INITIAL MAGNETIC AXIS GUESS
  ---- Improved AXIS Guess ----
      RAXIS_CC =    5.5423259209884730       0.30747882334706500        3.6107777297953697E-002   2.1925887832076173E-002 -0.17127515915757005       0.33995876393572677        2.7194580396712614E-002   8.7619938032124662E-003   2.1641584886036458E-002  -3.0060375964156970E-002   4.0919407891436034E-003   7.2283631622133112E-003  -4.8096045954452264E-003   3.2132317238919464E-003   1.3366337123433408E-003  -5.0218208257885189E-003  -1.0805539441867496E-003   3.8372284158438586E-004   1.2322391511445112E-003   8.2564184559682900E-004   9.0462982158830627E-003
      ZAXIS_CS =   -0.0000000000000000      -0.40364620347171476       -2.6212416249487239E-002   2.5845975128812093E-002  0.15344591155188636      -0.27210128536906603       -2.4819582171628708E-002  -7.6814873421304332E-003  -2.2282872186040290E-002   1.9170323502591072E-002  -1.1569841914854002E-002  -6.1298139436995875E-004  -2.6220827681052326E-003  -5.6155647985143900E-003  -3.0101401187663541E-003  -8.9905949402988867E-003  -4.8346291121438923E-003  -5.7954765825185117E-003   8.0075797167838414E-003  -3.0281697953424324E-003  -3.8957154711619243E-003
  -----------------------------
=================================================================
=================================================================
=================================================================
=================================================================
==55382==ERROR: AddressSanitizer: heap-buffer-overflow on address 0x6180000077c0 at pc 0x000102552fed bp 0x7ffeed759dc0 sp 0x7ffeed759db8
==55380==ERROR: AddressSanitizer: heap-buffer-overflow on address 0x6180000077c0 at pc 0x00010ec8dfed bp 0x7ffee101edc0 sp 0x7ffee101edb8
READ of size 8 at 0x6180000077c0 thread T0
READ of size 8 at 0x6180000077c0 thread T0
==55383==ERROR: AddressSanitizer: heap-buffer-overflow on address 0x6180000077c0 at pc 0x00010c003fed bp 0x7ffee3ca8dc0 sp 0x7ffee3ca8db8
READ of size 8 at 0x6180000077c0 thread T0
==55381==ERROR: AddressSanitizer: heap-buffer-overflow on address 0x6180000077c0 at pc 0x00010d992fed bp 0x7ffee2319dc0 sp 0x7ffee2319db8
READ of size 8 at 0x6180000077c0 thread T0
    #0 0x10ec8dfec in __blocktridiagonalsolver_bst_MOD_initialize_bst blocktridiagonalsolver_bst.f90:2005
    #1 0x10f09af06 in runvmec_ runvmec.f:329
    #2 0x10ebdf804 in MAIN__ vmec.f:333
    #3 0x10ebe1818 in main vmec.f:2
    #0 0x10c003fec in __blocktridiagonalsolver_bst_MOD_initialize_bst blocktridiagonalsolver_bst.f90:2005
    #1 0x10c410f06 in runvmec_ runvmec.f:329
    #2 0x10bf55804 in MAIN__ vmec.f:333
    #3 0x10bf57818 in main vmec.f:2
    #4 0x7fff6fbc6cc8 in start (libdyld.dylib:x86_64+0x1acc8)

0x6180000077c0 is located 0 bytes to the right of 832-byte region [0x618000007480,0x6180000077c0)
allocated by thread T0 here:
    #4 0x7fff6fbc6cc8 in start (libdyld.dylib:x86_64+0x1acc8)

0x6180000077c0 is located 0 bytes to the right of 832-byte region [0x618000007480,0x6180000077c0)
allocated by thread T0 here:
    #0 0x10d992fec in __blocktridiagonalsolver_bst_MOD_initialize_bst blocktridiagonalsolver_bst.f90:2005
    #1 0x10dd9ff06 in runvmec_ runvmec.f:329
    #2 0x10d8e4804 in MAIN__ vmec.f:333
    #3 0x10d8e6818 in main vmec.f:2
    #0 0x113a341ad in wrap_malloc (libasan.5.dylib:x86_64+0x6c1ad)
    #1 0x10ec8d21b in __blocktridiagonalsolver_bst_MOD_initialize_bst blocktridiagonalsolver_bst.f90:2002
    #2 0x10f09af06 in runvmec_ runvmec.f:329
    #3 0x10ebdf804 in MAIN__ vmec.f:333
    #4 0x10ebe1818 in main vmec.f:2
    #5 0x7fff6fbc6cc8 in start (libdyld.dylib:x86_64+0x1acc8)

    #4 0x7fff6fbc6cc8 in start (libdyld.dylib:x86_64+0x1acc8)

0x6180000077c0 is located 0 bytes to the right of 832-byte region [0x618000007480,0x6180000077c0)
SUMMARY: AddressSanitizer: heap-buffer-overflow blocktridiagonalsolver_bst.f90:2005 in __blocktridiagonalsolver_bst_MOD_initialize_bst
allocated by thread T0 here:
Shadow bytes around the buggy address:
  0x1c3000000ea0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
  0x1c3000000eb0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
  0x1c3000000ec0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
  0x1c3000000ed0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
  0x1c3000000ee0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
=>0x1c3000000ef0: 00 00 00 00 00 00 00 00[fa]fa fa fa fa fa fa fa
  0x1c3000000f00: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
  0x1c3000000f10: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
  0x1c3000000f20: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
  0x1c3000000f30: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
  0x1c3000000f40: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
Shadow byte legend (one shadow byte represents 8 application bytes):
  Addressable:           00
  Partially addressable: 01 02 03 04 05 06 07 
  Heap left redzone:       fa
  Freed heap region:       fd
  Stack left redzone:      f1
  Stack mid redzone:       f2
  Stack right redzone:     f3
  Stack after return:      f5
  Stack use after scope:   f8
  Global redzone:          f9
  Global init order:       f6
  Poisoned by user:        f7
  Container overflow:      fc
  Array cookie:            ac
  Intra object redzone:    bb
  ASan internal:           fe
  Left alloca redzone:     ca
  Right alloca redzone:    cb
==55380==ABORTING

Program received signal SIGABRT: Process abort signal.

Backtrace for this error:
#0  0x1136dc72c
#1  0x1136dbad3
#2  0x7fff6fdbf5fc
    #0 0x11128d1ad in wrap_malloc (libasan.5.dylib:x86_64+0x6c1ad)
    #1 0x10c00321b in __blocktridiagonalsolver_bst_MOD_initialize_bst blocktridiagonalsolver_bst.f90:2002
    #2 0x10c410f06 in runvmec_ runvmec.f:329
    #3 0x10bf55804 in MAIN__ vmec.f:333
    #4 0x10bf57818 in main vmec.f:2
    #5 0x7fff6fbc6cc8 in start (libdyld.dylib:x86_64+0x1acc8)

SUMMARY: AddressSanitizer: heap-buffer-overflow blocktridiagonalsolver_bst.f90:2005 in __blocktridiagonalsolver_bst_MOD_initialize_bst
Shadow bytes around the buggy address:
  0x1c3000000ea0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
  0x1c3000000eb0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
  0x1c3000000ec0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
  0x1c3000000ed0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
  0x1c3000000ee0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
=>0x1c3000000ef0: 00 00 00 00 00 00 00 00[fa]fa fa fa fa fa fa fa
  0x1c3000000f00: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
  0x1c3000000f10: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
  0x1c3000000f20: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
  0x1c3000000f30: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
  0x1c3000000f40: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
Shadow byte legend (one shadow byte represents 8 application bytes):
  Addressable:           00
  Partially addressable: 01 02 03 04 05 06 07 
  Heap left redzone:       fa
  Freed heap region:       fd
  Stack left redzone:      f1
  Stack mid redzone:       f2
  Stack right redzone:     f3
  Stack after return:      f5
  Stack use after scope:   f8
  Global redzone:          f9
  Global init order:       f6
  Poisoned by user:        f7
  Container overflow:      fc
  Array cookie:            ac
  Intra object redzone:    bb
  ASan internal:           fe
  Left alloca redzone:     ca
  Right alloca redzone:    cb
==55383==ABORTING

Program received signal SIGABRT: Process abort signal.

Backtrace for this error:
#0  0x10d8f072c
#1  0x10d8efad3
#2  0x7fff6fdbf5fc
--------------------------------------------------------------------------
mpirun noticed that process rank 0 with PID 55380 on node cianciosaimac exited on signal 6 (Abort trap: 6).
--------------------------------------------------------------------------

cianciosa commented 3 years ago

Looking at some debugging information for https://github.com/ORNL-Fusion/PARVMEC/blob/41375c75ca3e3e1305eddef6a70521abf82e8d1a/Sources/General/blocktridiagonalsolver_bst.f90#L2005

orig is allocated with lower and upper bounds of 1 and 2, but globrowoff is trying to access the 3rd index.
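
The arithmetic in the AddressSanitizer report is consistent with this reading. The following is a minimal sketch, assuming that the 832-byte region holds exactly the two elements of orig (so each element would be 416 bytes) and that the elements are laid out contiguously. `element_offset` is a hypothetical helper written for illustration and is not part of PARVMEC.

```python
# Illustrative model only: names and layout are assumptions, not PARVMEC code.
REGION_BYTES = 832       # size of the heap region in the AddressSanitizer report
LOWER, UPPER = 1, 2      # bounds of orig as reported in this thread


def element_offset(index, lower=LOWER, upper=UPPER, region_bytes=REGION_BYTES):
    """Return (byte offset of element `index`, whether it is within bounds)."""
    count = upper - lower + 1              # number of allocated elements
    elem_bytes = region_bytes // count     # assumed size of one element
    offset = (index - lower) * elem_bytes  # offset from the start of the region
    return offset, lower <= index <= upper


for i in (1, 2, 3):
    offset, ok = element_offset(i)
    print(f"orig({i}): offset {offset} bytes, in bounds: {ok}")
```

Under these assumptions, index 3 maps to byte offset 832, the first byte past the allocation, which agrees with the report's "0 bytes to the right of 832-byte region".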

cianciosa commented 3 years ago

The crash is happening because something is changing the size of startglobrow and endglobrow.

cianciosa commented 3 years ago

It looks like the crash happens because of the following sequence:

  1. Initialize_bst is called with correct sizes.
  2. eqsolve is called.
  3. evolve is called.
  4. The jacobian changes sign.
  5. evolve returns with the bad jacobian flag.
  6. Reset and retry eqsolve.
  7. The jacobian changes sign again.
  8. evolve returns with the bad jacobian flag.
  9. Exit from eqsolve.
  10. The next grid size is attempted.
  11. Initialize_bst is called with incorrect sizes.
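
The sequence above can be written down as a small trace generator, which makes it easier to see that the second Initialize_bst call is only reached after two failed jacobian attempts. This is a sketch of the control flow as described in this comment. The function name, the event strings, and the retry count of two are illustrative assumptions, not PARVMEC code.

```python
def trace_failed_grid(retries=2):
    """Return the ordered events for one grid whose jacobian keeps changing sign.

    Hypothetical model of the sequence listed in this comment.
    """
    events = ["initialize_bst (correct sizes)", "eqsolve called", "evolve called"]
    for attempt in range(1, retries + 1):
        events.append("jacobian changes sign")
        events.append("evolve returns bad jacobian flag")
        if attempt < retries:
            # Only retry while attempts remain; the last failure falls through.
            events.append("reset and retry eqsolve")
    events += [
        "exit eqsolve",
        "next grid size attempted",
        "initialize_bst (incorrect sizes)",
    ]
    return events


for step, event in enumerate(trace_failed_grid(), start=1):
    print(f"{step:2d}. {event}")
```

With the default of two attempts, the trace has the same eleven steps as the list, ending in the Initialize_bst call that sees the wrong sizes.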
cianciosa commented 3 years ago

This line causes the loop to return to tag 50: https://github.com/ORNL-Fusion/PARVMEC/blob/41375c75ca3e3e1305eddef6a70521abf82e8d1a/Sources/TimeStep/runvmec.f#L381

cianciosa commented 3 years ago

At the second attempt of the multigrid, the loop counter starts at index zero since jacobian_off is one.
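
A minimal sketch of that loop start, assuming the multigrid loop begins at the first grid index minus jacobian_off. `grid_indices`, `ngrids`, and `first_grid` are hypothetical names chosen for illustration and are not taken from runvmec.f.

```python
def grid_indices(ngrids, jacobian_off, first_grid=1):
    """Return the grid indices visited by the (assumed) multigrid loop."""
    start = first_grid - jacobian_off  # assumed: the offset shifts the start down
    return list(range(start, ngrids + 1))


print(grid_indices(3, jacobian_off=0))  # [1, 2, 3]
print(grid_indices(3, jacobian_off=1))  # [0, 1, 2, 3]
```

If the real loop behaves this way, any array sized for grids 1 through ngrids would see one extra pass at index zero whenever jacobian_off is one.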