fnalacceleratormodeling / synergia2

Synergia is an accelerator modeling and simulation package developed at Fermilab.

Segfault in fftw_execute when running example on homepage #29

Closed by s-sajid-ali 2 years ago

s-sajid-ali commented 2 years ago

When running the example on the homepage, I encountered the following segfault:

Kokkos::OpenMP::initialize WARNING: OMP_PROC_BIND environment variable not set
  In general, for best performance with OpenMP 4.0 or better set OMP_PROC_BIND=spread and OMP_PLACES=threads
  For best performance with OpenMP 3.1 set OMP_PROC_BIND=true
  For unit testing set OMP_PROC_BIND=false
Propagator: starting turn 1, final turn 10

Process 8406 stopped
* thread #1, name = 'python', stop reason = signal SIGSEGV: invalid address (fault address: 0x30)
    frame #0: 0x0000000000000030
error: memory read failed for 0x0
(lldb) bt 5
* thread #1, name = 'python', stop reason = signal SIGSEGV: invalid address (fault address: 0x30)
  * frame #0: 0x0000000000000030
    frame #1: 0x00007fffc802e2dd libsynergia_distributed_fft.so`Distributed_fft3d::inv_transform(this=0x00000000011eadb0, in=<unavailable>, out=0x0000000000763088) at distributed_fft3d_fftw.cc:111:3
    frame #2: 0x00007fffa9d6b9e4 libsynergia_collective.so`Space_charge_3d_open_hockney::get_local_phi2(this=0x0000000000762ea0, fft=0x00000000011eadb0) at space_charge_3d_open_hockney.cc:908:7
    frame #3: 0x00007fffa9d69f48 libsynergia_collective.so`Space_charge_3d_open_hockney::apply_bunch(this=0x0000000000762ea0, bunch=0x0000000001168e90, fft=0x00000000011eadb0, time_step=6.9899537156101739E-9, logger=<unavailable>) at space_charge_3d_open_hockney.cc:663:3
    frame #4: 0x00007fffa9d691eb libsynergia_collective.so`Space_charge_3d_open_hockney::apply_impl(this=0x0000000000762ea0, sim=0x00000000003dfbb0, time_step=6.9899537156101739E-9, logger=<unavailable>) at space_charge_3d_open_hockney.cc:637:7

This segfault is transient and does not always occur.

s-sajid-ali commented 2 years ago

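This error does not come from Synergia. MKL exports its own fftw_execute, overriding the FFTW routine in an attempt to provide better performance, and the call segfaults instead. In the trace below, the call from libsynergia_distributed_fft.so resolves to fftw_execute in libmkl_intel_lp64.so.2, which loads a pointer from offset 0x40 of its argument and jumps to it; the jump target, 0xe3eda0, then faults with "address access protected".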

(lldb) stepi
Process 9868 stopped
* thread #1, name = 'python', stop reason = instruction step into
    frame #0: 0x00007fffe00331ef libsynergia_distributed_fft.so`Distributed_fft3d::transform(this=0x00000000009d9e50, in=<unavailable>, out=0x00000000008fd110) at distributed_fft3d_fftw.cc:101:5
   98       memcpy( (void*)data, (void*)&in(lower*plane_real),
   99               nz * plane_real * sizeof(double) );
-> 101      fftw_execute(plan);
   103      memcpy( (void*)&out(lower*plane_cplx*2), (void*)(workspace),
   104              nz * plane_cplx * sizeof(double) * 2 );
(lldb) stepi
Process 9868 stopped
* thread #1, name = 'python', stop reason = instruction step into
    frame #0: 0x00007fffe0032130 libsynergia_distributed_fft.so`fftw_execute
->  0x7fffe0032130 <+0>:  jmpq   *0x4f62(%rip)             ; _GLOBAL_OFFSET_TABLE_ + 152
    0x7fffe0032136 <+6>:  pushq  $0x10
    0x7fffe003213b <+11>: jmp    0x7fffe0032020            ; ___lldb_unnamed_symbol187

    0x7fffe0032140 <+0>:  jmpq   *0x4f5a(%rip)             ; _GLOBAL_OFFSET_TABLE_ + 160
(lldb) stepi
Process 9868 stopped
* thread #1, name = 'python', stop reason = instruction step into
    frame #0: 0x00007fffed65a910 libmkl_intel_lp64.so.2`fftw_execute
->  0x7fffed65a910 <+0>: testq  %rdi, %rdi
    0x7fffed65a913 <+3>: je     0x7fffed65a91e            ; <+14>
    0x7fffed65a915 <+5>: movq   0x40(%rdi), %rax
    0x7fffed65a919 <+9>: testq  %rax, %rax
(lldb) stepi
Process 9868 stopped
* thread #1, name = 'python', stop reason = instruction step into
    frame #0: 0x00007fffed65a913 libmkl_intel_lp64.so.2`fftw_execute + 3
->  0x7fffed65a913 <+3>:  je     0x7fffed65a91e            ; <+14>
    0x7fffed65a915 <+5>:  movq   0x40(%rdi), %rax
    0x7fffed65a919 <+9>:  testq  %rax, %rax
    0x7fffed65a91c <+12>: jne    0x7fffed65a91f            ; <+15>
(lldb) stepi
Process 9868 stopped
* thread #1, name = 'python', stop reason = instruction step into
    frame #0: 0x00007fffed65a915 libmkl_intel_lp64.so.2`fftw_execute + 5
->  0x7fffed65a915 <+5>:  movq   0x40(%rdi), %rax
    0x7fffed65a919 <+9>:  testq  %rax, %rax
    0x7fffed65a91c <+12>: jne    0x7fffed65a91f            ; <+15>
    0x7fffed65a91e <+14>: retq
(lldb) stepi
Process 9868 stopped
* thread #1, name = 'python', stop reason = instruction step into
    frame #0: 0x00007fffed65a919 libmkl_intel_lp64.so.2`fftw_execute + 9
->  0x7fffed65a919 <+9>:  testq  %rax, %rax
    0x7fffed65a91c <+12>: jne    0x7fffed65a91f            ; <+15>
    0x7fffed65a91e <+14>: retq
    0x7fffed65a91f <+15>: jmpq   *%rax
(lldb) stepi
Process 9868 stopped
* thread #1, name = 'python', stop reason = instruction step into
    frame #0: 0x00007fffed65a91c libmkl_intel_lp64.so.2`fftw_execute + 12
->  0x7fffed65a91c <+12>: jne    0x7fffed65a91f            ; <+15>
    0x7fffed65a91e <+14>: retq
    0x7fffed65a91f <+15>: jmpq   *%rax
    0x7fffed65a921 <+17>: nopl   (%rax,%rax)
(lldb) stepi
Process 9868 stopped
* thread #1, name = 'python', stop reason = instruction step into
    frame #0: 0x00007fffed65a91f libmkl_intel_lp64.so.2`fftw_execute + 15
->  0x7fffed65a91f <+15>: jmpq   *%rax
    0x7fffed65a921 <+17>: nopl   (%rax,%rax)
    0x7fffed65a929 <+25>: nopl   (%rax)

    0x7fffed65a930 <+0>:  subq   $0x68, %rsp
(lldb) stepi
Process 9868 stopped
* thread #1, name = 'python', stop reason = instruction step into
    frame #0: 0x0000000000e3eda0
->  0xe3eda0: rolb   %bl
(lldb) stepi
Process 9868 stopped
* thread #1, name = 'python', stop reason = signal SIGSEGV: address access protected (fault address: 0xe3eda0)
    frame #0: 0x0000000000e3eda0
->  0xe3eda0: rolb   %bl
(lldb) register read bl
      bl = 0x50
(lldb) ^D
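
A rough way to check for this situation from Python, sketched under assumptions: on Linux, /proc/self/maps lists every file mapped into the process, so scanning it after importing synergia shows whether both an MKL and an FFTW shared object are loaded. The function names and the name prefixes below are illustrative and not part of Synergia. The listing is in address order and does not by itself show which library the dynamic linker picks for fftw_execute.

```python
# Name prefixes of shared objects that may export FFTW symbols (assumed list).
FFT_PROVIDERS = ("libfftw3", "libmkl")


def fft_libraries(maps_text):
    """Return paths of mapped FFTW/MKL shared objects, first occurrence only.

    maps_text is the content of /proc/<pid>/maps; each line has the form
    'address perms offset dev inode pathname'.
    """
    seen = []
    for line in maps_text.splitlines():
        fields = line.split(None, 5)
        if len(fields) < 6:
            continue  # anonymous mapping with no backing file
        path = fields[5].strip()
        name = path.rsplit("/", 1)[-1]
        if name.startswith(FFT_PROVIDERS) and path not in seen:
            seen.append(path)
    return seen


def loaded_fft_libraries():
    """Scan the current process (Linux only)."""
    with open("/proc/self/maps") as f:
        return fft_libraries(f.read())
```

Calling loaded_fft_libraries() after import synergia and seeing both libmkl and libfftw3 entries would be consistent with the trace above, where fftw_execute resolves into libmkl_intel_lp64.so.2.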

s-sajid-ali commented 2 years ago

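I don't know how to prevent MKL from overriding the FFTW routines, so a short-term fix is to LD_PRELOAD the FFTW library, which prevents the segfault above. The run below completes all 10 turns, then fails in checkpoint_save with a separate error (an unregistered cereal polymorphic type).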

sajid@LAPTOP-CDJT2P3R ~/p/s/build (devel3)> LD_PRELOAD=libfftw3.so python example.py
Kokkos::OpenMP::initialize WARNING: OMP_PROC_BIND environment variable not set
  In general, for best performance with OpenMP 4.0 or better set OMP_PROC_BIND=spread and OMP_PLACES=threads
  For best performance with OpenMP 3.1 set OMP_PROC_BIND=true
  For unit testing set OMP_PROC_BIND=false
Propagator: starting turn 1, final turn 10

Propagator: turn 1/inf., time = 0.567s, macroparticles = (1048576) / ()
Propagator: turn 2/inf., time = 0.599s, macroparticles = (1048576) / ()
Propagator: turn 3/inf., time = 0.574s, macroparticles = (1048576) / ()
Propagator: turn 4/inf., time = 0.562s, macroparticles = (1048576) / ()
Propagator: turn 5/inf., time = 0.569s, macroparticles = (1048576) / ()
Propagator: turn 6/inf., time = 0.546s, macroparticles = (1048576) / ()
Propagator: turn 7/inf., time = 0.555s, macroparticles = (1048576) / ()
Propagator: turn 8/inf., time = 0.560s, macroparticles = (1048576) / ()
Propagator: turn 9/inf., time = 0.572s, macroparticles = (1048576) / ()
Propagator: turn 10/inf., time = 0.615s, macroparticles = (1048576) / ()
Propagator: maximum number of turns reached
Propagator: total time = 5.870s
Traceback (most recent call last):
  File "/home/sajid/packages/synergia2/build/example.py", line 99, in main
  File "/home/sajid/packages/synergia2/build/example.py", line 94, in run
    synergia.simulation.checkpoint_save(propagator, sim)
RuntimeError: Trying to save an unregistered polymorphic type (Space_charge_3d_open_hockney_options).
Make sure your type is registered with CEREAL_REGISTER_TYPE and that the archive you are using was included (and registered with CEREAL_REGISTER_ARCHIVE) prior to calling CEREAL_REGISTER_TYPE.
If your type is already registered and you still see this error, you may need to use CEREAL_REGISTER_DYNAMIC_INIT.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/sajid/packages/synergia2/build/example.py", line 103, in <module>
  File "/home/sajid/packages/synergia2/build/example.py", line 101, in main
    raise RuntimeError("Failure to launch fodo.run")
RuntimeError: Failure to launch fodo.run
libc++abi: terminating with uncaught exception of type std::runtime_error: Kokkos allocation "particles_discards" is being deallocated after Kokkos::finalize was called

fish: Job 1, 'LD_PRELOAD=libfftw3.so python e…' terminated by signal SIGABRT (Abort)
sajid@LAPTOP-CDJT2P3R ~/p/s/build (devel3) [SIGABRT]>
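
The same workaround can be applied from a small launcher instead of typing the variable each time. This is a minimal sketch, assuming libfftw3.so is on the loader's search path as in the command above. The helper names with_preload and run_preloaded are hypothetical. LD_PRELOAD entries may be separated by spaces or colons, so the helper accepts both and writes colons.

```python
import os
import re
import subprocess
import sys


def with_preload(env, library="libfftw3.so"):
    """Return a copy of env with `library` listed in LD_PRELOAD.

    If the library is missing it is placed first; existing entries are kept.
    The input mapping is not modified.
    """
    new_env = dict(env)
    current = [p for p in re.split(r"[ :]+", new_env.get("LD_PRELOAD", "")) if p]
    if library not in current:
        current.insert(0, library)
    new_env["LD_PRELOAD"] = ":".join(current)
    return new_env


def run_preloaded(script, library="libfftw3.so"):
    """Run `script` in a fresh interpreter with the library preloaded."""
    result = subprocess.run([sys.executable, script],
                            env=with_preload(os.environ, library))
    return result.returncode
```

For example, run_preloaded("example.py") would be roughly equivalent to the LD_PRELOAD=libfftw3.so python example.py invocation above. The variable has to be set before the interpreter starts, which is why the sketch launches a child process rather than changing os.environ in place.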

s-sajid-ali commented 2 years ago

Of interest: https://www.intel.com/content/www/us/en/develop/documentation/onemkl-linux-developer-guide/top/managing-performance-and-memory/improving-performance-with-threading/avoiding-conflicts-in-the-execution-environment.html