Closed by s-sajid-ali 2 years ago
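This error comes not from Synergia but from MKL, which overrides the FFTW routines in an attempt to provide better performance and segfaults instead. In the lldb session below, the call to fftw_execute at distributed_fft3d_fftw.cc:101 resolves to libmkl_intel_lp64.so.2 rather than to FFTW. MKL's fftw_execute then jumps through a pointer loaded from offset 0x40 of its argument and the process dies with SIGSEGV at 0xe3eda0.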
(lldb) stepi
Process 9868 stopped
* thread #1, name = 'python', stop reason = instruction step into
frame #0: 0x00007fffe00331ef libsynergia_distributed_fft.so`Distributed_fft3d::transform(this=0x00000000009d9e50, in=<unavailable>, out=0x00000000008fd110) at distributed_fft3d_fftw.cc:101:5
98 memcpy( (void*)data, (void*)&in(lower*plane_real),
99 nz * plane_real * sizeof(double) );
100
-> 101 fftw_execute(plan);
102
103 memcpy( (void*)&out(lower*plane_cplx*2), (void*)(workspace),
104 nz * plane_cplx * sizeof(double) * 2 );
(lldb) stepi
Process 9868 stopped
* thread #1, name = 'python', stop reason = instruction step into
frame #0: 0x00007fffe0032130 libsynergia_distributed_fft.so`fftw_execute
libsynergia_distributed_fft.so`fftw_execute:
-> 0x7fffe0032130 <+0>: jmpq *0x4f62(%rip) ; _GLOBAL_OFFSET_TABLE_ + 152
0x7fffe0032136 <+6>: pushq $0x10
0x7fffe003213b <+11>: jmp 0x7fffe0032020 ; ___lldb_unnamed_symbol187
libsynergia_distributed_fft.so`__cxa_throw:
0x7fffe0032140 <+0>: jmpq *0x4f5a(%rip) ; _GLOBAL_OFFSET_TABLE_ + 160
(lldb) stepi
Process 9868 stopped
* thread #1, name = 'python', stop reason = instruction step into
frame #0: 0x00007fffed65a910 libmkl_intel_lp64.so.2`fftw_execute
libmkl_intel_lp64.so.2`fftw_execute:
-> 0x7fffed65a910 <+0>: testq %rdi, %rdi
0x7fffed65a913 <+3>: je 0x7fffed65a91e ; <+14>
0x7fffed65a915 <+5>: movq 0x40(%rdi), %rax
0x7fffed65a919 <+9>: testq %rax, %rax
(lldb) stepi
Process 9868 stopped
* thread #1, name = 'python', stop reason = instruction step into
frame #0: 0x00007fffed65a913 libmkl_intel_lp64.so.2`fftw_execute + 3
libmkl_intel_lp64.so.2`fftw_execute:
-> 0x7fffed65a913 <+3>: je 0x7fffed65a91e ; <+14>
0x7fffed65a915 <+5>: movq 0x40(%rdi), %rax
0x7fffed65a919 <+9>: testq %rax, %rax
0x7fffed65a91c <+12>: jne 0x7fffed65a91f ; <+15>
(lldb) stepi
Process 9868 stopped
* thread #1, name = 'python', stop reason = instruction step into
frame #0: 0x00007fffed65a915 libmkl_intel_lp64.so.2`fftw_execute + 5
libmkl_intel_lp64.so.2`fftw_execute:
-> 0x7fffed65a915 <+5>: movq 0x40(%rdi), %rax
0x7fffed65a919 <+9>: testq %rax, %rax
0x7fffed65a91c <+12>: jne 0x7fffed65a91f ; <+15>
0x7fffed65a91e <+14>: retq
(lldb) stepi
Process 9868 stopped
* thread #1, name = 'python', stop reason = instruction step into
frame #0: 0x00007fffed65a919 libmkl_intel_lp64.so.2`fftw_execute + 9
libmkl_intel_lp64.so.2`fftw_execute:
-> 0x7fffed65a919 <+9>: testq %rax, %rax
0x7fffed65a91c <+12>: jne 0x7fffed65a91f ; <+15>
0x7fffed65a91e <+14>: retq
0x7fffed65a91f <+15>: jmpq *%rax
(lldb) stepi
Process 9868 stopped
* thread #1, name = 'python', stop reason = instruction step into
frame #0: 0x00007fffed65a91c libmkl_intel_lp64.so.2`fftw_execute + 12
libmkl_intel_lp64.so.2`fftw_execute:
-> 0x7fffed65a91c <+12>: jne 0x7fffed65a91f ; <+15>
0x7fffed65a91e <+14>: retq
0x7fffed65a91f <+15>: jmpq *%rax
0x7fffed65a921 <+17>: nopl (%rax,%rax)
(lldb) stepi
Process 9868 stopped
* thread #1, name = 'python', stop reason = instruction step into
frame #0: 0x00007fffed65a91f libmkl_intel_lp64.so.2`fftw_execute + 15
libmkl_intel_lp64.so.2`fftw_execute:
-> 0x7fffed65a91f <+15>: jmpq *%rax
0x7fffed65a921 <+17>: nopl (%rax,%rax)
0x7fffed65a929 <+25>: nopl (%rax)
libmkl_intel_lp64.so.2`fftw_execute_dft:
0x7fffed65a930 <+0>: subq $0x68, %rsp
(lldb) stepi
Process 9868 stopped
* thread #1, name = 'python', stop reason = instruction step into
frame #0: 0x0000000000e3eda0
-> 0xe3eda0: rolb %bl
(lldb) stepi
Process 9868 stopped
* thread #1, name = 'python', stop reason = signal SIGSEGV: address access protected (fault address: 0xe3eda0)
frame #0: 0x0000000000e3eda0
-> 0xe3eda0: rolb %bl
(lldb) register read bl
bl = 0x50
(lldb) ^D
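A quicker first check than single-stepping is to list which FFT-related shared objects are mapped into the Python process. The following is a minimal sketch, assuming a Linux system where /proc/self/maps is available. The helper name `mapped_libraries` and the default substrings are illustrative and are not part of Synergia, MKL, or FFTW. It only shows which libraries are loaded, not which one the dynamic linker actually bound fftw_execute to, so it complements the lldb trace above rather than replacing it.

```python
import os


def mapped_libraries(substrings=("mkl", "fftw"), maps_path="/proc/self/maps"):
    """Return sorted paths of mapped files whose file name contains any substring."""
    found = set()
    with open(maps_path) as maps:
        for line in maps:
            # Each line: address perms offset dev inode [pathname]
            fields = line.split(None, 5)
            if len(fields) < 6:
                continue  # anonymous mapping with no backing file
            path = fields[5].strip()
            name = os.path.basename(path).lower()
            if any(s in name for s in substrings):
                found.add(path)
    return sorted(found)


# Hypothetical usage, after `import synergia` has pulled in its libraries:
#     print(mapped_libraries())
```

If both an MKL library and an FFTW library show up in the output, the symbol clash described above is at least possible in that process.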
Since I don't know how to prevent MKL from overriding the FFTW routines, a short-term fix is to LD_PRELOAD the FFTW libraries, which prevents the above error.
sajid@LAPTOP-CDJT2P3R ~/p/s/build (devel3)> LD_PRELOAD=libfftw3.so python example.py
Kokkos::OpenMP::initialize WARNING: OMP_PROC_BIND environment variable not set
In general, for best performance with OpenMP 4.0 or better set OMP_PROC_BIND=spread and OMP_PLACES=threads
For best performance with OpenMP 3.1 set OMP_PROC_BIND=true
For unit testing set OMP_PROC_BIND=false
Propagator: starting turn 1, final turn 10
Propagator: turn 1/inf., time = 0.567s, macroparticles = (1048576) / ()
Propagator: turn 2/inf., time = 0.599s, macroparticles = (1048576) / ()
Propagator: turn 3/inf., time = 0.574s, macroparticles = (1048576) / ()
Propagator: turn 4/inf., time = 0.562s, macroparticles = (1048576) / ()
Propagator: turn 5/inf., time = 0.569s, macroparticles = (1048576) / ()
Propagator: turn 6/inf., time = 0.546s, macroparticles = (1048576) / ()
Propagator: turn 7/inf., time = 0.555s, macroparticles = (1048576) / ()
Propagator: turn 8/inf., time = 0.560s, macroparticles = (1048576) / ()
Propagator: turn 9/inf., time = 0.572s, macroparticles = (1048576) / ()
Propagator: turn 10/inf., time = 0.615s, macroparticles = (1048576) / ()
Propagator: maximum number of turns reached
Propagator: total time = 5.870s
Traceback (most recent call last):
File "/home/sajid/packages/synergia2/build/example.py", line 99, in main
run()
File "/home/sajid/packages/synergia2/build/example.py", line 94, in run
synergia.simulation.checkpoint_save(propagator, sim)
RuntimeError: Trying to save an unregistered polymorphic type (Space_charge_3d_open_hockney_options).
Make sure your type is registered with CEREAL_REGISTER_TYPE and that the archive you are using was included (and registered with CEREAL_REGISTER_ARCHIVE) prior to calling CEREAL_REGISTER_TYPE.
If your type is already registered and you still see this error, you may need to use CEREAL_REGISTER_DYNAMIC_INIT.
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/home/sajid/packages/synergia2/build/example.py", line 103, in <module>
main()
File "/home/sajid/packages/synergia2/build/example.py", line 101, in main
raise RuntimeError("Failure to launch fodo.run")
RuntimeError: Failure to launch fodo.run
libc++abi: terminating with uncaught exception of type std::runtime_error: Kokkos allocation "particles_discards" is being deallocated after Kokkos::finalize was called
fish: Job 1, 'LD_PRELOAD=libfftw3.so python e…' terminated by signal SIGABRT (Abort)
sajid@LAPTOP-CDJT2P3R ~/p/s/build (devel3) [SIGABRT]>
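The LD_PRELOAD workaround used in the command above can also be applied from a small launcher script, which may be convenient when the example is started by another tool. This is a minimal sketch under stated assumptions: the helper names `env_with_preload` and `run_with_fftw` are hypothetical, and `libfftw3.so` is taken from the command above and must be resolvable by the dynamic linker on the target machine. LD_PRELOAD is read when a process starts, so it has to be set for a child process; changing `os.environ` inside an already running interpreter does not affect libraries that are already loaded.

```python
import os
import subprocess
import sys


def env_with_preload(library, base_env=None):
    """Return a copy of the environment with `library` first in LD_PRELOAD."""
    env = dict(os.environ if base_env is None else base_env)
    existing = env.get("LD_PRELOAD", "")
    # Keep any libraries that were already preloaded, placing ours first.
    env["LD_PRELOAD"] = f"{library}:{existing}" if existing else library
    return env


def run_with_fftw(script, library="libfftw3.so"):
    """Run `script` in a child Python interpreter with `library` preloaded."""
    return subprocess.run(
        [sys.executable, script],
        env=env_with_preload(library),
        check=False,  # let the caller inspect returncode instead of raising
    )


# Hypothetical usage:
#     result = run_with_fftw("example.py")
#     print(result.returncode)
```

This only addresses the FFTW/MKL symbol clash. The checkpoint_save error and the Kokkos abort at exit shown in the log above are separate failures and would still occur.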
Of interest: https://www.intel.com/content/www/us/en/develop/documentation/onemkl-linux-developer-guide/top/managing-performance-and-memory/improving-performance-with-threading/avoiding-conflicts-in-the-execution-environment.html
When running the example on the homepage, I encountered the following segfault:
This segfault is transient and does not always occur.