Closed Luke-Pratley closed 7 years ago
Are you getting a core file from the segfault?
-- The University of Edinburgh is a charitable body, registered in Scotland, with registration number SC005336.
@adrianjhpc No, there is no core file produced from the segfault, and it says nothing about it in the error. Do you know what this would mean?
I get something along the lines of
[1] 28750 segmentation fault ./purify --measurement_set ../data/vla/at166B.3C129.c0.ms --imsize 256 --name
Try:
ulimit -c unlimited
before you run purify and see if it generates a core file then. This is assuming you're running this on the command line.
@adrianjhpc Cool, now a core file is dumped into /cores/. But how can I use it?
You can now use a debugger like gdb to view the core file.
So:
gdb ./purify /cores/nameofcorefile
Should load up the debugger with the core file, then you can type:
backtrace
and it should show you where it is seg faulting.
Thanks, that helped quite a lot. I also started to use lldb (gdb on Mac was having problems that were taking too long to solve) and set breakpoints.
The output at the crash is
Process 51541 stopped
* thread #1: tid = 0x229684, 0x000000010000b389 purify`main(argc=<unavailable>, argv=<unavailable>) + 23865 at main.cc:364, queue = 'com.apple.main-thread', stop reason = step over
frame #0: 0x000000010000b389 purify`main(argc=<unavailable>, argv=<unavailable>) + 23865 at main.cc:364
361 if (params.run_diagnostic)
362 out_diagnostic.close();
363 PURIFY_HIGH_LOG("Plane {} finished!", channel_number + 1);
-> 364 }
365 PURIFY_HIGH_LOG("All planes imaged!");
366 return 0;
367 }
(lldb) n
Process 51541 stopped
* thread #1: tid = 0x229684, 0x000000010041f492 libfftw3.3.dylib`fftw_plan_awake + 19, queue = 'com.apple.main-thread', stop reason = EXC_BAD_ACCESS (code=EXC_I386_GPFLT)
frame #0: 0x000000010041f492 libfftw3.3.dylib`fftw_plan_awake + 19
libfftw3.3.dylib`fftw_plan_awake:
-> 0x10041f492 <+19>: movq 0x8(%rax), %rax
0x10041f496 <+23>: callq *%rax
0x10041f498 <+25>: movl %ebp, 0x30(%rbx)
0x10041f49b <+28>: popq %rax
I found that the problem appears to be related to FFTW. The crash happens when the FFTW-related objects go out of scope, which suggests it involves the FFTW plans. I will try making sure the plans are destroyed and cleaned up explicitly. FFTW is a C library, so it will not free plans automatically; each plan has to be released with fftw_destroy_plan. The other possibility is that something still references an FFTW object that has already gone out of scope. I am still unsure what the actual problem is.
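As an illustration of the cleanup concern above, here is a minimal sketch of tying a C-style handle to a C++ scope so that it is released exactly once. Everything in it is hypothetical: fake_plan, fake_create_plan and fake_destroy_plan only stand in for a C library's create/destroy pair and are not FFTW or PURIFY code.

```cpp
#include <cassert>
#include <memory>

// Hypothetical C-style API standing in for a library's plan create/destroy pair.
struct fake_plan { int id; };
static int live_plans = 0;  // number of plans currently allocated
fake_plan *fake_create_plan(int id) { ++live_plans; return new fake_plan{id}; }
void fake_destroy_plan(fake_plan *p) { --live_plans; delete p; }

// Deleter that forwards to the C-style destroy function.
struct PlanDeleter {
  void operator()(fake_plan *p) const { fake_destroy_plan(p); }
};

// Move-only owner: the plan is destroyed once, when the handle goes out of scope.
using PlanHandle = std::unique_ptr<fake_plan, PlanDeleter>;

// Creates a plan in an inner scope and reports how many plans remain afterwards.
int live_plans_after_scope() {
  {
    PlanHandle plan(fake_create_plan(1));
    assert(live_plans == 1);  // the plan exists while the handle is alive
  }                           // handle destroyed here, so the plan is released
  return live_plans;
}
```

Because std::unique_ptr cannot be copied, an accidental copy of the owner fails to compile instead of leading to a second destroy call at run time.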
Are you using the fftw threading functionality? If so, it could be that the cleanup stuff is being called by one thread before another thread is finished running some of the fftw stuff.
@adrianjhpc Yeah, I have been trying to see if it is related to that. If I don't use threads, I think I get the same error.
From what I can tell, it is related to how the plans are stored within Eigen, and what happens when they are destroyed. Somehow something goes wrong when the object holding the plans gets destroyed.
It almost seems like the plans are getting destroyed before the fftw routines have finished.
@adrianjhpc So, it turns out the map containing the plans was being destroyed twice... I also found the cause.
The MeasurementOperator contains its own FFTOperator. The FFTOperator inherits from the Eigen FFT class, and stores its plans using an std::map. When the measurement operator goes out of scope, its destructor also has to destroy the FFT operator. When this happens, the plans in the std::map are destroyed.
However, in that function I was passing the measurement operator by value (non-const, non-reference), so the copy was destroyed at the end of the function. Then the original was destroyed again later. So it was as if the operator went out of scope twice, which caused the plans to be destroyed twice. I don't completely understand how this can happen the way it did though.
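One plausible mechanism for the double destruction described above is a shallow copy: if a class stores raw plan handles and defines a destructor but no deep copy, passing it by value creates a second object that shares the same handles, and both objects release them. The sketch below only illustrates that pattern under this assumption. FakePlan, ShallowPlanCache and the two image_* functions are hypothetical names, not Eigen, FFTW or PURIFY code, and the destructor counts destroy calls instead of freeing anything so that the example stays safe to run.

```cpp
#include <cassert>
#include <map>

// Hypothetical stand-in for a plan: it only counts how often it is "destroyed".
struct FakePlan {
  int destroy_count = 0;
};

// Hypothetical cache of raw plan handles. It has a destructor but relies on the
// compiler-generated copy constructor, so a copy shares the same handles.
class ShallowPlanCache {
 public:
  void add(int key, FakePlan *plan) { plans_[key] = plan; }
  ~ShallowPlanCache() {
    // Stands in for calling the library's destroy function on every stored plan.
    for (auto &entry : plans_) ++entry.second->destroy_count;
  }

 private:
  std::map<int, FakePlan *> plans_;
};

// Takes a copy: the copy's destructor runs when the call finishes.
void image_by_value(ShallowPlanCache) {}

// Takes a reference: no copy is made, so no extra destructor runs.
void image_by_reference(const ShallowPlanCache &) {}

int destroys_when_passed_by_value() {
  FakePlan plan;
  {
    ShallowPlanCache cache;
    cache.add(0, &plan);
    image_by_value(cache);  // first destroy: the by-value copy
  }                         // second destroy: the original cache
  return plan.destroy_count;
}

int destroys_when_passed_by_reference() {
  FakePlan plan;
  {
    ShallowPlanCache cache;
    cache.add(0, &plan);
    image_by_reference(cache);
  }                         // only destroy: the original cache
  return plan.destroy_count;
}
```

In this sketch the by-value call leads to two destroy calls on the same plan and the by-reference call to one. With a real C handle the second call would act on already-freed memory, which is consistent with a crash at scope exit. Passing by const reference, or making the owning class non-copyable, avoids the extra destructor run.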
I am currently seeing the segfault that was seen when running PURIFY in South Africa, this time while developing imaging of multiple channels.
I am going to try to debug and solve the problem, but it might be difficult. Since it occurs at the end of the program, it might be that some memory is not being freed, or that something is wrong with a destructor.