astro-informatics / purify

Next-generation radio interferometric imaging.
https://astro-informatics.github.io/purify
GNU General Public License v2.0

Segfault at the end of main.cc #93

Closed Luke-Pratley closed 7 years ago

Luke-Pratley commented 7 years ago

I am now seeing the same segfault that was seen when running PURIFY in South Africa, this time while developing imaging of multiple channels.

I am going to try to debug and solve the problem, but it might be difficult. Since it occurs at the end of the program, it might be that some memory is not being freed, or that something is wrong with a destructor.

adrianjhpc commented 7 years ago

Are you getting a core file from the segfault?

-- The University of Edinburgh is a charitable body, registered in Scotland, with registration number SC005336.

Luke-Pratley commented 7 years ago

@adrianjhpc No, there is no core file produced from the segfault, and it says nothing about it in the error. Do you know what this would mean?

I get something along the lines of

[1]    28750 segmentation fault  ./purify --measurement_set ../data/vla/at166B.3C129.c0.ms --imsize 256 --name

adrianjhpc commented 7 years ago

Try:

ulimit -c unlimited

before you run purify and see if it generates a core file then. This is assuming you're running this on the command line.

Luke-Pratley commented 7 years ago

@adrianjhpc Cool, now a core file is dumped into /cores/. But, how can I use it?

adrianjhpc commented 7 years ago

You can now use a debugger like gdb to view the core file.

So:

gdb ./purify /cores/nameofcorefile

Should load up the debugger with the core file, then you can type:

backtrace

and it should show you where it is seg faulting.

Luke-Pratley commented 7 years ago

Thanks, that helped quite a lot. I also started to use lldb (gdb on Mac was having problems that were taking too long to solve) and set breakpoints.

The output at the crash is

Process 51541 stopped
* thread #1: tid = 0x229684, 0x000000010000b389 purify`main(argc=<unavailable>, argv=<unavailable>) + 23865 at main.cc:364, queue = 'com.apple.main-thread', stop reason = step over
    frame #0: 0x000000010000b389 purify`main(argc=<unavailable>, argv=<unavailable>) + 23865 at main.cc:364
   361      if (params.run_diagnostic)
   362        out_diagnostic.close();
   363      PURIFY_HIGH_LOG("Plane {} finished!", channel_number + 1);
-> 364    }
   365    PURIFY_HIGH_LOG("All planes imaged!");
   366    return 0;
   367  }
(lldb) n
Process 51541 stopped
* thread #1: tid = 0x229684, 0x000000010041f492 libfftw3.3.dylib`fftw_plan_awake + 19, queue = 'com.apple.main-thread', stop reason = EXC_BAD_ACCESS (code=EXC_I386_GPFLT)
    frame #0: 0x000000010041f492 libfftw3.3.dylib`fftw_plan_awake + 19
libfftw3.3.dylib`fftw_plan_awake:
->  0x10041f492 <+19>: movq   0x8(%rax), %rax
    0x10041f496 <+23>: callq  *%rax
    0x10041f498 <+25>: movl   %ebp, 0x30(%rbx)
    0x10041f49b <+28>: popq   %rax

The cause looks like it is related to FFTW. The crash happens when the FFTW-related objects go out of scope, which suggests a problem with the FFTW plans. I will try making sure the plans are destroyed and cleaned up explicitly. I guess FFTW was written in C, so it is not going to free its plans automatically when they go out of scope. The other possibility is that something still references FFTW data after it has already gone out of scope. I am still unsure what the actual problem is.
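
For a handle that comes from a C library, one common way to tie cleanup to scope is a std::unique_ptr with a custom deleter. The sketch below is only an illustration: fake_plan, fake_plan_create and fake_plan_destroy are hypothetical stand-ins (not FFTW functions) so that it compiles without FFTW. In real code the deleter would presumably call fftw_destroy_plan instead.

```cpp
#include <cassert>
#include <memory>

// Hypothetical stand-in for a C library handle (NOT the FFTW API).
struct fake_plan { int size; };
static int live_plans = 0;  // handles created but not yet destroyed

fake_plan* fake_plan_create(int size) {
  ++live_plans;
  return new fake_plan{size};
}

void fake_plan_destroy(fake_plan* p) {
  --live_plans;
  delete p;
}

// Deleter so that std::unique_ptr calls the C-style destroy function.
struct PlanDeleter {
  void operator()(fake_plan* p) const { fake_plan_destroy(p); }
};
using PlanPtr = std::unique_ptr<fake_plan, PlanDeleter>;

// Returns the number of live plans after a scope that owned one has ended.
int live_plans_after_scope() {
  {
    PlanPtr plan(fake_plan_create(256));
    assert(live_plans == 1);
  }  // plan is destroyed here, exactly once
  return live_plans;
}
```

Because a std::unique_ptr cannot be copied, an accidental copy of the owner becomes a compile error instead of a second call to the destroy function at run time.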

adrianjhpc commented 7 years ago

Are you using the fftw threading functionality? If so, it could be that the cleanup stuff is being called by one thread before another thread is finished running some of the fftw stuff.

Luke-Pratley commented 7 years ago

@adrianjhpc Yeah, I have been trying to see if it is related to that. If I don't use threads, I think I get the same error.

From what I can tell, I think it is something related to how the plans are stored within Eigen, and what happens when they are destroyed. Somehow something is going wrong when the object holding the plans gets destroyed.

adrianjhpc commented 7 years ago

It almost seems like the plans are getting destroyed before the fftw routines have finished.

Luke-Pratley commented 7 years ago

@adrianjhpc So, it turns out the map containing the plans was being destroyed twice... I also found the cause.

The MeasurementOperator contains its own FFTOperator. The FFTOperator inherits from the Eigen FFT class, and stores its plans using a std::map. When the measurement operator goes out of scope, it has to destroy the FFT operator. When this happens, the plans in the std::map are destroyed.

However, in that function I was passing the measurement operator by value (non-const, not by reference), and it was being destroyed at the end of the function. Then, it was being destroyed again later. So, somehow it was like it was going out of scope twice... which caused the plans to be destroyed twice... I don't completely understand how this can happen the way it did though.
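
One plausible mechanism, assuming the copy made for the by-value argument ends up sharing the same plan handles as the original (the thread does not confirm this): the copy is destroyed when the function returns and releases the plans, and the original releases them again when it goes out of scope. A minimal sketch with hypothetical names (FakeOperator, by_value, by_const_ref); it counts destructor calls per plan id instead of freeing anything, so it is safe to run:

```cpp
#include <cassert>
#include <map>

// Counts how often each plan id is "destroyed". A real library would free
// memory here, so a count of 2 would correspond to a double free.
static std::map<int, int> destroy_count;

// Hypothetical stand-in for an operator that owns a plan through a raw handle.
class FakeOperator {
 public:
  explicit FakeOperator(int plan_id) : plan_id_(plan_id) {}
  // Defaulted copy: duplicates the handle, not the plan behind it,
  // so the copy believes it owns the same plan as the original.
  FakeOperator(const FakeOperator&) = default;
  ~FakeOperator() { ++destroy_count[plan_id_]; }

 private:
  int plan_id_;
};

// Takes a copy; the copy's destructor runs when the call finishes.
void by_value(FakeOperator op) { (void)op; }

// Takes a reference; no copy is made and no extra destructor runs.
void by_const_ref(const FakeOperator& op) { (void)op; }

int destroys_with_by_value() {
  destroy_count.clear();
  {
    FakeOperator op(1);
    by_value(op);
  }  // the original is destroyed here, after the copy already was
  return destroy_count[1];
}

int destroys_with_const_ref() {
  destroy_count.clear();
  {
    FakeOperator op(2);
    by_const_ref(op);
  }
  return destroy_count[2];
}
```

Here destroys_with_by_value() returns 2 and destroys_with_const_ref() returns 1. Passing by const reference avoids the copy; deleting the copy constructor, or giving the class a copy that duplicates the underlying plans, would make the by-value call either fail to compile or behave correctly.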