GalSim-developers / GalSim

The modular galaxy image simulation toolkit. Documentation:
http://galsim-developers.github.io/GalSim/
Other
224 stars 105 forks source link

Fix segmentation faults in the Silicon class #1224

Closed rmjarvis closed 1 year ago

rmjarvis commented 1 year ago

@cwwalter was getting occasional segmentation faults using GalSim main to do some imSim work, stemming from the Silicon class, which this PR partially fixes. (It fixes them for CPU runs, but not completely for GPU runs.)

It turns out there were two serious problems with the current implementation related to calculating the number of pixels to work on:

  1. The number of pixels was sometimes calculated as int imageDataSize = _delta.getMaxPtr() - _delta.getData();. This is sometimes correct, but it's not reliable, since _delta gets resized from time to time. When the size decreases, we don't deallocate and reallocate. So getMaxPtr() can be much larger than the number of pixels currently being worked on. This was the case in what Chris was doing, so some loops ended up writing many more values than the image stored, overrunning the allocated data. Hence SIGSEGV.
  2. Even when calculating the number of pixel correctly (from NCol * NRow), which it also did in various places, the addDelta function was wrongly assuming that the target image is contiguous in memory. This doesn't have to be the case. It's common for it to be a subImage of a larger image. And the code should also work when the image is flipped or rotated 180 degrees, so step and stride may be negative. This error could also potentially cause seg faults, but mostly it just did the wrong thing when the target was not contiguous data.

I fixed both problems for CPU execution. But I'm pretty confident that some of the GPU syncing is still not correct when the target image is not contiguous. I marked these with FIXME comments.

@jamesp-epcc Can you please fix those remaining GPU commands to make them work when step!=1? And especially when step or stride < 0, which seems particularly tricky to get right. I don't have easy access to a GPU machine to test on.

jamesp-epcc commented 1 year ago

Had a look at this today but I think I will have to look into it more deeply next week. I fixed a few minor problems (a stray comma in line 1134, missing targetData pointer in addDelta) and was able to get it to build for GPU, but it's crashing when running the test suite. I'm seeing a case (specifically, the 4th time initialize gets called during test_silicon) where the pointer returned by getMaxPtr() is less than the one returned by getData(), so nmem is negative and GPU memory allocation fails. I will need to rethink how the allocation is handled and come up with a method that works for all cases (including non-contiguous cases).

jamesp-epcc commented 1 year ago

OK, I have this working on GPU now. Essentially the entire memory range of the image (at least the part that is used in the current view) needs to be mapped to GPU, so it's necessary to work out the minimum and maximum pointers that can ever be accessed, then map the range between them. The same range needs to be used when updating the target image on the host from the GPU, and also when removing the mappings at the end.

The working code is here: https://github.com/jamesp-epcc/GalSim/tree/silicon_segfault . I have modified Silicon.cpp and Silicon.h but nothing else.

rmjarvis commented 1 year ago

Those changes look fine to me. If the unit tests are passing on the GPU, you should merge them into this branch.

jamesp-epcc commented 1 year ago

Yes, the full sensor test suite passes on GPU with these changes. I have merged them into this branch.

rmjarvis commented 1 year ago

@jamesp-epcc I'm assuming you already looked through the code here to make your changes, but if you could do a code review of this, that would be great!

jamesp-epcc commented 1 year ago

I'd looked through most of it in the course of my debugging, but I just had a look over the whole thing. The only thing I noticed is that subtractDelta no longer copies the data back from the GPU like addData does. This doesn't cause any test failures just now so it may not strictly be needed, but it seems sensible to keep the behaviour consistent in case it makes a difference later on.

rmjarvis commented 1 year ago

Good catch. Thanks!