exasmr / openmc

OpenMC Monte Carlo Code
https://docs.openmc.org
MIT License
15 stars 9 forks source link

Fix for phantom seg fault on A100 and Titan V GPUs #16

Closed jtramm closed 1 year ago

jtramm commented 1 year ago

This PR fixes a seg fault/CUDA alignment issue that had been causing crashes on the the SXM A100 (ANL Polaris) and Titan V GPUs when compiled with LLVM.

The issue was caused here in thermal.h:

  struct Reaction {
    // Default constructor
    Reaction() { }

    // Data members
    std::unique_ptr<Function1DFlatContainer> xs; //!< Cross section
    std::unique_ptr<AngleEnergyFlatContainer> distribution; //!< Secondary angle-energy distribution
    Function1DFlatContainer* device_xs;
    AngleEnergyFlatContainer* device_distribution;
  };

, specifically due to the lack of initialization in the device_xs field. This issue is triggered when we map the reaction data to device as:

void ThermalScattering::copy_to_device()
{
  data_.copy_to_device();
  for (auto& d : data_) {
    if (d.elastic_.xs) {
      d.elastic_.device_xs = d.elastic_.xs.get();
      d.elastic_.device_distribution = d.elastic_.distribution.get();
      #pragma omp target enter data map(to: d.elastic_.device_xs[:1])
      #pragma omp target enter data map(to: d.elastic_.device_distribution[:1])
      d.elastic_.device_xs->copy_to_device();
      d.elastic_.device_distribution->copy_to_device();
    }

    d.inelastic_.device_xs = d.inelastic_.xs.get();
    d.inelastic_.device_distribution = d.inelastic_.distribution.get();
    #pragma omp target enter data map(to: d.inelastic_.device_xs[:1])
    #pragma omp target enter data map(to: d.inelastic_.device_distribution[:1])
    d.inelastic_.device_xs->copy_to_device();
    d.inelastic_.device_distribution->copy_to_device();
  }
  kTs_.copy_to_device();
}

The problem stems from the lines:

    if (d.elastic_.xs) {
      d.elastic_.device_xs = d.elastic_.xs.get();

where we are only initializing the device_xs pointer to something if there is actually data that needs to be mapped. Later, during transport, this pointer will be tested to see if it is null or not to determine if we need to process those reactions. If the data is not present, then the pointer didn't get initialized to anything, so we will end up reading from an uninitialized pointer location. Most systems that is going to be 0/null, which is why it was not causing problems on most of the cards we had been developing on. But on some cards it ended up being consistently a "1" value, so we'd read from a garbage location causing the seg fault.

The fix is very easy -- we can just amend the fields to be initialized to {nullptr} in thermal.h. I also went ahead to and initialized pointers that are called device_... elsewhere in the code, even though this is probably not needed, just to ensure we don't have another seg fault like this lurking out there. Some of the cases I've initialized to nullptr I think definitely don't need it due to being globals, but figured it doesn't hurt to make things explicit as it could make grepping for such issues easier down the line.