CFD-GO / TCLB

TCLB - Templated MPI+CUDA/CPU Lattice Boltzmann code
https://tclb.io
GNU General Public License v3.0

Failcheck always triggers for arbitrary grid #476

Open kubagalecki opened 7 months ago

kubagalecki commented 7 months ago

The failcheck handler decides whether to terminate a run early by looking for NaN values in the computed quantities. This is reasonable for the Cartesian grid, since quantities are well defined in the entire domain. For the arbitrary grid, however, quantities in the bulk region will usually be NaN. Computing a quantity involves reading the values of neighboring nodes, and by their very nature the neighbors of bulk region nodes often don't exist, so NaN values work their way into the quantity computation. To be clear, this is well-defined behavior in accordance with the design; there's no risk of segfaults or other crashes. However, with the straightforward implementation of the failcheck handler, it produces false positives and terminates a correct simulation early. I see the following solutions to this problem:

  1. Just don't use the failcheck handler when running with arbitrary grid :shit:
  2. Use the failcheck handler with quantities which don't involve accessing neighbors (e.g. raw field values). This would work under the current implementation (in my fork), but requires slight modifications to existing xmls and adding new synthetic quantities just for the purpose of failchecking.
  3. Just check for NaNs in the raw field values without computing any quantities. This has the added benefit of being a device-side reduction (the host only receives a single flag indicating whether a NaN was present), but changes the semantics of the failchecker.
  4. Explicitly set the quantity values for the bulk region. This requires slightly modifying existing models, e.g., for d2q9 `getU()` would contain `if(IamWall) return {0., 0., 0.};`. Not sure how this fits with the existing body of work, but it has the benefit of yielding more meaningful results downstream, i.e., no paraview thresholding for NaNs would be needed.
  5. Ignore NaN values in the bulk region. This is easier said than done, since the notion of a bulk region is only well defined at the stage of generating the arbitrary grid input files. At run time, it would mean detecting nodes with missing neighbors. This is probably the most robust solution (it doesn't involve any modification of the existing models and .xml files), but it requires a bit of work (would probably have to push this bit to next year).
  6. Deeper changes to the arbitrary grid formulation, where bulk nodes are entirely omitted from the results and constitute a sort of hidden state during computation. Personally not a fan of this one.
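
To make the difference between the current behavior and options 3 and 5 concrete, here is a minimal host-side sketch on a toy 1D grid. Everything in it is hypothetical and only for illustration: `Grid`, `load`, `quantity`, and the three `failcheck*` functions are not TCLB code, a missing neighbor is encoded as index `-1`, and reading it is assumed to yield NaN, as described above.

```cpp
#include <cmath>
#include <cstddef>
#include <limits>
#include <vector>

// Hypothetical toy grid (not TCLB's data layout): one raw field value per
// node, plus left/right neighbor indices, with -1 marking a missing neighbor.
struct Grid {
    std::vector<double> field;
    std::vector<int> left, right;
};

// Assumption: reading a missing neighbor yields NaN.
inline double load(const Grid& g, int idx) {
    return idx < 0 ? std::numeric_limits<double>::quiet_NaN()
                   : g.field[static_cast<std::size_t>(idx)];
}

// A neighbor-dependent quantity: NaN whenever a neighbor is missing.
inline double quantity(const Grid& g, std::size_t i) {
    return 0.5 * (load(g, g.left[i]) + load(g, g.right[i]));
}

inline bool hasAllNeighbors(const Grid& g, std::size_t i) {
    return g.left[i] >= 0 && g.right[i] >= 0;
}

// Current semantics: any NaN quantity counts as a failure.
inline bool failcheckNaive(const Grid& g) {
    for (std::size_t i = 0; i < g.field.size(); ++i)
        if (std::isnan(quantity(g, i))) return true;
    return false;
}

// Option 3: inspect only the raw field values, no quantities.
inline bool failcheckRawFields(const Grid& g) {
    for (double v : g.field)
        if (std::isnan(v)) return true;
    return false;
}

// Option 5: inspect quantities, but skip nodes with a missing neighbor.
inline bool failcheckMasked(const Grid& g) {
    for (std::size_t i = 0; i < g.field.size(); ++i)
        if (hasAllNeighbors(g, i) && std::isnan(quantity(g, i))) return true;
    return false;
}
```

On a healthy three-node grid whose end nodes each lack one neighbor, `failcheckNaive` reports a failure while the other two do not. Once a raw field value actually becomes NaN, both `failcheckRawFields` and `failcheckMasked` report it. In a real implementation the option 3 loop would presumably be the device-side reduction mentioned above rather than a host loop.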

CC: @llaniewski @TravisMitchell