cholla-hydro / cholla

A GPU-based hydro code
https://github.com/cholla-hydro/cholla/wiki
MIT License

Add Compute sanitizer to automated tests #247

Closed bcaddy closed 7 months ago

bcaddy commented 1 year ago

I'd like to add some compute sanitizer runs to our automated testing to catch potential errors in the CUDA code. To do this we would need to address the issues raised in #197 and have extremely small test problems to run for each build type.

There's a script for running the compute sanitizer in the tools directory: tools/cholla-nv-compute-sanitizer.sh

Issue #197

I think I've resolved the initcheck issues here but not the rest. We should check the remaining issues and get them resolved.

Small Test Problems

To run the compute sanitizer we need to actually run Cholla on some problem. The trick is that some of the compute sanitizer checks, namely memcheck, make the code run extremely slowly, on the order of 30 s per time step. Since most of the code is the same from time step to time step, we need simple problems for each build type that only run for a handful of time steps. In the case of hydro this could be a very low resolution Sod tube with a limited run time. A similar MHD shock tube would work for MHD, but I don't know about the other build types.

bcaddy commented 1 year ago

Maybe instead of having really short test problems we can put in an early exit? Either a runtime or compile-time option that has Cholla exit gracefully after 5 time steps or something.

Edit: We already have this in the form of the N_STEPS_LIMIT macro
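For anyone unfamiliar with the idea, here is a minimal sketch of a compile-time step limit. It is not Cholla's actual main loop: `run_steps` and the default value of 5 are made up for illustration, and the only name taken from this thread is `N_STEPS_LIMIT`, which is assumed here to be passed as a compiler definition.

```cpp
#include <cassert>
#include <cstdio>

// Assumption: the limit is supplied at compile time, e.g. -DN_STEPS_LIMIT=5.
// The default below exists only so this sketch compiles on its own.
#ifndef N_STEPS_LIMIT
#define N_STEPS_LIMIT 5
#endif

// Hypothetical stand-in for a time-step loop; returns the number of steps taken.
int run_steps(double t_end, double dt)
{
  assert(dt > 0.0);
  double t   = 0.0;
  int n_step = 0;
  while (t < t_end) {
    t += dt;  // a real code would advance the grid here
    ++n_step;
    // Leave the loop cleanly once the step limit is reached so that
    // normal shutdown (frees, finalization) still runs afterwards.
    if (n_step >= N_STEPS_LIMIT) {
      std::printf("Reached N_STEPS_LIMIT = %d, exiting early at t = %g\n", N_STEPS_LIMIT, t);
      break;
    }
  }
  return n_step;
}
```

With the limit at 5, `run_steps(100.0, 1.0)` stops after 5 steps, while a run that finishes sooner is unaffected. Breaking out of the loop rather than calling `exit` matters for the sanitizer, since the deallocation paths still get exercised.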

bcaddy commented 1 year ago

Currently the initcheck errors have been "solved" by just initializing all memory when allocated. This doesn't actually solve the problem of reading uninitialized memory though; it just hides it.
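To illustrate why bulk initialization hides the problem rather than fixing it, here is a host-only toy analogy. It is not how the compute sanitizer works internally, and `TrackedBuffer` and `count_flagged_reads` are hypothetical names invented for this sketch. The buffer records which elements were ever written and flags reads of unwritten ones.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Toy host-side buffer that remembers which elements were ever written.
// Hypothetical helper for illustration only.
class TrackedBuffer
{
 public:
  explicit TrackedBuffer(std::size_t n) : data_(n), written_(n, false) {}

  void write(std::size_t i, double value)
  {
    data_.at(i)    = value;
    written_.at(i) = true;
  }

  // A read is flagged if the element was never written beforehand.
  double read(std::size_t i)
  {
    if (!written_.at(i)) {
      ++uninitialized_reads_;
    }
    return data_.at(i);
  }

  // Mimics "initialize all memory when allocated".
  void bulk_initialize(double value)
  {
    for (std::size_t i = 0; i < data_.size(); ++i) {
      write(i, value);
    }
  }

  std::size_t uninitialized_reads() const { return uninitialized_reads_; }

 private:
  std::vector<double> data_;
  std::vector<bool> written_;
  std::size_t uninitialized_reads_ = 0;
};

// A kernel-like routine with a deliberate bug: it fills n-1 elements but reads n.
std::size_t count_flagged_reads(std::size_t n, bool bulk_init)
{
  assert(n > 0);
  TrackedBuffer buffer(n);
  if (bulk_init) {
    buffer.bulk_initialize(0.0);
  }
  for (std::size_t i = 0; i + 1 < n; ++i) {
    buffer.write(i, 1.0);
  }
  double sum = 0.0;
  for (std::size_t i = 0; i < n; ++i) {
    sum += buffer.read(i);
  }
  (void)sum;  // the value is irrelevant here; only the flagged reads matter
  return buffer.uninitialized_reads();
}
```

Without bulk initialization the buggy read of the last element is flagged once. With bulk initialization nothing is flagged, even though the routine still reads a value it never computed.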

bcaddy commented 1 year ago

@evaneschneider

FYI, the CUDA compute sanitizer can find unused allocations (assuming we remove the bulk initialization). A quick inspection shows that the integrator intermediate arrays (flux, interfaces, etc.) are underutilized by 1-10% depending on the array. With some clever indexing we could definitely free up some memory.
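As a rough picture of what "underutilized" means here, the sketch below computes the fraction of an array that is never touched, given a per-element mask. The functions `unused_fraction` and `touched_mask` and the 1D index range are hypothetical; this says nothing about why the actual Cholla arrays have unused elements.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Fraction of elements that were never touched, given a per-element mask.
double unused_fraction(const std::vector<bool>& touched)
{
  assert(!touched.empty());
  std::size_t unused = 0;
  for (bool was_touched : touched) {
    if (!was_touched) {
      ++unused;
    }
  }
  return static_cast<double>(unused) / static_cast<double>(touched.size());
}

// Hypothetical 1D case: the array is sized for n_cells elements, but the
// kernel only ever touches indices in the half-open range [first, last).
std::vector<bool> touched_mask(std::size_t n_cells, std::size_t first, std::size_t last)
{
  assert(first <= last && last <= n_cells);
  std::vector<bool> touched(n_cells, false);
  for (std::size_t i = first; i < last; ++i) {
    touched[i] = true;
  }
  return touched;
}
```

For example, an array of 8 elements where only indices 1 through 6 are touched has an unused fraction of 0.25. Shrinking the allocation to the touched range and offsetting the index would recover that memory.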