Out of memory errors don't work properly

CFD-GO / TCLB

TCLB - Templated MPI+CUDA/CPU Lattice Boltzmann code

https://tclb.io

GNU General Public License v3.0


Out of memory errors don't work properly #496

Closed shkodm closed 7 months ago

shkodm commented 7 months ago

Found on the develop branch when trying to run a large (automatically generated) example. It can be reproduced, for instance, with the following domain sizes:

256 x 512 x 256: [ ] [0] Cumulative allocation of 168231424 b (30.2 GB). Works fine.

256 x 800 x 256: Cumulative allocation of -24577536 b (18446744073.7 GB). Throws the out of memory error correctly, but reports the total allocated memory incorrectly.

512 x 800 x 256: Tries to allocate an incorrect (much smaller) total amount of memory and throws a different error:

Cumulative allocation of -154486272 b (25.6 GB)
[  ] Initializing Lattice ...
[ ] FATAL ERROR: an illegal memory access was encountered in Lattice.hpp at line 445

Expected behaviour: the correct memory allocation is reported and the correct error is thrown (the behaviour of the master branch). Probably related to some casting or overflow.
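The 18446744073.7 GB figure in the second case is consistent with a negative signed byte count being converted to a 64-bit unsigned value before it is divided by 1e9. The sketch below reproduces that arithmetic. `reported_gb` is a hypothetical helper written for illustration, not a function from TCLB, and the assumption that the reporting code performs this conversion is not confirmed by the thread.

```cpp
#include <cstdint>
#include <cstdio>
#include <string>

// Hypothetical helper (not TCLB code): formats a byte count that has ended up
// in a signed 32-bit integer, after converting it to an unsigned 64-bit value.
std::string reported_gb(std::int32_t bytes) {
    // Signed-to-unsigned conversion is modular: a negative value v becomes
    // 2^64 + v, which is a huge positive number.
    const std::uint64_t as_unsigned = static_cast<std::uint64_t>(bytes);

    // Assumption: 1 GB is taken as 1e9 bytes, which matches the logged figure.
    char buf[64];
    std::snprintf(buf, sizeof buf, "%.1f",
                  static_cast<double>(as_unsigned) / 1e9);
    return std::string(buf);
}
```

Calling `reported_gb(-24577536)` with the byte count from the log yields `18446744073.7`, which matches the reported value.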

llaniewski commented 7 months ago

It seems the problem was introduced here: https://github.com/CFD-GO/TCLB/commit/101e6c2c1912ca8cc97b8402eababf23e548cc28#diff-d31965790d0025ccd455cffbe6c4c9fdcdcc33946b62479b634f87f9d8a574f9R197 when @kubagalecki deleted the (size_t) conversion in the calculation of size. I'll make a pull request to fix this (and fix the printing at the same time).
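As a rough illustration of why the (size_t) conversion matters, the sketch below widens the first operand before multiplying, so the whole product is computed in 64 bits. The function names and the 100 bytes per node are hypothetical values chosen for the example; this is not the actual TCLB size calculation.

```cpp
#include <cstddef>
#include <cstdint>

// Assumption: a 64-bit platform, where std::size_t can hold the full product.
static_assert(sizeof(std::size_t) >= 8, "sketch assumes 64-bit size_t");

// Hypothetical helper (not TCLB code): converting the first factor to
// std::size_t promotes every following multiplication to 64-bit arithmetic.
std::size_t lattice_bytes(int nx, int ny, int nz, int bytes_per_node) {
    return static_cast<std::size_t>(nx) * ny * nz * bytes_per_node;
}

// Shows what is left if the same result is squeezed into 32 bits:
// only the low 32 bits survive, so the value no longer reflects the real size.
std::uint32_t low_32_bits(std::size_t bytes) {
    return static_cast<std::uint32_t>(bytes);
}
```

For a 256 x 800 x 256 lattice with the hypothetical 100 bytes per node, `lattice_bytes` returns 5242880000, while `low_32_bits` of that value is 947912704. Without the conversion the multiplication would be carried out in `int`, where overflow is undefined behaviour in C++.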

shkodm commented 7 months ago

@llaniewski it is probably a different bug, but some things still don't work as expected (the same happens on the master branch). I run on 2 V100 on Bunya, each with 80GB GPUs. My case is large, so I split it between the 2. I get: Cumulative allocation of 63.GB) and then an illegal memory access was encountered in Lattice.hpp at line 279

The error is the same even if I try to split between 3 GPUs (40GB each, so plenty of space even if there is some unaccounted memory).

llaniewski commented 7 months ago

@shkodm Just to clarify, do these large cases run on the master branch?

shkodm commented 7 months ago

@llaniewski no, they also don't work on the master branch. The error happens in Lattice.cu at the same place (CUDA kernel synchronisation). I ran with the d3q27_pf_velocity model.

llaniewski commented 7 months ago

Closing this issue and moving the discussion of the size limitation to #499. Addressing it is a bigger task and would need performance testing.