Good news: The OOM issue is caused by https://github.com/pytorch/pytorch/issues/6222, which should be fixed by https://github.com/pytorch/pytorch/pull/6230. I found out that replacing all `MaxPool3d` layers with `AvgPool3d` in our 3D models as a workaround fixed the growing memory usage problems, so it's safe to say we are experiencing the same `MaxPool3d`-specific bug.
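For reference, the swap can be done generically over a model; here is a minimal sketch of that kind of module replacement (the helper below is illustrative, not the exact code we used, and note that average pooling is not numerically equivalent to max pooling, so this only serves as a temporary workaround):

```python
import torch.nn as nn

def replace_maxpool3d_with_avgpool3d(model: nn.Module) -> None:
    """Recursively swap every MaxPool3d for an AvgPool3d with the same
    kernel geometry. This changes the pooling semantics, so it is only
    a stopgap to avoid the MaxPool3d-specific memory leak."""
    for name, child in model.named_children():
        if isinstance(child, nn.MaxPool3d):
            setattr(model, name, nn.AvgPool3d(
                kernel_size=child.kernel_size,
                stride=child.stride,
                padding=child.padding,
            ))
        else:
            replace_maxpool3d_with_avgpool3d(child)
```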
I can confirm that this issue is fixed as of https://github.com/pytorch/pytorch/commit/de517641194a5ec70c58a2c274ccc2abdd2d8ec1.
With 7b33ef4 everything works fine, but with newer revisions (I don't know since when exactly), training any network with elektronn3 leads to growing memory consumption with every training iteration until the GPU runs out of memory. Maybe some operation inside of `StoppableTrainer.train()` is now accidentally accumulating gradients? I haven't yet managed to produce a minimal piece of training code that doesn't slowly eat up all memory.
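For illustration, this is the classic pattern that causes such leaks in PyTorch training loops (a hypothetical minimal loop, not the actual `StoppableTrainer` code): accumulating a loss *tensor* instead of a plain Python float keeps every iteration's autograd graph alive, so memory grows step by step.

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
criterion = nn.MSELoss()

running_loss = 0.0
for step in range(100):
    inp, target = torch.randn(8, 10), torch.randn(8, 1)
    loss = criterion(model(inp), target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    # Leak pattern: `running_loss += loss` would retain each iteration's
    # autograd graph, so memory grows every step until OOM. Using .item()
    # converts the loss to a plain float and lets the graph be freed.
    running_loss += loss.item()
```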