Good news: The OOM issue is caused by https://github.com/pytorch/pytorch/issues/6222, which should be fixed by https://github.com/pytorch/pytorch/pull/6230. I found out that replacing all `MaxPool3d` layers with `AvgPool3d` in our 3D models as a workaround fixed the growing memory usage problems, so it's safe to say we are experiencing the same `MaxPool3d`-specific bug.
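For reference, the swap can be done generically over a model; here is a minimal sketch of that kind of module replacement (the helper below is illustrative, not the exact code we used, and note that average pooling is not numerically equivalent to max pooling, so this only serves as a temporary workaround):

```python
import torch.nn as nn

def replace_maxpool3d_with_avgpool3d(model: nn.Module) -> None:
    """Recursively swap every MaxPool3d for an AvgPool3d with the same
    kernel geometry. This changes the pooling semantics, so it is only
    a stopgap to avoid the MaxPool3d-specific memory leak."""
    for name, child in model.named_children():
        if isinstance(child, nn.MaxPool3d):
            setattr(model, name, nn.AvgPool3d(
                kernel_size=child.kernel_size,
                stride=child.stride,
                padding=child.padding,
            ))
        else:
            replace_maxpool3d_with_avgpool3d(child)
```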
I can confirm that this issue is fixed as of https://github.com/pytorch/pytorch/commit/de517641194a5ec70c58a2c274ccc2abdd2d8ec1.
With 7b33ef4 everything works fine, but with newer revisions (I don't know since when exactly), training any network with elektronn3 leads to growing memory consumption with every training iteration until the GPU runs out of memory. Maybe some operation inside of `StoppableTrainer.train()` is now accidentally accumulating gradients? I haven't yet managed to produce a minimal piece of training code that doesn't slowly eat up all memory.
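For illustration, this is the classic pattern that causes such leaks in PyTorch training loops (a hypothetical minimal loop, not the actual `StoppableTrainer` code): accumulating a loss *tensor* instead of a plain Python float keeps every iteration's autograd graph alive, so memory grows step by step.

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
criterion = nn.MSELoss()

running_loss = 0.0
for step in range(100):
    inp, target = torch.randn(8, 10), torch.randn(8, 1)
    loss = criterion(model(inp), target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    # Leak pattern: `running_loss += loss` would retain each iteration's
    # autograd graph, so memory grows every step until OOM. Using .item()
    # converts the loss to a plain float and lets the graph be freed.
    running_loss += loss.item()
```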