Each time a checkpoint is saved, the amount of CPU RAM in use increases. It seems that the garbage collector doesn't free the unreferenced memory. Copying each Tensor directly to CPU RAM fixed the problem for me. I think issue #109 could be caused by this: as the occupied memory grows, the process freezes and the OS kills it, just like @DmitryUlyanov said.
Commit d4f53da saves some more memory, which is useful if you have big models and not much RAM available.
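
To illustrate the idea of copying each tensor to CPU RAM before saving, here is a minimal sketch. It assumes PyTorch, which is not named in this thread, and the helper names `cpu_state_dict` and `save_checkpoint` are hypothetical. It is not the code from the commit, and whether it avoids the memory growth described above in a given setup is not verified here.

```python
import io
import torch


def cpu_state_dict(model: torch.nn.Module) -> dict:
    """Return the model's state with every tensor placed in CPU memory."""
    # detach() drops any autograd history; cpu() copies GPU tensors to host RAM
    # and returns the tensor unchanged if it already lives on the CPU.
    return {name: t.detach().cpu() for name, t in model.state_dict().items()}


def save_checkpoint(model: torch.nn.Module, target) -> None:
    """Save a CPU-only snapshot; `target` is a file path or a file-like object."""
    snapshot = cpu_state_dict(model)
    torch.save(snapshot, target)
    # Drop the local reference so the snapshot can be reclaimed promptly.
    del snapshot


if __name__ == "__main__":
    model = torch.nn.Linear(4, 2)  # small stand-in for a real model
    buffer = io.BytesIO()          # in-memory file, so nothing is written to disk
    save_checkpoint(model, buffer)
    print(f"checkpoint size: {buffer.tell()} bytes")
```

Building the snapshot tensor by tensor keeps the saved object free of references to the live model, so nothing from one checkpoint needs to stay alive until the next.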