IBM / mi-prometheus

Enabling reproducible Machine Learning research
http://mi-prometheus.rtfd.io/
Apache License 2.0

Memory 'leak' due to compounding number of loss tensors #111

Closed sesevgen closed 5 years ago

sesevgen commented 5 years ago

Describe the bug
Every time the loss is calculated, a zero-dimensional (scalar) tensor is left behind in memory and never cleaned up. The number of these tensors grows with each iteration, which eventually causes an 'out of memory' error in some runs.

To Reproduce
Insert this function into a loss-calculation call:

```python
import gc
import torch

def memReport(self):
    # Print every zero-dimensional (scalar) tensor currently tracked by the garbage collector.
    for obj in gc.get_objects():
        if torch.is_tensor(obj):
            if len(obj.size()) == 0:
                print(type(obj), obj.size(), id(obj))
```

and you will see that there is one more scalar tensor each iteration.

Here's an example (from simplecnn_mnist, where the function sits in evaluate_loss of problem.py):

```
New cycle
<class 'torch.Tensor'> torch.Size([]) 140667482493360
<class 'torch.Tensor'> torch.Size([]) 140666368017464
<class 'torch.Tensor'> torch.Size([]) 140666368017824
New cycle
<class 'torch.Tensor'> torch.Size([]) 140667482493360
<class 'torch.Tensor'> torch.Size([]) 140666366779328
<class 'torch.Tensor'> torch.Size([]) 140666368027168
<class 'torch.Tensor'> torch.Size([]) 140666368017824
New cycle
<class 'torch.Tensor'> torch.Size([]) 140667482493360
<class 'torch.Tensor'> torch.Size([]) 140666368027168
<class 'torch.Tensor'> torch.Size([]) 140666366779328
<class 'torch.Tensor'> torch.Size([]) 140666368017896
<class 'torch.Tensor'> torch.Size([]) 140666368017824
New cycle
<class 'torch.Tensor'> torch.Size([]) 140667482493360
<class 'torch.Tensor'> torch.Size([]) 140666368027168
<class 'torch.Tensor'> torch.Size([]) 140666368017896
<class 'torch.Tensor'> torch.Size([]) 140666366779328
<class 'torch.Tensor'> torch.Size([]) 140666368024296
<class 'torch.Tensor'> torch.Size([]) 140666368017824
New cycle
<class 'torch.Tensor'> torch.Size([]) 140667482493360
<class 'torch.Tensor'> torch.Size([]) 140666368027168
<class 'torch.Tensor'> torch.Size([]) 140666368017896
<class 'torch.Tensor'> torch.Size([]) 140666368024296
<class 'torch.Tensor'> torch.Size([]) 140666366779328
<class 'torch.Tensor'> torch.Size([]) 140666366834152
<class 'torch.Tensor'> torch.Size([]) 140666368017824
New cycle
<class 'torch.Tensor'> torch.Size([]) 140667482493360
<class 'torch.Tensor'> torch.Size([]) 140666368027168
<class 'torch.Tensor'> torch.Size([]) 140666368017896
<class 'torch.Tensor'> torch.Size([]) 140666368024296
<class 'torch.Tensor'> torch.Size([]) 140666366834152
<class 'torch.Tensor'> torch.Size([]) 140666366779328
<class 'torch.Tensor'> torch.Size([]) 140666366834296
<class 'torch.Tensor'> torch.Size([]) 140666368017824
```
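For reference, the same growth can be reproduced outside the framework. The loop below is just a minimal standalone sketch (made-up model and data, not mi-prometheus code) that keeps a reference to the loss tensor each iteration:

```python
import gc
import torch
import torch.nn as nn

def mem_report():
    # Count zero-dimensional (scalar) tensors currently tracked by the garbage collector.
    scalars = [obj for obj in gc.get_objects()
               if torch.is_tensor(obj) and len(obj.size()) == 0]
    print("scalar tensors alive:", len(scalars))

model = nn.Linear(10, 1)
criterion = nn.MSELoss()
kept_losses = []  # stand-in for any logger/collector that stores the raw loss tensor

for step in range(5):
    x, y = torch.randn(4, 10), torch.randn(4, 1)
    model.zero_grad()
    loss = criterion(model(x), y)
    loss.backward()
    kept_losses.append(loss)  # keeps the scalar tensor (and its graph) reachable
    mem_report()              # count grows by one each iteration
```

Dropping the kept_losses list (or storing loss.item() instead of the tensor) keeps the count flat.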


Additional context
There seem to be similar issues reported online where `loss` is used instead of `loss.item()` when logging values. Could this be our problem?
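If that is the cause, the difference is roughly the following (a sketch only; the stats dictionary here is illustrative, not the actual mi-prometheus collector):

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 1)
criterion = nn.MSELoss()

x, y = torch.randn(4, 10), torch.randn(4, 1)
loss = criterion(model(x), y)

# Problematic: the stored object is a tensor with grad_fn set,
# so the whole computation graph stays reachable from the log.
stats_leaky = {"loss": loss}
print(stats_leaky["loss"].grad_fn is not None)   # True

# Safe: .item() copies the value out as a plain Python float,
# so the tensor (and its graph) can be freed once `loss` goes out of scope.
stats_safe = {"loss": loss.item()}
print(type(stats_safe["loss"]))                  # <class 'float'>
```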

sesevgen commented 5 years ago

I think I have a fix.