comet-ml / issue-tracking

Questions, Help, and Issues for Comet ML
https://www.comet.ml

Memory leak when Logging 3d histogram PyTorch tensor on GPU #446

Closed. ianpegg-bc closed this issue 11 months ago

ianpegg-bc commented 2 years ago

Describe the Bug

The application runs out of memory and is killed attempting to log_histogram_3d with a Pytorch tensor on the GPU.

Expected behavior

Either of the following behaviors would be acceptable:

- log_histogram_3d moves the tensor to the CPU itself before flattening it, or
- a clear error is raised telling the user to call Tensor.cpu() first, instead of the process being killed for running out of memory.

Where is the issue?

Comet Python SDK

To Reproduce

import comet_ml
import torch

assert torch.cuda.is_available()
experiment = comet_ml.Experiment(project_name="test")

device = 'cuda'
# device = 'cpu'
x = torch.rand(100, device=device)

experiment.set_step(0)
experiment.log_histogram_3d(x, "x")

The issue goes away when you set device='cpu'.

Stack Trace

Process finished with exit code 137 (interrupted by signal 9: SIGKILL)

Stack trace if I interrupt the process mid-leak:

Traceback (most recent call last):
  File "/home/ian.pegg/miniconda3/envs/torch-nightly/lib/python3.9/site-packages/comet_ml/utils.py", line 1537, in fast_flatten
    items = numpy.array(items, dtype=float)
  File "/home/ian.pegg/miniconda3/envs/torch-nightly/lib/python3.9/site-packages/torch/_tensor.py", line 725, in __array__
    return self.numpy().astype(dtype, copy=False)
TypeError: can't convert cuda:0 device type tensor to numpy. Use Tensor.cpu() to copy the tensor to host memory first.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/ian.pegg/miniconda3/envs/torch-nightly/lib/python3.9/site-packages/comet_ml/utils.py", line 1543, in fast_flatten
    items = numpy.array([numpy.array(item) for item in items], dtype=float)
  File "/home/ian.pegg/miniconda3/envs/torch-nightly/lib/python3.9/site-packages/comet_ml/utils.py", line 1543, in <listcomp>
    items = numpy.array([numpy.array(item) for item in items], dtype=float)
  File "/home/ian.pegg/miniconda3/envs/torch-nightly/lib/python3.9/site-packages/torch/_tensor.py", line 723, in __array__
    return self.numpy()
TypeError: can't convert cuda:0 device type tensor to numpy. Use Tensor.cpu() to copy the tensor to host memory first.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/ian.pegg/projects/shining_software/src/shining_research/map_divergence_detection/debug.py", line 12, in <module>
    experiment.log_histogram_3d(x, "x")
  File "/home/ian.pegg/miniconda3/envs/torch-nightly/lib/python3.9/site-packages/comet_ml/experiment.py", line 2861, in log_histogram_3d
    histogram.add(values)
  File "/home/ian.pegg/miniconda3/envs/torch-nightly/lib/python3.9/site-packages/comet_ml/utils.py", line 956, in add
    values = fast_flatten(values)
  File "/home/ian.pegg/miniconda3/envs/torch-nightly/lib/python3.9/site-packages/comet_ml/utils.py", line 1550, in fast_flatten
    return numpy.array(flatten(items))
  File "/home/ian.pegg/miniconda3/envs/torch-nightly/lib/python3.9/site-packages/comet_ml/utils.py", line 1518, in flatten
    return list(lazy_flatten(items))
  File "/home/ian.pegg/miniconda3/envs/torch-nightly/lib/python3.9/site-packages/comet_ml/utils.py", line 1503, in lazy_flatten
    new_iterator = iter(value)
  File "/home/ian.pegg/miniconda3/envs/torch-nightly/lib/python3.9/site-packages/torch/_tensor.py", line 688, in __iter__
    if torch._C._get_tracing_state():
KeyboardInterrupt
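
For context, the underlying conversion failure can be reproduced outside Comet. A minimal sketch (assuming only torch and numpy are installed and a CUDA device is available; the print calls are just for illustration):

import numpy
import torch

x = torch.rand(100, device="cuda")

# Converting a CUDA tensor directly raises the TypeError shown above,
# because numpy can only read host memory.
try:
    numpy.array(x, dtype=float)
except TypeError as err:
    print(err)

# Copying the tensor to host memory first succeeds.
print(numpy.array(x.cpu(), dtype=float).shape)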

Link to Comet Project/Experiment

https://www.comet.ml/ianpegg-bc/test

DN6 commented 2 years ago

Thanks for catching this @ianpegg-bc. I'll have our engineering team look into this.

DN6 commented 2 years ago

@ianpegg-bc Following up here. I've created a ticket for the engineering team to address this. In the meantime, the workaround would be to move the tensor to the CPU before logging it as a histogram, as you have suggested.
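
A minimal sketch of that workaround, based on the reproduction script above (the project name is just the example value from the report):

import comet_ml
import torch

experiment = comet_ml.Experiment(project_name="test")

x = torch.rand(100, device="cuda")

experiment.set_step(0)
# Copy the tensor to host memory before handing it to Comet.
experiment.log_histogram_3d(x.cpu(), "x")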