AssafSinger94 / dino-tracker

Official Pytorch Implementation for “DINO-Tracker: Taming DINO for Self-Supervised Point Tracking in a Single Video”

torch.cuda.OutOfMemoryError: CUDA out of memory #3

Closed · hihxy closed this issue 4 months ago

hihxy commented 4 months ago

Thanks for proposing such a great method! I've been trying to reproduce your code recently, but I've run into some issues: a CUDA out-of-memory error occurred during the model inference step. The video has 200 frames at a resolution of 854x476, and the GPU is an NVIDIA RTX 3090 with 24 GB of memory. The issue also occurred during the training step.

50%|█████     | 5001/10001 [50:55<50:55,  1.64it/s]
Traceback (most recent call last):
  File "/data/hxy/dino-tracker/./train.py", line 16, in <module>
    dino_tracker.train()
  File "/data/hxy/dino-tracker/dino_tracker.py", line 415, in train
    consistent_track_loss = self.get_cycle_consistency_loss(model, inputs)
  File "/data/hxy/dino-tracker/dino_tracker.py", line 347, in get_cycle_consistency_loss
    cycle_consistency_preds = model.get_cycle_consistent_preds(inputs[-1], self.fg_masks)
  File "/data/hxy/dino-tracker/models/tracker.py", line 285, in get_cycle_consistent_preds
    target_source_coords = self.get_point_predictions(target_source_input, self.frame_embeddings)
  File "/data/hxy/dino-tracker/models/tracker.py", line 180, in get_point_predictions
    return self.get_point_predictions_from_embeddings(source_embeddings, frame_embeddings, target_frame_indices)
  File "/data/hxy/dino-tracker/models/tracker.py", line 173, in get_point_predictions_from_embeddings
    coords = self.tracker_head(self.cmap_relu(corr_maps))
  File "/data/hxy/envs/dino-tracker/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/data/hxy/envs/dino-tracker/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/data/hxy/dino-tracker/models/networks/tracker_head.py", line 118, in forward
    refined_heatmap = self.softmax_heatmap(self.cnn_refiner(cost_volume)) # shape (B, 1, H, W)
  File "/data/hxy/envs/dino-tracker/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/data/hxy/envs/dino-tracker/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/data/hxy/envs/dino-tracker/lib/python3.9/site-packages/torch/nn/modules/container.py", line 215, in forward
    input = module(input)
  File "/data/hxy/envs/dino-tracker/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/data/hxy/envs/dino-tracker/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/data/hxy/dino-tracker/models/networks/conv_norm.py", line 46, in forward
    return F.conv2d(x, normalized_weights, bias=self.bias, stride=self.stride, padding=self.padding)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 430.00 MiB. GPU 0 has a total capacty of 23.69 GiB of which 285.62 MiB is free. Including non-PyTorch memory, this process has 23.27 GiB memory in use. Of the allocated memory 22.63 GiB is allocated by PyTorch, and 318.30 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

I managed to fix the training error by changing the train_batch_size parameter to 128 in the ./config/train.yaml file, and training completed successfully. But when I tried using the trained model for inference, the issue resurfaced, and tweaking the parameters didn't help.

Traceback (most recent call last):
  File "/data/hxy/dino-tracker/./inference_grid.py", line 54, in <module>
    run(args)
  File "/data/hxy/envs/dino-tracker/lib/python3.9/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/data/hxy/dino-tracker/./inference_grid.py", line 25, in run
    model_inference = ModelInference(
  File "/data/hxy/dino-tracker/models/model_inference.py", line 91, in __init__
    self.model.cache_refined_embeddings()
  File "/data/hxy/dino-tracker/models/tracker.py", line 132, in cache_refined_embeddings
    refined_features, _ = self.get_refined_embeddings(torch.arange(0, self.video.shape[0]))
  File "/data/hxy/dino-tracker/models/tracker.py", line 125, in get_refined_embeddings
    refined_embeddings = frames_dino_embeddings + residual_embeddings
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 6.31 GiB. GPU 0 has a total capacty of 23.69 GiB of which 2.34 GiB is free. Including non-PyTorch memory, this process has 21.21 GiB memory in use. Of the allocated memory 20.89 GiB is allocated by PyTorch, and 23.22 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

So, how can I deal with it? I really need your help.

tnarek commented 4 months ago

Hello @hihxy, thanks for the question. Have you tried modifying the --batch-size argument of inference_grid.py? If you have and it wasn't helpful, let us know and we'll think of ways to make the script more efficient / parallelizable.

Also, just as a side note: the training runs OOM at 5K iterations because after 5K we start applying the contrastive and cycle-consistency losses in the refined feature space, which demands more memory.

hihxy commented 4 months ago

Thanks for your detailed reply, @tnarek.

In the training step, I only changed the train_batch_size parameter to 128 in the ./config/train.yaml file, and it worked.

########################## data loader ##########################
video_resw: 854
video_resh: 476

fg_traj_ratio: 0.5
keep_traj_in_cpu: true # set to true for long videos (> 150 frames)
train_batch_size: 128   # default 512
batch_n_frames: 4
# sampler_batch_iterations: 1000 # uncomment for large videos

For the inference step, I reused the ./config/train.yaml file from training (I have tried train_batch_size: 128 / 64 / 32). I assumed this would also change the --batch-size argument of inference_grid.py. Is that right?

After reading your reply, I ran inference with the command below and changed --batch-size (128, 64, 32, ...), but it still runs OOM.

python ./inference_grid.py \
    --config ./config/train.yaml \
    --data-path ./dataset/ljc \
    --batch-size 64 \
    --use-segm-mask \
    > ./logs/inference.log 2>&1

The inference log output is the same as in my first report.

Traceback (most recent call last):
  File "/data/hxy/dino-tracker/./inference_grid.py", line 54, in <module>
    run(args)
  File "/data/hxy/envs/dino-tracker/lib/python3.9/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/data/hxy/dino-tracker/./inference_grid.py", line 25, in run
    model_inference = ModelInference(
  File "/data/hxy/dino-tracker/models/model_inference.py", line 91, in __init__
    self.model.cache_refined_embeddings()
  File "/data/hxy/dino-tracker/models/tracker.py", line 132, in cache_refined_embeddings
    refined_features, _ = self.get_refined_embeddings(torch.arange(0, self.video.shape[0]))
  File "/data/hxy/dino-tracker/models/tracker.py", line 125, in get_refined_embeddings
    refined_embeddings = frames_dino_embeddings + residual_embeddings
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 6.31 GiB. GPU 0 has a total capacty of 23.69 GiB of which 2.34 GiB is free. Including non-PyTorch memory, this process has 21.21 GiB memory in use. Of the allocated memory 20.89 GiB is allocated by PyTorch, and 23.22 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

tnarek commented 4 months ago

Now I see that the reason for the OOM is the caching of the refined embeddings. We do this to compute them only once for the entire video and reuse them for all query points.

You can turn this line off in model_inference.py (line 91), but it will slow down inference. Alternatively, trajectories can be computed separately for subsets of ~100 frames, which requires caching fewer refined embeddings at a time.
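
For reference, here is a rough sketch of the second option, assuming model.get_refined_embeddings accepts an arbitrary tensor of frame indices as in tracker.py; the chunk size, CPU offload, and helper name are illustrative, not part of the repo:

import torch

CHUNK_SIZE = 100  # illustrative window size; tune to your GPU

@torch.no_grad()
def cache_refined_embeddings_chunked(model):
    # Hypothetical helper: compute refined embeddings in windows of frames
    # and park each chunk in host memory, so the whole video never has to
    # fit on the GPU at once.
    n_frames = model.video.shape[0]
    chunks = []
    for start in range(0, n_frames, CHUNK_SIZE):
        idx = torch.arange(start, min(start + CHUNK_SIZE, n_frames))
        refined, _ = model.get_refined_embeddings(idx)
        chunks.append(refined.cpu())  # offload chunk to CPU memory
        torch.cuda.empty_cache()      # release intermediate GPU allocations
    return torch.cat(chunks, dim=0)   # refined embeddings for all frames, on CPU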

hihxy commented 4 months ago

Thank you for your reply. I have tried turning off this line, but it still leads to an OOM problem. model_inference.py:

class ModelInference(torch.nn.Module):
    def __init__(
        self,
        model: Tracker,
        range_normalizer: RangeNormalizer,
        anchor_cosine_similarity_threshold: float = 0.5,
        cosine_similarity_threshold: float = 0.5,
        ) -> None:
        super().__init__()

        self.model = model
        self.model.eval()
        # self.model.cache_refined_embeddings()

        self.range_normalizer = range_normalizer
        self.anchor_cosine_similarity_threshold = anchor_cosine_similarity_threshold
        self.cosine_similarity_threshold = cosine_similarity_threshold

Log output:

nohup: ignoring input
/data/hxy/envs/dino-tracker/lib/python3.9/site-packages/torch/functional.py:504: UserWarning: torch.meshgrid: in an upcoming release, it will be required to pass the indexing argument. (Triggered internally at ../aten/src/ATen/native/TensorShape.cpp:3526.)
  return _VF.meshgrid(tensors, **kwargs)  # type: ignore[attr-defined]
Traceback (most recent call last):
  File "/data/hxy/dino-tracker/./inference_grid.py", line 54, in <module>
    run(args)
  File "/data/hxy/envs/dino-tracker/lib/python3.9/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/data/hxy/dino-tracker/./inference_grid.py", line 39, in run
    grid_trajectories, grid_occlusions = model_inference.infer(grid_query_points, batch_size=args.batch_size)
  File "/data/hxy/envs/dino-tracker/lib/python3.9/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/data/hxy/dino-tracker/models/model_inference.py", line 212, in infer
    trajs = self.compute_trajectories(query_points, batch_size) # N x T x 3
  File "/data/hxy/dino-tracker/models/model_inference.py", line 98, in compute_trajectories
    trajecroies = generate_trajectories(
  File "/data/hxy/envs/dino-tracker/lib/python3.9/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/data/hxy/dino-tracker/models/model_inference.py", line 70, in generate_trajectories
    trajectory_pred = generate_trajectory(query_point=query_point, video=video, model=model, range_normalizer=range_normalizer, dst_range=dst_range, use_raw_features=use_raw_features,
  File "/data/hxy/envs/dino-tracker/lib/python3.9/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/data/hxy/dino-tracker/models/model_inference.py", line 51, in generate_trajectory
    trajectory_coordinate_preds_normalized = model(trajectory_input, use_raw_features=use_raw_features)
  File "/data/hxy/envs/dino-tracker/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/data/hxy/envs/dino-tracker/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/data/hxy/dino-tracker/models/tracker.py", line 319, in forward
    frame_embeddings, residual_embeddings, raw_embeddings = self.get_refined_embeddings(frames_set_t, return_raw_embeddings=True)
  File "/data/hxy/dino-tracker/models/tracker.py", line 120, in get_refined_embeddings
    residual_embeddings = torch.zeros_like(frames_dino_embeddings)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 3.77 GiB. GPU 0 has a total capacty of 23.69 GiB of which 3.21 GiB is free. Including non-PyTorch memory, this process has 20.35 GiB memory in use. Of the allocated memory 20.02 GiB is allocated by PyTorch, and 12.06 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

May I ask, is it because the frames_dino_embeddings tensor is too large? How should I optimize it? By the way, does dino-tracker support training and inference with multiple GPUs? Looking forward to your reply, thank you so much.

tnarek commented 4 months ago

You are correct: the problem is that the frame embeddings do not fit in GPU memory. The OOM happens even before computing the embeddings, in line 120 of tracker.py, where we initialize a tensor of zeros in which the residual embeddings are stored later on. So if this tensor does not fit in memory, the frame embeddings will not fit either.
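
For a rough sense of scale, taking the numbers from the log above and assuming float32 and (hypothetically) that the tensor spans all 200 frames:

elements = 3.77 * (1024 ** 3) / 4   # failed 3.77 GiB allocation at 4 bytes/element: ~1.0e9 elements
per_frame = elements / 200          # ~5.1e6 elements per frame, before any other buffers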

Currently, our code does not support multiple GPUs. A simple workaround would be to store the raw DINO embeddings on a second GPU device and move them to the first device only when they are needed. You can see that before the OOM happens, the GPU already has 20 GB occupied, most of which I suspect is the pre-computed raw DINO embeddings.
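
A minimal sketch of that idea (the device choice and both helper functions are hypothetical, not repo APIs; on a single-GPU machine, "cpu" works as the storage device at the cost of transfer time):

import torch

storage_device = torch.device("cuda:1")   # second GPU (or torch.device("cpu"))
compute_device = torch.device("cuda:0")

def offload(embeddings):
    # Park the large pre-computed DINO embeddings, shape (T, C, H, W),
    # on the storage device so they stop occupying cuda:0.
    return embeddings.to(storage_device)

def fetch_frames(embeddings, frame_indices):
    # Copy only the frames needed for the current step to the compute device.
    idx = frame_indices.to(embeddings.device)
    return embeddings[idx].to(compute_device, non_blocking=True)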

hihxy commented 4 months ago

Thanks.