ZikangZhou / QCNet

[CVPR 2023] Query-Centric Trajectory Prediction
https://openaccess.thecvf.com/content/CVPR2023/papers/Zhou_Query-Centric_Trajectory_Prediction_CVPR_2023_paper.pdf
Apache License 2.0

Training cost too much #2

Open zaplm opened 1 year ago

zaplm commented 1 year ago

@ZikangZhou Hi Zhou! I noticed that when I train this code, it utilizes 32GB of GPU memory per GPU instead of the 20GB mentioned in the README.md. Could you please explain what might be causing this discrepancy in this repository?

zaplm commented 1 year ago

It didn't happen right at the beginning of training. As the training progresses, the GPU memory usage increases. In the first epoch, it occupies 20GB per GPU, but by the 30th epoch, the GPU memory usage reaches 32GB per GPU.

flclain commented 1 year ago

I encountered a similar phenomenon. The GPU memory grows with training. Maybe the code has a memory leak problem.

ZikangZhou commented 1 year ago

Hi, are you looking at the output of nvidia-smi?

zaplm commented 1 year ago

Yes. While using an RTX 3090, the program quickly ran out of GPU memory by epoch 2. However, when I switched to an A100, I observed that it consumed nearly 33GB of memory per GPU over about 40 epochs of training.

ZikangZhou commented 1 year ago

Hmm... I noticed a similar phenomenon during training. I think there may be several reasons. First, if the number of agents/map elements within a batch happens to be extremely large, then the peak memory cost will blow up. Second, there exists a memory leak in the code (but I can't identify it, maybe someone can help). Lastly, the memory usage displayed by nvidia-smi is not that accurate (I think it is merely showing the buffer size created by the program).

I usually use four A40s (48GB memory each) to train models with batch size 8 per GPU. The memory usage shown by nvidia-smi varies from 30GB to ~40GB per GPU.
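For reference, a minimal way to separate the caching allocator's pool (roughly what nvidia-smi shows) from live tensor usage is to log the CUDA memory counters from inside the training loop, e.g. once per epoch. The helper below is an illustrative sketch, not part of this repo:

import torch

def log_gpu_memory(tag: str = "") -> None:
    # Live tensor memory vs. the allocator's cached pool; the pool only grows,
    # so nvidia-smi can show an increase even without a true leak.
    allocated = torch.cuda.memory_allocated() / 1024 ** 3
    reserved = torch.cuda.memory_reserved() / 1024 ** 3
    peak = torch.cuda.max_memory_allocated() / 1024 ** 3
    print(f"[{tag}] allocated={allocated:.1f}GB reserved={reserved:.1f}GB peak={peak:.1f}GB")
    torch.cuda.reset_peak_memory_stats()  # restart peak tracking for the next interval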

zaplm commented 1 year ago

Thanks for your response! I will investigate the underlying cause of the memory leak issue.

ghost commented 1 year ago

@zaplm @flclain @ZikangZhou Hi, I encountered a similar problem. Have you investigated the underlying cause of the memory leak issue? I train the model with batch size 4 per GPU on 4 RTX 4090s (24GB); it is normal at the beginning and costs 20GB per GPU, but runs out of GPU memory by epoch 3. I tried reducing the batch size to 2 per GPU, but it still runs out of GPU memory later.

zaplm commented 1 year ago

@Qingfeng800, you can also attempt to decrease the radius, which will also result in lower GPU memory usage.
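For example, using the radius flags from the training command shown later in this thread (the reduced values here are illustrative, and smaller radii may affect accuracy):

python train_qcnet.py ... --pl2pl_radius 100 --pl2a_radius 35 --a2a_radius 35 --pl2m_radius 100 --a2m_radius 100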

ghost commented 1 year ago

@zaplm

@Qingfeng800, you can also attempt to decrease the radius, which will also result in lower GPU memory usage.

Yes, adjusting hyperparameters can reduce GPU memory usage, but it may affect accuracy. Since the GPU memory gradually increases as training progresses, there must be a bug in the code.

flclain commented 1 year ago

@zaplm @flclain @ZikangZhou Hi, I encountered a similar problem. Have you investigated the underlying cause of the memory leak issue? I train the model with batch size 4 per GPU on 4 RTX 4090s (24GB); it is normal at the beginning and costs 20GB per GPU, but runs out of GPU memory by epoch 3. I tried reducing the batch size to 2 per GPU, but it still runs out of GPU memory later.

I used two A800s (80 GB) to train the model with batch size 16. The memory leak problem hasn't been an issue for me, so I haven't looked into it. Maybe you can decrease your batch size and also choose a smaller learning rate if you are concerned about hurting model performance.

harshy105 commented 1 year ago

Thanks for your response! I will investigate the underlying cause of the memory leak issue.

Hi @zaplm, have you been able to figure out the underlying memory leak issue? Otherwise we could maybe reopen the issue.

SunHaoOne commented 9 months ago

Anyone encountering a situation where the GPU memory is insufficient can reduce memory usage significantly by using the command model = torch.compile(model). In my experimental setup, this approach reduced GPU memory consumption by approximately 70%.

yenanjing commented 9 months ago

Anyone encountering a situation where the GPU memory is insufficient can reduce memory usage significantly by using the command model = torch.compile(model). In my experimental setup, this approach reduced GPU memory consumption by approximately 70%.

I tried to use model = torch.compile(model) and it reports the error:

RuntimeError: CUDA error: misaligned address

Is there any version mismatch or something wrong? These are my environment setups:

* PyTorch version: 2.0.1

* CUDA available: True

* CUDA version: 11.8

SunHaoOne commented 9 months ago

Anyone encountering a situation where the GPU memory is insufficient can reduce memory usage significantly by using the command model = torch.compile(model). In my experimental setup, this approach reduced GPU memory consumption by approximately 70%.

I tried to use model = torch.compile(model) and it reports the error:

RuntimeError: CUDA error: misaligned address

Is there any version mismatch or something wrong? These are my environment setup:

* PyTorch version: 2.0.1

* CUDA available: True

* CUDA version: 11.8

"I am currently using PyTorch version 2.1.0 (with CUDA 11.8 support) and PyTorch Lightning version 2.1.2. I've encountered an issue related to 'self.log' conflicts. A similar issue has been discussed in this GitHub thread: PyTorch Lightning GitHub Issue #18835."

yenanjing commented 9 months ago

"I am currently using PyTorch version 2.1.0 (with CUDA 11.8 support) and PyTorch Lightning version 2.1.2. I've encountered an issue related to 'self.log' conflicts. A similar issue has been discussed in this GitHub thread: PyTorch Lightning GitHub Issue #18835."

Thanks for sharing. I reinstalled PyTorch to match your version; the original problem is gone, but a new bug appears:

torch._dynamo.exc.TorchRuntimeError: Failed running call_function torch_cluster.radius(*(FakeTensor(..., device='cuda:0', size=(561, 2)), FakeTensor(..., device='cuda:0', size=(561, 2)), FakeTensor(..., device='cuda:0', size=(9,), dtype=torch.int64), FakeTensor(..., device='cuda:0', size=(9,), dtype=torch.int64), 30.0, 301, 1), **{}): The tensor has a non-zero number of elements, but its data is not allocated yet. Caffe2 uses a lazy allocation, so you will need to call mutable_data() or raw_mutable_data() to actually allocate memory.

from user code: File "/home/lww/anaconda3/envs/QCNet/lib/python3.8/site-packages/torch_cluster/radius.py", line 82, in return torch.ops.torch_cluster.radius(x, y, ptr_x, ptr_y, r,

Set TORCH_LOGS="+dynamo" and TORCHDYNAMO_VERBOSE=1 for more information

You can suppress this exception and fall back to eager by setting: import torch._dynamo torch._dynamo.config.suppress_errors = True

with more environment setups:

* pytorch-lightning 2.1.2
* pytorch-triton 2.1.0+bcad9dabe1
* torch 2.1.0+cu118
* torch-cluster 1.6.3+pt21cu118
* torch_geometric 2.4.0
* torch-scatter 2.1.2+pt21cu118
* torch-sparse 0.6.17
* torch-spline-conv 1.2.2+pt21cu118
* torchaudio 2.1.0+cu118
* torchmetrics 0.11.4
* torchvision 0.16.0+cu118

SunHaoOne commented 9 months ago
$ pip list | grep torch
pytorch-lightning        2.1.2
pytorch-triton           2.1.0+e6216047b8
torch                    2.1.0+cu118
torch-geometric          2.3.1
torch-scatter            2.1.2
torch2trt                0.4.0
torchaudio               2.1.0+cu118
torchmetrics             1.2.0
torchvision              0.16.0+cu118

This is my complete PyTorch-related environment. I've written a straightforward radius function that filters indices based on computed distances. If this approach is not effective for you, feel free to create your own implementation of this function.

"I am currently using PyTorch version 2.1.0 (with CUDA 11.8 support) and PyTorch Lightning version 2.1.2. I've encountered an issue related to 'self.log' conflicts. A similar issue has been discussed in this GitHub thread: PyTorch Lightning GitHub Issue #18835."

Thanks for sharing. I reinstalled PyTorch to match your version; the original problem is gone, but a new bug appears:

torch._dynamo.exc.TorchRuntimeError: Failed running call_function torch_cluster.radius(*(FakeTensor(..., device='cuda:0', size=(561, 2)), FakeTensor(..., device='cuda:0', size=(561, 2)), FakeTensor(..., device='cuda:0', size=(9,), dtype=torch.int64), FakeTensor(..., device='cuda:0', size=(9,), dtype=torch.int64), 30.0, 301, 1), **{}): The tensor has a non-zero number of elements, but its data is not allocated yet. Caffe2 uses a lazy allocation, so you will need to call mutable_data() or raw_mutable_data() to actually allocate memory. from user code: File "/home/lww/anaconda3/envs/QCNet/lib/python3.8/site-packages/torch_cluster/radius.py", line 82, in return torch.ops.torch_cluster.radius(x, y, ptr_x, ptr_y, r, Set TORCH_LOGS="+dynamo" and TORCHDYNAMO_VERBOSE=1 for more information You can suppress this exception and fall back to eager by setting: import torch._dynamo torch._dynamo.config.suppress_errors = True

with more environment setups:

* pytorch-lightning             2.1.2

* pytorch-triton                2.1.0+bcad9dabe1

* torch                         2.1.0+cu118

* torch-cluster                 1.6.3+pt21cu118

* torch_geometric               2.4.0

* torch-scatter                 2.1.2+pt21cu118

* torch-sparse                  0.6.17

* torch-spline-conv             1.2.2+pt21cu118

* torchaudio                    2.1.0+cu118

* torchmetrics                  0.11.4

* torchvision                   0.16.0+cu118
Syk-yr commented 9 months ago

Anyone encountering a situation where the GPU memory is insufficient can reduce memory usage significantly by using the command model = torch.compile(model). In my experimental setup, this approach reduced GPU memory consumption by approximately 70%.

I tried to use model = torch.compile(model) and it reports the error: raise TypeError(f"Trainer.fit() requires a LightningModule, got: {model.__class__.__qualname__}") TypeError: Trainer.fit() requires a LightningModule, got: OptimizedModule

Sorry, where should I add this command?

SunHaoOne commented 9 months ago

Anyone encountering a situation where the GPU memory is insufficient can reduce memory usage significantly by using the command model = torch.compile(model). In my experimental setup, this approach reduced GPU memory consumption by approximately 70%.

I tried to use model = torch.compile(model) and it reports the error: raise TypeError(f"Trainer.fit() requires a LightningModule, got: {model.__class__.__qualname__}") TypeError: Trainer.fit() requires a LightningModule, got: OptimizedModule

Sorry, where should I add this command?

In train_qcnet.py, find model = QCNet(**vars(args)) and add model = torch.compile(model) right after it.
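A minimal sketch of the placement (the surrounding lines are paraphrased, not copied from the repo; an up-to-date pytorch-lightning 2.x is assumed so that Trainer.fit accepts the compiled module):

# train_qcnet.py (sketch of the placement)
model = QCNet(**vars(args))    # existing line
model = torch.compile(model)   # add this line right after it
# ... the compiled model is then passed to trainer.fit() as before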

Syk-yr commented 9 months ago

Anyone encountering a situation where the GPU memory is insufficient can reduce memory usage significantly by using the command model = torch.compile(model). In my experimental setup, this approach reduced GPU memory consumption by approximately 70%.

I tried to use model = torch.compile(model) and it reports the error: raise TypeError(f"Trainer.fit() requires a LightningModule, got: {model.__class__.__qualname__}") TypeError: Trainer.fit() requires a LightningModule, got: OptimizedModule Sorry, where should I add this command?

In train_qcnet.py, find model = QCNet(**vars(args)) and add model = torch.compile(model) right after it.

I also added this line of code where you mentioned, but it still reports an error: raise TypeError(f"Trainer.fit() requires a LightningModule, got: {model.__class__.__qualname__}") TypeError: Trainer.fit() requires a LightningModule, got: OptimizedModule

SunHaoOne commented 9 months ago

I also added this line of code where you mentioned, but it still reports an error: raise TypeError(f"Trainer.fit() requires a LightningModule, got: {model.__class__.__qualname__}") TypeError: Trainer.fit() requires a LightningModule, got: OptimizedModule

I believe it would be beneficial to verify the versions of PyTorch and PyTorch Lightning that were previously mentioned.

Syk-yr commented 9 months ago

"I am currently using PyTorch version 2.1.0 (with CUDA 11.8 support) and PyTorch Lightning version 2.1.2. I've encountered an issue related to 'self.log' conflicts. A similar issue has been discussed in this GitHub thread: PyTorch Lightning GitHub Issue #18835."

Thanks for sharing. I reinstalled PyTorch to match your version; the original problem is gone, but a new bug appears:

torch._dynamo.exc.TorchRuntimeError: Failed running call_function torch_cluster.radius(*(FakeTensor(..., device='cuda:0', size=(561, 2)), FakeTensor(..., device='cuda:0', size=(561, 2)), FakeTensor(..., device='cuda:0', size=(9,), dtype=torch.int64), FakeTensor(..., device='cuda:0', size=(9,), dtype=torch.int64), 30.0, 301, 1), **{}): The tensor has a non-zero number of elements, but its data is not allocated yet. Caffe2 uses a lazy allocation, so you will need to call mutable_data() or raw_mutable_data() to actually allocate memory. from user code: File "/home/lww/anaconda3/envs/QCNet/lib/python3.8/site-packages/torch_cluster/radius.py", line 82, in return torch.ops.torch_cluster.radius(x, y, ptr_x, ptr_y, r, Set TORCH_LOGS="+dynamo" and TORCHDYNAMO_VERBOSE=1 for more information You can suppress this exception and fall back to eager by setting: import torch._dynamo torch._dynamo.config.suppress_errors = True

with more environment setups:

  • pytorch-lightning 2.1.2
  • pytorch-triton 2.1.0+bcad9dabe1
  • torch 2.1.0+cu118
  • torch-cluster 1.6.3+pt21cu118
  • torch_geometric 2.4.0
  • torch-scatter 2.1.2+pt21cu118
  • torch-sparse 0.6.17
  • torch-spline-conv 1.2.2+pt21cu118
  • torchaudio 2.1.0+cu118
  • torchmetrics 0.11.4
  • torchvision 0.16.0+cu118

I have the same problem as you. Have you solved it?

Syk-yr commented 9 months ago

I also added this line of code where you mentioned, but it still reports an error: raise TypeError(f"Trainer.fit() requires a LightningModule, got: {model.__class__.__qualname__}") TypeError: Trainer.fit() requires a LightningModule, got: OptimizedModule

I believe it would be beneficial to verify the versions of PyTorch and PyTorch Lightning that were previously mentioned.

Thanks for your answer. Yes, it was my fault; I updated pytorch-lightning. That said, I now run into the same problem as the one above. How did you solve it?

File "/home/syk/miniconda3/envs/py38/lib/python3.8/site-packages/torch/_ops.py", line 692, in __call__ return self._op(*args, **kwargs or {}) torch._dynamo.exc.TorchRuntimeError: Failed running call_function torch_cluster.radius(*(FakeTensor(..., device='cuda:0', size=(11550, 2)), FakeTensor(..., device='cuda:0', size=(11550, 2)), FakeTensor(..., device='cuda:0', size=(201,), dtype=torch.int64), FakeTensor(..., device='cuda:0', size=(201,), dtype=torch.int64), 35, 301, 1), **{}): The tensor has a non-zero number of elements, but its data is not allocated yet. Caffe2 uses a lazy allocation, so you will need to call mutable_data() or raw_mutable_data() to actually allocate memory.

from user code: File "/home/syk/miniconda3/envs/py38/lib/python3.8/site-packages/torch_cluster/radius.py", line 82, in return torch.ops.torch_cluster.radius(x, y, ptr_x, ptr_y, r,

Set TORCH_LOGS="+dynamo" and TORCHDYNAMO_VERBOSE=1 for more information

You can suppress this exception and fall back to eager by setting: import torch._dynamo torch._dynamo.config.suppress_errors = True

SunHaoOne commented 9 months ago

You can suppress this exception and fall back to eager by setting: import torch._dynamo torch._dynamo.config.suppress_errors = True

You can try to follow https://github.com/pytorch/pytorch/issues/95791#issuecomment-1595237235, or write your own custom radius function: just compute the pairwise distances and select the pairs whose distance is less than the radius.
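For reference, a minimal sketch of such a distance-threshold replacement: a dense O(N*M) substitute for a single graph that ignores the batch/ptr arguments and max_num_neighbors, so double-check the index convention against torch_cluster before swapping it in:

import torch

def radius_dense(x: torch.Tensor, y: torch.Tensor, r: float) -> torch.Tensor:
    # For every point in y, find all points in x within distance r.
    dist = torch.cdist(y, x)                            # [M, N] pairwise distances
    row, col = torch.nonzero(dist <= r, as_tuple=True)  # row indexes y, col indexes x
    return torch.stack([row, col], dim=0)               # [2, num_pairs]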

kennethweitzel commented 7 months ago

@SunHaoOne Could you provide some code for your custom radius function? Did you replace the whole radius_graph function from torch-cluster, or did you integrate your radius function into torch-cluster?

SunHaoOne commented 7 months ago

@SunHaoOne Could you provide some code for your custom radius function? Did you replace the whole radius_graph function from torch-cluster, or did you integrate your radius function into torch-cluster?

Hi @kennethweitzel, for example, I didn't use the 'r' parameter to compute the neighborhood. Instead, you can adjust the top-k approach to incorporate the radius parameters. The radius_graph function computes the radius relation internally, so a top-k neighborhood can serve as an approximation of the conventional radius function:

import torch


def radius_topK(xy_A: torch.Tensor, xy_B: torch.Tensor, k: int) -> torch.Tensor:
    """
    Find the top k nearest points in set A for each point in set B and return their indices.

    Args:
        xy_A: Coordinates of points in set A, shape [N, 2].
        xy_B: Coordinates of points in set B, shape [M, 2].
        k: The number of nearest neighbors to find for each point in B.

    Returns:
        A tensor of shape [2, M*k] where the first row contains repeated indices of points in B
        and the second row contains indices of the nearest points in A for each point in B.
    """
    device = xy_A.device
    N = xy_A.shape[0]
    M = xy_B.shape[0]

    # Expand xy_A and xy_B to compute pairwise distances
    xy_A_expanded = xy_A.unsqueeze(0).expand(M, N, 2).to(device)  # Shape: [M, N, 2]
    xy_B_expanded = xy_B.unsqueeze(1).expand(M, N, 2).to(device)  # Shape: [M, N, 2]

    # Compute the Euclidean distance between each pair of points
    rel_dist = torch.norm(xy_B_expanded - xy_A_expanded, dim=-1)  # Shape: [M, N]

    # Find the indices of the k nearest neighbors in A for each point in B
    nearest_idx_A = torch.topk(rel_dist, k, largest=False, sorted=True).indices  # Shape: [M, k]

    # Generate repeated indices for points in B
    idx_B = torch.arange(M, device=device).unsqueeze(-1).repeat(1, k).view(-1)  # Shape: [M*k]

    # Flatten the indices of nearest points in A
    nearest_idx_A_flat = nearest_idx_A.view(-1)  # Shape: [M*k]

    # Combine the indices of B and nearest points in A
    combined_indices = torch.stack((idx_B, nearest_idx_A_flat), dim=0)  # Shape: [2, M*k]

    return combined_indices
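
For example, a usage sketch (the variable names and sizes are illustrative):

# connect each of 200 query points to its 15 nearest candidates
xy_A = torch.randn(1000, 2)                 # candidate points (set A)
xy_B = torch.randn(200, 2)                  # query points (set B)
edge_index = radius_topK(xy_A, xy_B, k=15)  # [2, 200*15]; row 0 indexes B, row 1 indexes A
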
kennethweitzel commented 7 months ago

@SunHaoOne Thanks a lot! I have some further questions: What do you use for k in this case? Do you think this can entirely replace the radius function? There are parts of the code where the radius function is called directly with quite a specific radius (e.g. 150m or 50m), like in https://github.com/ZikangZhou/QCNet/blob/55cacb418cbbce3753119c1f157360e66993d0d0/modules/qcnet_agent_encoder.py#L157-L158, with the radii defined in

python train_qcnet.py --root /path/to/dataset_root/ --train_batch_size 4 --val_batch_size 4 --test_batch_size 4 --devices 8 --dataset argoverse_v2 --num_historical_steps 50 --num_future_steps 60 --num_recurrent_steps 3 --pl2pl_radius 150 --time_span 10 --pl2a_radius 50 --a2a_radius 50 --num_t2m_steps 30 --pl2m_radius 150 --a2m_radius 150

And also the function you provided seems like it can't handle batches. How did you handle that? Thanks in advance!

SunHaoOne commented 7 months ago

Hi, @kennethweitzel

What do you use for k in this case?

Since we aim to deploy the model, using a radius produces dynamic tensor shapes (the number of neighbors within the radius varies), which increases computation time, so we cap the neighborhood with top-k. Through data analysis, for example by taking coordinates from the dataset and computing the connectivity relationships, you can identify a balance point that relates the values of r and k. In my scenario we use k = 15, but you can adjust k for different r values, e.g. r = 50, 100, 150.
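A sketch of that kind of analysis (the quantile is an arbitrary illustrative choice, and choose_k_for_radius is not a function from the repo):

import torch

def choose_k_for_radius(coords_list, r: float, quantile: float = 0.95) -> int:
    # Count, over a sample of scenes, how many neighbors fall within radius r,
    # then pick k as a high quantile of that distribution.
    counts = []
    for xy in coords_list:                          # xy: [N, 2] coordinates of one scene
        dist = torch.cdist(xy, xy)                  # [N, N]
        counts.append((dist <= r).sum(dim=1) - 1)   # neighbors within r, excluding self
    counts = torch.cat(counts).float()
    return int(torch.quantile(counts, quantile).item())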

And also the function you provided seems like it can't handle batches. How did you handle that?

To simplify inference, we removed the batch dimension. My understanding is that the Batch data structure in torch_geometric is obtained by stacking each graph's relationships, so you can iterate over the graphs in the batch, compute the connectivity for each one, and then merge the results along that dimension.
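And a sketch of the per-graph merging for a torch_geometric Batch, assuming the radius_topK function above (Batch.ptr holds the cumulative node counts; the function name is illustrative):

import torch

def radius_topK_batched(xy: torch.Tensor, ptr: torch.Tensor, k: int) -> torch.Tensor:
    # Apply radius_topK within each graph of the batch and shift the local
    # indices back to global node ids; assumes every graph has at least k nodes.
    edge_chunks = []
    for i in range(ptr.numel() - 1):
        start, end = int(ptr[i]), int(ptr[i + 1])
        local_edges = radius_topK(xy[start:end], xy[start:end], k)
        edge_chunks.append(local_edges + start)
    return torch.cat(edge_chunks, dim=1)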

xiaowuge1201 commented 4 months ago

Hello @SunHaoOne, how can we utilize the features of the previous frame for online deployment?

ares89 commented 4 months ago

Hi, @kennethweitzel

What do you use for k in this case?

Since we aim to deploy the model, using a radius produces dynamic tensor shapes (the number of neighbors within the radius varies), which increases computation time, so we cap the neighborhood with top-k. Through data analysis, for example by taking coordinates from the dataset and computing the connectivity relationships, you can identify a balance point that relates the values of r and k. In my scenario we use k = 15, but you can adjust k for different r values, e.g. r = 50, 100, 150.

And also the function you provided seems like it can't handle batches. How did you handle that?

To simplify inference, we removed the batch dimension. My understanding is that the Batch data structure in torch_geometric is obtained by stacking each graph's relationships, so you can iterate over the graphs in the batch, compute the connectivity for each one, and then merge the results along that dimension.

radius() and radius_graph() can be computed in the data processing stage.
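For the offline route, a minimal sketch of precomputing edges during data preprocessing (the function and field names are illustrative; this only covers data that is available ahead of time):

import torch
from torch_cluster import radius_graph

def precompute_edges(sample: dict, a2a_radius: float = 50.0) -> dict:
    # Run once in the dataset's preprocessing step and store the result with the sample.
    pos = sample['agent_position']            # [num_agents, 2]; illustrative field name
    sample['a2a_edge_index'] = radius_graph(pos, r=a2a_radius, max_num_neighbors=300)
    return sample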

SunHaoOne commented 4 months ago

radius() and radius_graph() can be computed in the data processing stage.

You are right. The edge radius can be calculated offline. However, during online inference this procedure cannot be accelerated by TensorRT, so I think it is recommended to use this function in the ONNX file.