Open zaplm opened 1 year ago
It didn't happen right at the beginning of training. As the training progresses, the GPU memory usage increases. In the first epoch, it occupies 20GB per GPU, but by the 30th epoch, the GPU memory usage reaches 32GB per GPU.
I encountered a similar phenomenon. The GPU memory grows with training. Maybe the code has a memory leak.
Hi, are you watching the result of nvidia-smi?
Yes, while using RTX 3090, I noticed that the program quickly ran out of GPU memory by epoch 2. However, when I switched to A100, I observed that it consumed nearly 33GB of memory per GPU during training about 40 epochs.
Hmm... I noticed a similar phenomenon during training. I think there may be several reasons. First, if the number of agents/map elements within a batch happens to be extremely large, then the peak memory cost will blow up. Second, there exists a memory leak in the code (but I can't identify it, maybe someone can help). Lastly, the memory usage displayed by nvidia-smi is not that accurate (I think it is merely showing the buffer size created by the program).
I usually use four A40 (48GB memory) to train models with batch size=8 per GPU. The memory usage shown by nvidia-smi varies from 30GB to ~40GB per GPU.
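One way to tell these three explanations apart is to log PyTorch's own allocator counters next to nvidia-smi. The sketch below is only an illustration and is not code from the repository. The helper name cuda_memory_report is hypothetical, while torch.cuda.memory_allocated, torch.cuda.memory_reserved and torch.cuda.max_memory_allocated are standard PyTorch calls. nvidia-smi roughly follows the reserved figure (plus the CUDA context), so a growing reserved value with a flat allocated value points to caching or fragmentation rather than a true leak.

```python
import torch


def cuda_memory_report(device=None):
    """Return allocated, reserved and peak-allocated CUDA memory in GiB.

    Hypothetical helper for illustration; returns zeros when CUDA is unavailable.
    """
    if not torch.cuda.is_available():
        return {"allocated_gib": 0.0, "reserved_gib": 0.0, "peak_allocated_gib": 0.0}
    gib = 1024 ** 3
    return {
        # Memory currently occupied by live tensors.
        "allocated_gib": torch.cuda.memory_allocated(device) / gib,
        # Memory held by the caching allocator (closer to what nvidia-smi shows).
        "reserved_gib": torch.cuda.memory_reserved(device) / gib,
        # Highest allocated value since the last reset of the peak statistics.
        "peak_allocated_gib": torch.cuda.max_memory_allocated(device) / gib,
    }


print(cuda_memory_report())
```

Logging this at the end of every epoch (and calling torch.cuda.reset_peak_memory_stats() at the start of each one) would show whether allocated memory itself keeps growing, or whether only the peak jumps on batches with unusually many agents or map elements.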
Thanks for your response! I will investigate the underlying cause of the memory leak issue.
@zaplm @flclain @ZikangZhou Hi, I encountered a similar problem. Have you found the underlying cause of the memory leak? I train the model with batch size 4 per GPU on 4 RTX 4090s (24GB). At the beginning it is normal and uses 20GB per GPU, but it runs out of GPU memory by epoch 3. I tried reducing the batch size to 2 per GPU, but it still runs out of GPU memory later.
@Qingfeng800, you can also attempt to decrease the radius, which will also result in lower GPU memory usage.
Yes, adjusting hyperparameters can reduce GPU memory usage, but it may affect accuracy. Since GPU memory usage gradually increases as training proceeds, there must be a bug in the code.
I used two A800s (80 GB) to train the model with batch size 16. The memory growth is not a problem for me, so I didn't look into it. If you are concerned about hurting model performance, maybe you can decrease your batch size and choose a smaller learning rate.
Hi @zaplm, have you been able to figure out the underlying memory leak issue? Otherwise we could maybe reopen the issue.
Anyone encountering a situation where the GPU memory is insufficient can reduce memory usage significantly by using the command model = torch.compile(model). In my experimental setup, this approach reduced GPU memory consumption by approximately 70%.
I tried model = torch.compile(model), and it reports this error:
RuntimeError: CUDA error: misaligned address
Is there a version mismatch or something else wrong? This is my environment:
- PyTorch version: 2.0.1
- CUDA available: True
- CUDA version: 11.8
I am currently using PyTorch version 2.1.0 (with CUDA 11.8 support) and PyTorch Lightning version 2.1.2. I've encountered an issue related to 'self.log' conflicts. A similar issue has been discussed in this GitHub thread: PyTorch Lightning GitHub Issue #18835.
Thanks for sharing. I reinstalled PyTorch to match your version and the original problem is gone, but a new error appears:
torch._dynamo.exc.TorchRuntimeError: Failed running call_function torch_cluster.radius(*(FakeTensor(..., device='cuda:0', size=(561, 2)), FakeTensor(..., device='cuda:0', size=(561, 2)), FakeTensor(..., device='cuda:0', size=(9,), dtype=torch.int64), FakeTensor(..., device='cuda:0', size=(9,), dtype=torch.int64), 30.0, 301, 1), **{}): The tensor has a non-zero number of elements, but its data is not allocated yet. Caffe2 uses a lazy allocation, so you will need to call mutable_data() or raw_mutable_data() to actually allocate memory.
from user code: File "/home/lww/anaconda3/envs/QCNet/lib/python3.8/site-packages/torch_cluster/radius.py", line 82, in
return torch.ops.torch_cluster.radius(x, y, ptr_x, ptr_y, r,
Set TORCH_LOGS="+dynamo" and TORCHDYNAMO_VERBOSE=1 for more information
You can suppress this exception and fall back to eager by setting:
import torch._dynamo
torch._dynamo.config.suppress_errors = True
My environment:
- pytorch-lightning 2.1.2
- pytorch-triton 2.1.0+bcad9dabe1
- torch 2.1.0+cu118
- torch-cluster 1.6.3+pt21cu118
- torch_geometric 2.4.0
- torch-scatter 2.1.2+pt21cu118
- torch-sparse 0.6.17
- torch-spline-conv 1.2.2+pt21cu118
- torchaudio 2.1.0+cu118
- torchmetrics 0.11.4
- torchvision 0.16.0+cu118
This is my complete PyTorch-related environment:
$ pip list | grep torch
pytorch-lightning 2.1.2
pytorch-triton 2.1.0+e6216047b8
torch 2.1.0+cu118
torch-geometric 2.3.1
torch-scatter 2.1.2
torch2trt 0.4.0
torchaudio 2.1.0+cu118
torchmetrics 1.2.0
torchvision 0.16.0+cu118
I wrote a simple radius function that filters indices based on the computed distances. If this approach does not work for you, feel free to write your own implementation of this function.
I tried model = torch.compile(model), and it reports this error:
raise TypeError(f"Trainer.fit() requires a LightningModule, got: {model.__class__.__qualname__}")
TypeError: Trainer.fit() requires a LightningModule, got: OptimizedModule
Sorry, where should I add this command?
In train_qcnet.py, find this line:
model = QCNet(**vars(args))
Then add this line right after it:
model = torch.compile(model)
I also added this line of code where you mentioned, but it still reports an error:
raise TypeError(f"Trainer.fit() requires a LightningModule, got: {model.__class__.__qualname__}")
TypeError: Trainer.fit() requires a LightningModule, got: OptimizedModule
I suggest verifying that your PyTorch and PyTorch Lightning versions match the ones mentioned earlier.
I have the same torch_cluster.radius error. Have you solved it?
Thanks for your answer. Yes, it was my fault; I have updated pytorch-lightning. But now I get the same error as the earlier poster. How did you solve this problem?
File "/home/syk/miniconda3/envs/py38/lib/python3.8/site-packages/torch/_ops.py", line 692, in __call__
return self._op(*args, **kwargs or {})
torch._dynamo.exc.TorchRuntimeError: Failed running call_function torch_cluster.radius(*(FakeTensor(..., device='cuda:0', size=(11550, 2)), FakeTensor(..., device='cuda:0', size=(11550, 2)), FakeTensor(..., device='cuda:0', size=(201,), dtype=torch.int64), FakeTensor(..., device='cuda:0', size=(201,), dtype=torch.int64), 35, 301, 1), **{}): The tensor has a non-zero number of elements, but its data is not allocated yet. Caffe2 uses a lazy allocation, so you will need to call mutable_data() or raw_mutable_data() to actually allocate memory.
from user code:
File "/home/syk/miniconda3/envs/py38/lib/python3.8/site-packages/torch_cluster/radius.py", line 82, in
Set TORCH_LOGS="+dynamo" and TORCHDYNAMO_VERBOSE=1 for more information
You can suppress this exception and fall back to eager by setting:
import torch._dynamo
torch._dynamo.config.suppress_errors = True
You can try following https://github.com/pytorch/pytorch/issues/95791#issuecomment-1595237235, or write a custom radius function: compute the pairwise distances and keep the pairs whose distance is less than the radius.
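As a rough illustration of the "compute the distances and filter by radius" idea, here is a minimal pure-PyTorch sketch. It is not the implementation used by anyone in this thread. The function name dense_radius and its argument names are hypothetical, the output layout (row 0 indexes y, row 1 indexes x) is assumed to mirror torch_cluster.radius, and the inclusive comparison (<=) is also an assumption. It builds a dense [M, N] distance matrix, so memory grows with M * N, and whether it traces cleanly under torch.compile has not been verified.

```python
import torch


def dense_radius(x, y, r, batch_x=None, batch_y=None):
    """For each point in y, find all points in x within distance r.

    Hypothetical helper: returns a [2, E] index tensor whose first row
    indexes y and whose second row indexes x (assumed layout).
    """
    dist = torch.cdist(y, x)  # pairwise Euclidean distances, shape [M, N]
    mask = dist <= r          # inclusive radius test (an assumption)
    if batch_x is not None and batch_y is not None:
        # Only keep pairs that belong to the same graph in the batch.
        mask &= batch_y.unsqueeze(1) == batch_x.unsqueeze(0)
    idx_y, idx_x = mask.nonzero(as_tuple=True)
    return torch.stack([idx_y, idx_x], dim=0)


x = torch.tensor([[0.0, 0.0], [1.0, 0.0], [5.0, 0.0]])
y = torch.tensor([[0.0, 0.0], [5.0, 0.0]])
batch_x = torch.tensor([0, 0, 1])
batch_y = torch.tensor([0, 1])
print(dense_radius(x, y, 1.5, batch_x, batch_y))
```

Note that the number of returned edges depends on the data, which is the dynamic-shape behaviour discussed later in this thread.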
@SunHaoOne Could you provide some code for your custom radius function? Did you replace the whole radius_graph function from torch-cluster, or did you integrate your radius function into torch-cluster?
Hi, @kennethweitzel
For example, I didn't use the 'r' parameter to calculate the distance. Instead, you can adjust the topk approach to incorporate radius parameters. The radius_graph function calculates the radius relation within a single point set, so it can serve as an approximation instead of using the conventional radius function.
import torch

def radius_topK(xy_A: torch.Tensor, xy_B: torch.Tensor, k: int) -> torch.Tensor:
    """
    Find the top k nearest points in set A for each point in set B and return their indices.
    Args:
        xy_A: Coordinates of points in set A, shape [N, 2].
        xy_B: Coordinates of points in set B, shape [M, 2].
        k: The number of nearest neighbors to find for each point in B.
    Returns:
        A tensor of shape [2, M*k] where the first row contains repeated indices of points in B
        and the second row contains indices of the nearest points in A for each point in B.
    """
    device = xy_A.device
    N = xy_A.shape[0]
    M = xy_B.shape[0]
    # Expand xy_A and xy_B to compute pairwise distances
    xy_A_expanded = xy_A.unsqueeze(0).expand(M, N, 2).to(device)  # Shape: [M, N, 2]
    xy_B_expanded = xy_B.unsqueeze(1).expand(M, N, 2).to(device)  # Shape: [M, N, 2]
    # Compute the Euclidean distance between each pair of points
    rel_dist = torch.norm(xy_B_expanded - xy_A_expanded, dim=-1)  # Shape: [M, N]
    # Find the indices of the k nearest neighbors in A for each point in B
    # (torch.topk requires k <= N)
    nearest_idx_A = torch.topk(rel_dist, k, largest=False, sorted=True).indices  # Shape: [M, k]
    # Generate repeated indices for points in B
    idx_B = torch.arange(M, device=device).unsqueeze(-1).repeat(1, k).view(-1)  # Shape: [M*k]
    # Flatten the indices of nearest points in A
    nearest_idx_A_flat = nearest_idx_A.view(-1)  # Shape: [M*k]
    # Combine the indices of B and nearest points in A
    combined_indices = torch.stack((idx_B, nearest_idx_A_flat), dim=0)  # Shape: [2, M*k]
    return combined_indices
@SunHaoOne Thanks a lot! I have some further questions: What do you use for k in this case? Do you think this can entirely replace the radius function, as there are parts of code where the radius function is called directly with quite a specific radius (e.g. 150m or 50m) like in https://github.com/ZikangZhou/QCNet/blob/55cacb418cbbce3753119c1f157360e66993d0d0/modules/qcnet_agent_encoder.py#L157-L158 with radii defined in
python train_qcnet.py --root /path/to/dataset_root/ --train_batch_size 4 --val_batch_size 4 --test_batch_size 4 --devices 8 --dataset argoverse_v2 --num_historical_steps 50 --num_future_steps 60 --num_recurrent_steps 3 --pl2pl_radius 150 --time_span 10 --pl2a_radius 50 --a2a_radius 50 --num_t2m_steps 30 --pl2m_radius 150 --a2m_radius 150
And also the function you provided seems like it can't handle batches. How did you handle that? Thanks in advance!
Hi, @kennethweitzel
What do you use for k in this case?
Since we aim to deploy the model, using radius can cause dynamic dimensions due to non-zero elements, which leads to increased computation time. Therefore, we limit this using topK. Through data analysis, for example by taking some coordinates from the datasets and calculating their connectivity relationships, we can find a balance point that lets us establish a correlation between the values of r and k. For instance, in my scenario we use k = 15, but you can adjust k according to different r values, such as r = 50, 100, 150.
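The data analysis described above could look something like the following sketch, which counts how many neighbors each point has within r and picks a quantile of those counts as k. This is a guess at the kind of analysis meant, not the procedure actually used. The function name suggest_k_for_radius and the quantile parameter q are hypothetical, and the input is assumed to be the coordinates of a single scene.

```python
import torch


def suggest_k_for_radius(xy, r, q=0.95):
    """Suggest a top-k value that covers a fraction q of the radius-r neighborhoods.

    Hypothetical helper: xy has shape [N, 2] and holds the points of one scene.
    """
    dist = torch.cdist(xy, xy)           # pairwise distances, shape [N, N]
    counts = (dist <= r).sum(dim=1) - 1  # neighbors within r, excluding the point itself
    k = torch.quantile(counts.float(), q).ceil()
    return max(int(k.item()), 1)         # always keep at least one neighbor


xy = torch.tensor([[0.0, 0.0], [1.0, 0.0], [2.0, 0.0], [10.0, 0.0]])
print(suggest_k_for_radius(xy, r=1.5, q=1.0))
```

Running this over a sample of scenes for each radius (for example r = 50, 100, 150) would give one candidate k per radius. A lower q trades some missed neighbors for a smaller, fixed-size neighbor set.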
And also the function you provided seems like it can't handle batches. How did you handle that?
To simplify for inference, we removed the batch dimension. My understanding is that the Batch data structure in torch_geometric is obtained by stacking each graph's relationships. Similarly, you can iterate over the batch dimension, calculate the connectivity relationships for each graph, and then merge the results.
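The per-graph loop described above can be sketched as follows. This is an illustrative reading of the suggestion, not the deployed code. The names knn_edges and batched_knn_edges are hypothetical, and ptr_A and ptr_B are assumed to be cumulative node offsets per graph (the same kind of ptr vector that appears as ptr_x and ptr_y in the torch_cluster call in the traceback earlier in this thread).

```python
import torch


def knn_edges(xy_A, xy_B, k):
    """Indices of the k nearest points in A for each point in B, shape [2, M*k]."""
    k = min(k, xy_A.shape[0])                           # torch.topk requires k <= N
    dist = torch.cdist(xy_B, xy_A)                      # shape [M, N]
    idx_A = torch.topk(dist, k, largest=False).indices  # shape [M, k], nearest first
    idx_B = torch.arange(xy_B.shape[0], device=xy_B.device).repeat_interleave(k)
    return torch.stack((idx_B, idx_A.reshape(-1)), dim=0)


def batched_knn_edges(xy_A, xy_B, ptr_A, ptr_B, k):
    """Run knn_edges graph by graph and merge the results with global indices."""
    edges = []
    for g in range(len(ptr_A) - 1):
        a0, a1 = int(ptr_A[g]), int(ptr_A[g + 1])
        b0, b1 = int(ptr_B[g]), int(ptr_B[g + 1])
        if a1 == a0 or b1 == b0:
            continue  # nothing to connect in this graph
        local = knn_edges(xy_A[a0:a1], xy_B[b0:b1], k)
        # Shift local indices back to positions in the stacked batch.
        offset = torch.tensor([[b0], [a0]], device=local.device)
        edges.append(local + offset)
    if not edges:
        return torch.empty((2, 0), dtype=torch.long, device=xy_A.device)
    return torch.cat(edges, dim=1)


xy_A = torch.tensor([[0.0, 0.0], [1.0, 0.0], [10.0, 0.0], [11.0, 0.0]])
xy_B = torch.tensor([[0.1, 0.0], [10.9, 0.0]])
print(batched_knn_edges(xy_A, xy_B, ptr_A=[0, 2, 4], ptr_B=[0, 1, 2], k=1))
```

Because each graph is processed separately, no edge can connect points from different scenes, and the Python loop stays cheap as long as the batch size is small.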
Hello @SunHaoOne, how do you utilize the features of the previous frame for online deployment?
radius() radius_graph() can be calculated in data processing stage.
You are right. The edge radius can be calculated offline. However, during online inference, this procedure cannot be accelerated by TensorRT. I think it is recommended to use this function in the ONNX file.
@ZikangZhou Hi Zhou! I noticed that when I train this code, it utilizes 32GB of GPU memory per GPU instead of the 20GB mentioned in the README.md. Could you please explain what might be causing this discrepancy in this repository?