fudan-zvg / RoadNet

[ICCV2023 Oral] RoadNetworkTRansformer & [AAAI 2024] LaneGraph2Seq
MIT License

About training time #3

Open · ZJWang9928 opened this issue 5 months ago

ZJWang9928 commented 5 months ago

Hello, I am running the lanegraph2seq training code on nuScenes. Each batch takes about 3.6 seconds, so the whole training run will take about 20 days. Is this speed normal? (training log screenshot attached)

BTW, would it be possible for you to release the pre-trained checkpoint ckpts/lssego_segmentation_48x32_b4x8_resnet_adam_24e_ponsplit_19.pth?
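
For reference, the numbers above imply roughly this many iterations (pure arithmetic from the figures quoted, not read from the config):

```python
# Back-of-the-envelope check: 3.6 s per batch and a ~20-day ETA together
# imply roughly half a million training iterations.
sec_per_batch = 3.6
eta_days = 20
total_iters = eta_days * 24 * 3600 / sec_per_batch
print(f"~{total_iters:,.0f} iterations")  # ≈ 480,000
```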

VictorLlu commented 5 months ago

Apologies for the delay due to the ECCV submission deadline. In compliance with our company's confidentiality policies, the original code cannot be published. This version of the code has been independently reproduced. While the original code is known for its speed, I'm currently unable to determine the cause of performance issues in this version. I am actively investigating the matter.

VictorLlu commented 5 months ago

The checkpoint is the pretraining checkpoint. Please refer to https://github.com/fudan-zvg/RoadNet/issues/2#issuecomment-2004486882

ZJWang9928 commented 5 months ago


https://github.com/fudan-zvg/RoadNet/blob/9a83cf6aa09896e6df6c36c3a534e9b9ab075a7b/RoadNetwork/rntr/init.py#L24C1-L24C52

Hi @VictorLlu, thank you for the update. However, the module imported in the line below still seems to be missing. Could you please add it?

`from .data import nuscenes_converter_pon_centerline`


ZJWang9928 commented 5 months ago


@VictorLlu Comparing training with 1, 2, 4, and 8 GPUs, I found that the batch time is almost NUM_GPUs * batch_time_per_GPU + $\Delta$. Is this phenomenon abnormal?

NUM_GPUs = 1: (screenshot)

NUM_GPUs = 2: (screenshot)

NUM_GPUs = 4: (screenshot)

NUM_GPUs = 8: (screenshot)
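
Written out, the observed scaling is roughly

$$t_{\text{batch}}(N) \approx N \cdot t_{\text{batch}}(1) + \Delta$$

i.e. the per-iteration time grows almost linearly with the number of GPUs instead of staying flat, which would be consistent with some per-GPU cost (e.g. data loading) being serialized rather than overlapped.
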
ZJWang9928 commented 5 months ago

@VictorLlu FYI, I printed the time for data preprocessing (time1), forward propagation (time2), and backward propagation (time3) in mmengine/model/wrappers/distributed.py when training with 8 GPUs. It seems that the main cause of the long average batch time is extremely slow backward propagation in some iterations. (timing log screenshot attached) BTW, could you please add the nuscenes_converter_pon_centerline file soon? It would be of great help. Thank you.
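
A minimal sketch of this kind of per-phase timing (a standalone illustration with hypothetical names, not the actual MMDistributedDataParallel code):

```python
import time
import torch

def timed(fn, *args, **kwargs):
    """Run fn and return (result, wall time), synchronizing CUDA so the
    measurement includes the kernels launched by fn."""
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    start = time.perf_counter()
    out = fn(*args, **kwargs)
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    return out, time.perf_counter() - start

# Hypothetical usage inside a training step:
# batch, time1  = timed(next, data_iter)   # data loading / preprocessing
# losses, time2 = timed(model, batch)      # forward propagation
# _, time3      = timed(loss.backward)     # backward propagation
```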

VictorLlu commented 5 months ago

I've made a minor modification to the image loading process:

# Inside the multi-view image loading transform (`filename` is the list of
# per-camera image paths; `get` comes from mmengine.fileio).
img_bytes = [
    get(name, backend_args=self.backend_args) for name in filename
]
# Decode each byte buffer with mmcv.imfrombytes instead of calling
# mmcv.imread once per file.
img = [
    mmcv.imfrombytes(img_byte, flag=self.color_type)
    for img_byte in img_bytes
]
# Stack the per-camera images along a new trailing axis: (H, W, C, N_views).
img = np.stack(img, axis=-1)

This approach replaces the use of mmcv.imread. It has provided some improvement, yet the loading time remains significantly long.

VictorLlu commented 5 months ago

I find that it is closely related to num_workers.


I've noticed that the delay between iterations directly corresponds to the num_workers setting in multi-GPU training scenarios. Despite eliminating every time-consuming element in the dataloader, it still experiences delays at intervals consistent with the num_workers count. This suggests that the issue might stem from mmdetection3d rather than the dataloader itself.
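
For anyone checking their own setup, these are the dataloader settings the stall pattern points at (an illustrative mmengine-style snippet with placeholder values, not the repository's actual config):

```python
# Illustrative mmengine-style dataloader settings (placeholder values).
train_dataloader = dict(
    batch_size=1,
    num_workers=4,             # the stall interval reported above tracks this value
    persistent_workers=True,   # keep worker processes alive between epochs
    pin_memory=True,           # faster host-to-device copies
    sampler=dict(type='DefaultSampler', shuffle=True),
    # dataset=...              # keep the existing dataset settings unchanged
)
```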

VictorLlu commented 5 months ago


The issue has also been observed in other mmdetection3d models, suggesting it might be a problem inherent to this version. I will push an mmdetection 0.17.1 version in the next few days.

EchoQiHeng commented 4 months ago

When I train on a single 2080 GPU, it will take 59 days to complete training...

ZJWang9928 commented 4 months ago


Hi! It has been two weeks now; when will the mmdetection 0.17.0 version be available? It would be of significant help.

raimberri commented 2 months ago


Hi, I have found a solution in the MMDetection issues (https://github.com/open-mmlab/mmdetection/issues/11503): update your PyTorch version to >= 2.2. I tested it and it successfully reduced the training time from 25 days to about 4 days (screenshots attached). Hopefully this works for you. BTW, I used the latest PyTorch 2.3.1, installed via `conda install pytorch torchvision torchaudio pytorch-cuda=11.8 -c pytorch -c nvidia`, with the rest of the environment unchanged.
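
A quick way to confirm the training environment actually picked up the new build (an illustrative check):

```python
import torch

print(torch.__version__)          # expect >= 2.2, e.g. 2.3.1
print(torch.version.cuda)         # expect the CUDA build installed above, e.g. 11.8
print(torch.cuda.is_available())  # should be True on the training machine
```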

y352184741 commented 2 months ago


Hi, I updated PyTorch to the latest version and successfully reduced the training time. However, the gradients become NaN after a certain number of iterations and the losses drop to 0. Did you encounter this problem during training? (training log screenshots attached)
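
One way to narrow down where the NaNs first appear (an illustrative debugging aid, not part of the repository code):

```python
import torch

# Anomaly detection makes backward raise at the op that produced NaN/Inf.
# It slows training noticeably, so enable it only while debugging.
torch.autograd.set_detect_anomaly(True)

# Alternatively, inspect gradients right after backward (hypothetical `model`):
# bad = [n for n, p in model.named_parameters()
#        if p.grad is not None and not torch.isfinite(p.grad).all()]
# print("non-finite grads in:", bad)
```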

raimberri commented 2 months ago


I met the same problem. What I tried was changing batch_size from 2 to 4 and the learning rate from 2e-4 to 1e-4; after that the problem was gone and the model trains normally.

Original hyperparameters (batch_size=2, lr=2e-4) log: (screenshot)

Modified hyperparameters (batch_size=4, lr=1e-4) log: (screenshot)

However, I haven't dug into it in detail since training hasn't finished yet, so I can only offer a rough guess that the problem is caused by some abnormal data input/ground truth, and that enlarging the batch size may mitigate its impact.
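
In mmengine-style config terms, the change described above looks roughly like this (a sketch: the optimizer type is a guess, and the commented-out clip_grad line is an optional extra safeguard rather than part of the fix described here):

```python
# Sketch of the adjustment described above (values from the comment).
train_dataloader = dict(batch_size=4)        # was 2
optim_wrapper = dict(
    optimizer=dict(type='AdamW', lr=1e-4),   # was lr=2e-4; optimizer type is a guess
    # Optional safeguard against exploding gradients (not used above):
    # clip_grad=dict(max_norm=35, norm_type=2),
)
```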

y352184741 commented 2 months ago


Hi~ Have you finished the training and successfully reproduced the results from the paper?

wangpinzhi commented 2 months ago


Hello, did you manage to reproduce the results in the paper?

raimberri commented 1 month ago


FYI, I do have some results, though they are not great (shown below). The model didn't converge well (probably because of the hyperparameter settings and limited GPU resources), and I didn't spend much time optimizing it or writing a well-designed visualization script, so waiting for the officially released model weights and visualization script is probably the best solution. (results screenshot attached)