fudan-zvg / RoadNet

[ICCV2023 Oral] RoadNetworkTRansformer & [AAAI 2024] LaneGraph2Seq
MIT License

About training time #3

Open · ZJWang9928 opened this issue 5 months ago

ZJWang9928 commented 5 months ago

Hello, I am running the lanegraph2seq training code on nuScenes. Each batch takes about 3.6 seconds, so the whole training run will take about 20 days. Is this speed normal? (training log screenshot attached)

BTW, would it be possible for you to release the pre-trained checkpoint ckpts/lssego_segmentation_48x32_b4x8_resnet_adam_24e_ponsplit_19.pth?
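
For reference, the numbers above imply roughly this many iterations (pure arithmetic from the figures quoted, not read from the config):

```python
# Back-of-the-envelope check: 3.6 s per batch and a ~20-day ETA together
# imply roughly half a million training iterations.
sec_per_batch = 3.6
eta_days = 20
total_iters = eta_days * 24 * 3600 / sec_per_batch
print(f"~{total_iters:,.0f} iterations")  # ≈ 480,000
```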

VictorLlu commented 5 months ago

Apologies for the delay due to the ECCV submission deadline. In compliance with our company's confidentiality policies, the original code cannot be published. This version of the code has been independently reproduced. While the original code is known for its speed, I'm currently unable to determine the cause of performance issues in this version. I am actively investigating the matter.

VictorLlu commented 5 months ago

The checkpoint is the pretraining checkpoint. Please refer to https://github.com/fudan-zvg/RoadNet/issues/2#issuecomment-2004486882

ZJWang9928 commented 5 months ago


https://github.com/fudan-zvg/RoadNet/blob/9a83cf6aa09896e6df6c36c3a534e9b9ab075a7b/RoadNetwork/rntr/init.py#L24C1-L24C52

Hi @VictorLlu, thank you for the update. However, the module imported in the line below still seems to be missing. Could you please add it?

`from .data import nuscenes_converter_pon_centerline`


ZJWang9928 commented 5 months ago


@VictorLlu Comparing training with 1, 2, 4, and 8 GPUs, I found that the batch time is almost NUM_GPUs * batch_time_per_GPU + $\Delta$. Is this phenomenon abnormal?

NUM_GPUs = 1: (screenshot)

NUM_GPUs = 2: (screenshot)

NUM_GPUs = 4: (screenshot)

NUM_GPUs = 8: (screenshot)
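
Written out, the observed scaling is roughly

$$t_{\text{batch}}(N) \approx N \cdot t_{\text{batch}}(1) + \Delta$$

i.e. the per-iteration time grows almost linearly with the number of GPUs instead of staying flat, which would be consistent with some per-GPU cost (e.g. data loading) being serialized rather than overlapped.
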
ZJWang9928 commented 5 months ago

@VictorLlu FYI, I printed the time for data preprocessing (time1), forward propagation (time2), and backward propagation (time3) in mmengine/model/wrappers/distributed.py when training with 8 GPUs. It seems that the main cause of the long average batch time is extremely slow backward propagation in some iterations. (timing log screenshot attached) BTW, could you please add the nuscenes_converter_pon_centerline file soon? It would be of great help. Thank you.
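
A minimal sketch of this kind of per-phase timing (a standalone illustration with hypothetical names, not the actual MMDistributedDataParallel code):

```python
import time
import torch

def timed(fn, *args, **kwargs):
    """Run fn and return (result, wall time), synchronizing CUDA so the
    measurement includes the kernels launched by fn."""
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    start = time.perf_counter()
    out = fn(*args, **kwargs)
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    return out, time.perf_counter() - start

# Hypothetical usage inside a training step:
# batch, time1  = timed(next, data_iter)   # data loading / preprocessing
# losses, time2 = timed(model, batch)      # forward propagation
# _, time3      = timed(loss.backward)     # backward propagation
```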

VictorLlu commented 5 months ago

I've made a minor modification to the image loading process:

# Inside the multi-view image loading transform (`filename` is the list of
# per-camera image paths; `get` comes from mmengine.fileio).
img_bytes = [
    get(name, backend_args=self.backend_args) for name in filename
]
# Decode each byte buffer with mmcv.imfrombytes instead of calling
# mmcv.imread once per file.
img = [
    mmcv.imfrombytes(img_byte, flag=self.color_type)
    for img_byte in img_bytes
]
# Stack the per-camera images along a new trailing axis: (H, W, C, N_views).
img = np.stack(img, axis=-1)

This approach replaces the use of mmcv.imread. It has provided some improvement, yet the loading time remains significantly long.

VictorLlu commented 5 months ago

I find that it is closely related to num_workers.


I've noticed that the delay between iterations directly corresponds to the num_workers setting in multi-GPU training scenarios. Despite eliminating every time-consuming element in the dataloader, it still experiences delays at intervals consistent with the num_workers count. This suggests that the issue might stem from mmdetection3d rather than the dataloader itself.
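
For anyone checking their own setup, these are the dataloader settings the stall pattern points at (an illustrative mmengine-style snippet with placeholder values, not the repository's actual config):

```python
# Illustrative mmengine-style dataloader settings (placeholder values).
train_dataloader = dict(
    batch_size=1,
    num_workers=4,             # the stall interval reported above tracks this value
    persistent_workers=True,   # keep worker processes alive between epochs
    pin_memory=True,           # faster host-to-device copies
    sampler=dict(type='DefaultSampler', shuffle=True),
    # dataset=...              # keep the existing dataset settings unchanged
)
```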

VictorLlu commented 5 months ago


The issue has also been observed in other mmdetection3d models, suggesting it might be a problem inherent to this version. I will push an mmdetection 0.17.1 version in the next few days.

EchoQiHeng commented 4 months ago

When I train on a single 2080 GPU, it will take 59 days to complete training...

ZJWang9928 commented 4 months ago


Hi! It has been two weeks now; when will the mmdetection 0.17.0 version be available? It would be of significant help.

raimberri commented 2 months ago


Hi, I have found a solution in the MMDetection issues (https://github.com/open-mmlab/mmdetection/issues/11503): update your PyTorch version to >= 2.2. I tested it and it successfully reduced the training time from 25 days to about 4 days (screenshots attached). Hopefully this works for you. BTW, I used the latest PyTorch 2.3.1, installed via `conda install pytorch torchvision torchaudio pytorch-cuda=11.8 -c pytorch -c nvidia`, with the rest of the environment unchanged.
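
A quick way to confirm the training environment actually picked up the new build (an illustrative check):

```python
import torch

print(torch.__version__)          # expect >= 2.2, e.g. 2.3.1
print(torch.version.cuda)         # expect the CUDA build installed above, e.g. 11.8
print(torch.cuda.is_available())  # should be True on the training machine
```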

y352184741 commented 2 months ago


Hi, I updated PyTorch to the latest version and successfully reduced the training time. However, the gradients become NaN after a certain number of iterations and the losses drop to 0. Did you encounter this problem during training? (training log screenshots attached)
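
One way to narrow down where the NaNs first appear (an illustrative debugging aid, not part of the repository code):

```python
import torch

# Anomaly detection makes backward raise at the op that produced NaN/Inf.
# It slows training noticeably, so enable it only while debugging.
torch.autograd.set_detect_anomaly(True)

# Alternatively, inspect gradients right after backward (hypothetical `model`):
# bad = [n for n, p in model.named_parameters()
#        if p.grad is not None and not torch.isfinite(p.grad).all()]
# print("non-finite grads in:", bad)
```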

raimberri commented 2 months ago


I met the same problem. What I tried was changing batch_size from 2 to 4 and the learning rate from 2e-4 to 1e-4; after that the problem was gone and the model trains normally.

Original hyperparameters (batch_size=2, lr=2e-4) log: (screenshot)

Modified hyperparameters (batch_size=4, lr=1e-4) log: (screenshot)

However, I haven't dug into it in detail since training hasn't finished yet, so I can only offer a rough guess that the problem is caused by some abnormal data input/ground truth, and that enlarging the batch size may mitigate its impact.
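
In mmengine-style config terms, the change described above looks roughly like this (a sketch: the optimizer type is a guess, and the commented-out clip_grad line is an optional extra safeguard rather than part of the fix described here):

```python
# Sketch of the adjustment described above (values from the comment).
train_dataloader = dict(batch_size=4)        # was 2
optim_wrapper = dict(
    optimizer=dict(type='AdamW', lr=1e-4),   # was lr=2e-4; optimizer type is a guess
    # Optional safeguard against exploding gradients (not used above):
    # clip_grad=dict(max_norm=35, norm_type=2),
)
```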

y352184741 commented 2 months ago


Hi~ Have you finished the training and successfully reproduced the results from the paper?

wangpinzhi commented 2 months ago


Hello, did you manage to reproduce the results in the paper?

raimberri commented 1 month ago


FYI, I do have some results, though they are not great (shown below). The model didn't converge well (probably because of the hyperparameter settings and limited GPU resources), and I didn't spend much time optimizing it or writing a well-designed visualization script, so waiting for the officially released model weights and visualization script is probably the best solution. (results screenshot attached)