Haiyang-W / DSVT

[CVPR2023] Official Implementation of "DSVT: Dynamic Sparse Voxel Transformer with Rotated Sets"
https://arxiv.org/abs/2301.06051
Apache License 2.0

Fail to reproduce DSVT-P on nuscenes dataset. #55

Closed synsin0 closed 10 months ago

synsin0 commented 10 months ago

Hi, I'd like to reproduce DSVT-P on the nuScenes dataset using the config you provided. I first evaluated the released checkpoint with this config and got the correct results (mAP=66, NDS=70). I then trained the same config on 8x A6000 GPUs with batch_size=4 for 20 epochs with CBGS, but I only get mAP=55, NDS=63 with checkpoint_20.pth. I wonder what's going wrong with my training process. Do I need more epochs?

Haiyang-W commented 10 months ago

Could you share your training log? I can easily reach 71 NDS on my end. We also provide a training log that you can carefully compare against.

Haiyang-W commented 10 months ago

The same config is fine.

synsin0 commented 10 months ago

I compared my log with the one you provided and found that I forgot to add sync_bn as an argument! I've added it and restarted my experiment; hopefully the next run reproduces the results. sync_bn may matter a lot.
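For reference, OpenPCDet-style training scripts expose this as a boolean command-line flag that, when set, typically converts all BatchNorm layers to SyncBatchNorm so statistics are synchronized across GPUs. A minimal sketch of the assumed argument (check the repo's `train.py` for the exact flag name and behavior):

```python
import argparse

# Sketch of an OpenPCDet-style training flag (name assumed from the
# discussion above; verify against the DSVT repo's train.py).
parser = argparse.ArgumentParser()
parser.add_argument('--sync_bn', action='store_true',
                    help='use synchronized BatchNorm in distributed training')

# Forgetting to pass the flag silently leaves it False, which is easy
# to miss when launching a multi-GPU run.
args = parser.parse_args(['--sync_bn'])
print(args.sync_bn)  # prints: True
```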

Haiyang-W commented 10 months ago

Nice! Try it first, and if you have any questions, please feel free to contact me.

Haiyang-W commented 10 months ago

Note that you should apply the fade strategy during training (turn off GT sampling for the last 4 or 5 epochs).

Haiyang-W commented 10 months ago

> Note that you should apply the fade strategy during training (turn off GT sampling for the last 4 or 5 epochs).

This is already written into the config, so you can just run the provided config directly.
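In OpenPCDet, this kind of fade strategy is commonly expressed as a hook entry in the config; a sketch of what such an entry looks like (field names assumed from OpenPCDet's `DisableAugmentationHook`; verify against the provided DSVT config):

```yaml
# Assumed OpenPCDet-style hook config: disables the gt_sampling
# augmentation for the final epochs of training (the "fade" strategy).
HOOK:
    DisableAugmentationHook:
        DISABLE_AUG_LIST: ['gt_sampling']
        NUM_LAST_EPOCHS: 5   # turn gt_sampling off for the last 5 epochs
```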

synsin0 commented 10 months ago

After turning on sync_bn, I still cannot reproduce the reported metrics. I've attached my training log so you can look for possible errors.

train_20230829-170810.log

Haiyang-W commented 10 months ago

> After turning on sync_bn, I still cannot reproduce the reported metrics. I've attached my training log so you can look for possible errors.
>
> train_20230829-170810.log

From the log, it seems you are not using DSVT's repo but the latest official OpenPCDet repo? They may have some differences in training. Could you replicate using DSVT's repo entirely? Your loss is much lower than ours.

Haiyang-W commented 10 months ago

By the way, could you use our repo to preprocess the nuScenes data? I'm not sure what the difference is between our nuScenes preprocessing and official OpenPCDet's; I will check it later.

Haiyang-W commented 10 months ago

If anyone manages to reproduce the nuScenes results, please let me know in this issue.

synsin0 commented 10 months ago

I'm setting up a fresh clone of the original DSVT repo and will preprocess and train with the default configs. Thanks for your patience.

Haiyang-W commented 10 months ago

> I'm setting up a fresh clone of the original DSVT repo and will preprocess and train with the default configs. Thanks for your patience.

If you run into any problems, please email me so we can look into them together.

synsin0 commented 10 months ago

Here is my new training log. I strictly followed the DSVT repo. Could you take a look at why my log shows only about half the training loss of yours? log_train_20230831-141035.txt

chenshi3 commented 10 months ago

> Here is my new training log. I strictly followed the DSVT repo. Could you take a look at why my log shows only about half the training loss of yours? log_train_20230831-141035.txt

After the upcoming deadline, which is just two days away, I will review your log and code to identify and address any potential issues.

Haiyang-W commented 10 months ago

> Here is my new training log. I strictly followed the DSVT repo. Could you take a look at why my log shows only about half the training loss of yours? log_train_20230831-141035.txt

The print-log interval we use is 1000, so you could either run a bit longer or restart the program with the print interval also set to 1000; averaging over 1000 iterations is a bit more stable.

Could you set it to 1000? Then we can make a better comparison.
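A rough illustration of why a longer print interval makes logs easier to compare: the logged value averages per-iteration losses over the window, and a 1000-iteration mean fluctuates far less than a short one. The numbers below are synthetic, not taken from either training log:

```python
import random

random.seed(42)

# Synthetic per-iteration losses around a true level of 9.5 with heavy
# noise, standing in for the jittery values printed during training.
losses = [9.5 + random.uniform(-2.0, 2.0) for _ in range(10000)]

def window_mean(values, size):
    """Mean over the first `size` iterations (one print interval)."""
    return sum(values[:size]) / size

short = window_mean(losses, 50)    # noisy estimate of the loss level
long = window_mean(losses, 1000)   # much tighter around the true 9.5
print(round(short, 3), round(long, 3))
```

With uniform noise of half-width 2.0, the standard deviation of the window mean shrinks as 1/sqrt(window size), so the 1000-iteration average is roughly 4.5x more stable than the 50-iteration one.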

synsin0 commented 10 months ago

I restarted and got a similar loss at the first interval of 1000, namely loss=9.1516, which is close to your log's loss=9.6196. I will keep tracking this experiment to see whether it can exceed NDS=70; if it does, I will close this issue.

Haiyang-W commented 10 months ago

> I restarted and got a similar loss at the first interval of 1000, namely loss=9.1516, which is close to your log's loss=9.6196. I will keep tracking this experiment to see whether it can exceed NDS=70; if it does, I will close this issue.

Great! We reproduced this code before release (two months ago), so there should be no problem. We hope you can reproduce it too.

Looking forward to your good news; we will also run it again soon. :)

synsin0 commented 10 months ago

My final reproduced result is mAP=0.6668, NDS=0.7122 for checkpoint_20.pth. Thanks for your help in getting the correct results!

Haiyang-W commented 10 months ago

> My final reproduced result is mAP=0.6668, NDS=0.7122 for checkpoint_20.pth. Thanks for your help in getting the correct results!

Great! Thanks for your timely reply.