CVMI-Lab / DODA

(ECCV 2022) DODA: Data-oriented Sim-to-Real Domain Adaptation for 3D Semantic Segmentation
Apache License 2.0

Cannot reproduce the VSS results with two gpus #4

Closed Trainingzy closed 2 years ago

Trainingzy commented 2 years ago

Thanks for your great work.

You use 8 GPUs with batch size 4 per GPU. I tried to reproduce your VSS results with two GPUs and 16 samples per GPU, but the mIoU on ScanNet is only 37.

In addition, I notice that you do not use sync BN. When I turn it on, I get mIoUs of 37.52 and 43.12 over three runs on ScanNet and S3DIS respectively, which is still far from the reported results (40.52 and 46.85).

Is there anything I missed that caused the poor results?
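For context on why GPU count changes results when sync BN is off: plain BN normalizes each GPU's shard with that shard's own statistics, while sync BN pools statistics across all GPUs. A minimal NumPy sketch of the difference (illustrative only, not the repo's code):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=3.0, size=(32, 8))  # global batch of 32 samples, 8 channels

def bn(batch, eps=1e-5):
    """Normalize with the batch's own mean/variance (BN in training mode)."""
    mean = batch.mean(axis=0)
    var = batch.var(axis=0)
    return (batch - mean) / np.sqrt(var + eps)

# Plain BN: each "GPU" normalizes its own shard independently (8 GPUs x batch 4).
per_gpu = np.concatenate([bn(shard) for shard in np.split(x, 8)])

# Sync BN: one set of statistics computed over the full global batch.
synced = bn(x)

# The outputs differ, so training dynamics depend on the per-GPU batch size.
print(np.abs(per_gpu - synced).max() > 1e-3)
```

With few samples per GPU, the local mean/variance estimates are noisy, which is why sync BN is usually recommended when the per-GPU batch is small.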

Dingry commented 2 years ago

Hi, I haven't tried 2 GPUs with a large per-GPU batch size for this experiment. I would recommend trying 8 GPUs, since empirically a large per-GPU batch size may lead to performance degradation. We will try your batch-size setting later.

Dingry commented 2 years ago

Besides, have you tried reducing the point_range in the config file? A large batch size can trigger an out-of-volume error such as `batchSize * outputVolume < std::numeric_limits<int>::max()`.
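The limit referenced here is the int32 range used for indexing the output volume; a quick back-of-the-envelope check (the grid dimensions below are hypothetical, not taken from the repo's config):

```python
import math

INT32_MAX = 2**31 - 1  # std::numeric_limits<int>::max() in C++

def fits_in_int32(batch_size, grid_dims):
    """Check whether batch_size * output volume stays within int32 indexing."""
    output_volume = math.prod(grid_dims)
    return batch_size * output_volume < INT32_MAX

# Hypothetical voxel grid derived from point_range / voxel_size.
grid = (1024, 1024, 256)
print(fits_in_int32(4, grid))   # True  (4 * 2^28 fits)
print(fits_in_int32(16, grid))  # False (16 * 2^28 overflows int32)
```

Shrinking point_range (and hence the grid) or the batch size brings the product back under the limit.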

Trainingzy commented 2 years ago

Thanks for your prompt reply. I do not have 8 GPUs to reproduce the results.

BTW, it is kind of weird that you do not use sync BN even though your code supports it. Sync BN should make the results consistent regardless of how many GPUs are used. May I ask for the results when using sync BN?

Trainingzy commented 2 years ago

> Besides, have you tried reducing the point_range in the config file? A large batch size can trigger an out-of-volume error such as `batchSize * outputVolume < std::numeric_limits<int>::max()`.

I keep all the other configs the same except for the batch size per GPU, so I am a little confused about this. Do you mean "out-of-memory", i.e. that a single GPU cannot hold a large batch?

Dingry commented 2 years ago

I have tried sync BN with pytorch=1.5 and the performance stays the same. To my knowledge, sync BN has bugs in PyTorch versions <= 1.5, so we do not use it in our official release (please refer to https://github.com/facebookresearch/detectron2/blob/32b61e64c76118b2e9fc2237f283a8e9c938bd16/detectron2/layers/batch_norm.py#L154).

Empirically, a large batch size may lead to performance degradation due to BN statistics in 3D segmentation or detection tasks (from my friends who also work on 3D tasks). Besides, we cannot run batch size 16 on a single GPU, since spconv 1.2 cannot handle large voxel volumes and reports errors frequently. Thus, we cannot reproduce your problem. I am not sure whether you installed another spconv version?
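On newer PyTorch versions (where the bug noted in the detectron2 link is fixed), converting a model's BN layers to sync BN is a one-liner; a sketch, assuming a recent PyTorch and a DDP setup configured elsewhere (the toy model is hypothetical, not the DODA backbone):

```python
import torch.nn as nn

# Toy model standing in for the segmentation backbone (hypothetical).
model = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1),
    nn.BatchNorm2d(16),
    nn.ReLU(),
)

# Replace every BatchNorm*d module with SyncBatchNorm; its statistics are
# all-reduced across processes during distributed (DDP) training.
model = nn.SyncBatchNorm.convert_sync_batchnorm(model)
print(type(model[1]).__name__)  # SyncBatchNorm
```

Outside a distributed run the converted layers still need an initialized process group to train, so the conversion is typically done right before wrapping the model in DistributedDataParallel.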

Trainingzy commented 2 years ago

I am using the following environment

python 3.7.13
pytorch 1.8.2
spconv 2.1.21

The environment seems to be different from yours. I cannot use your environment, since my machine has an NVIDIA RTX 3090, which requires CUDA 11.1.

Dingry commented 2 years ago

The code is not compatible with spconv 2.x at the moment; using spconv 2.x causes performance drops. I will try to figure out why and support spconv 2.x in the future. Thanks!

Trainingzy commented 2 years ago

Okay, thanks.