facebookresearch / adaptive_teacher

This repo provides the source code for "Cross-Domain Adaptive Teacher for Object Detection".

Cannot reproduce the results on "foggy cityscapes" due to an out-of-memory issue #33

Open onkarkris opened 2 years ago

onkarkris commented 2 years ago

I am getting a "Cannot allocate memory" error after around 13-15k iterations while trying to reproduce the results on the "foggy cityscapes" dataset. I am running this code on 4 GPUs on a machine with 360 GB of memory.

I can reproduce the VOC results on the same machine; the error occurs only on the Cityscapes dataset. I suspect the memory usage keeps increasing with the number of iterations.


Environment: Python 3.7.10, torch==1.7.0, torchvision==0.8.1, detectron2==0.5

Config parameters used for my trial: MAX_ITER: 100000, IMG_PER_BATCH_LABEL: 8, IMG_PER_BATCH_UNLABEL: 8, BASE_LR: 0.04, BURN_UP_STEP: 20000, EVAL_PERIOD: 1000, NUM_WORKERS: 4
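For reference, a minimal sketch of how I launch training with these values passed as Detectron2-style command-line overrides; the `train_net.py` entry point, `--config` flag, config path, and the key grouping (SOLVER.*, SEMISUPNET.*, TEST.*, DATALOADER.*) are assumptions based on the repo README and configs:

```bash
# Sketch of the launch command with the overrides above; the flag names,
# config path, and key grouping are assumptions -- check the repo README.
python train_net.py \
    --num-gpus 4 \
    --config configs/faster_rcnn_VGG_cross_city.yaml \
    SOLVER.MAX_ITER 100000 \
    SOLVER.IMG_PER_BATCH_LABEL 8 \
    SOLVER.IMG_PER_BATCH_UNLABEL 8 \
    SOLVER.BASE_LR 0.04 \
    SEMISUPNET.BURN_UP_STEP 20000 \
    TEST.EVAL_PERIOD 1000 \
    DATALOADER.NUM_WORKERS 4
```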

Error: ImportError: /scratch/1/ace14705nl/adaptive_teacher/.venv/lib/python3.7/site-packages/PIL/_imaging.cpython-37m-x86_64-linux-gnu.so: failed to map segment from shared object: Cannot allocate memory


UPDATE: When I tried this experiment on another GPU cluster (4 V100 NVLink GPUs, 256 GB memory), I could run the code for 28k iterations and get AP@50 of around 46, but again the process was terminated due to a memory issue:

"iterations =>> PBS: job killed: mem 269239236kb exceeded limit 268435456kb"

I can't figure out why so much memory (269 GB) is required when running this code on the Cityscapes dataset. I would highly appreciate any help. Thanks.

onkarkris commented 2 years ago

I solved this issue by updating the torch version (1.7.0 -> 1.8.1) and could reproduce AP50 of around 49. Thanks!
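In case it helps anyone else, the upgrade itself was just a pip install; the +cu111 wheels below are an assumption, so pick the build that matches your CUDA version:

```bash
# Sketch of the torch/torchvision upgrade (1.7.0 -> 1.8.1); the CUDA 11.1 (+cu111)
# wheels are an assumption -- choose the build matching your driver/CUDA setup.
pip install --upgrade torch==1.8.1+cu111 torchvision==0.9.1+cu111 \
    -f https://download.pytorch.org/whl/torch_stable.html
```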

yujheli commented 1 year ago

@onkarkris Sorry for the late reply. Glad to hear that the issue is solved. I will update the environment requirements accordingly based on your suggestion.

onkarkris commented 1 year ago

> @onkarkris Sorry for the late reply. Glad to hear that the issue is solved. I will update the environment requirements accordingly based on your suggestion.

@yujheli Thanks for your reply. However, I am still struggling to reach the accuracy reported in the paper (50.9); my reproduced result is 48.7 on 4 GPUs with IMG_PER_BATCH_LABEL: 8, IMG_PER_BATCH_UNLABEL: 8, and BASE_LR: 0.04. The rest of the settings are the same as in the original config file you uploaded.

I would really appreciate it if you could suggest how to reach the values reported in the paper. Is any change to the config file needed?

yujheli commented 1 year ago

Try reducing UNSUP_LOSS_WEIGHT from 1.0 to 0.5 or 0.25; this usually gives me a 2~3 percent performance gain. Also, try training the model longer, since as I remember I got the best result after 70k iterations with a batch size of 16.
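Something like the following overrides should work (a sketch only; the SEMISUPNET.UNSUP_LOSS_WEIGHT key name, config path, and flags are assumed from the repo's Detectron2-style configs):

```bash
# Sketch: lower the unsupervised loss weight and train longer via config overrides.
# Key names, config path, and flags are assumptions -- verify against the repo configs.
python train_net.py \
    --num-gpus 4 \
    --config configs/faster_rcnn_VGG_cross_city.yaml \
    SEMISUPNET.UNSUP_LOSS_WEIGHT 0.5 \
    SOLVER.MAX_ITER 100000
```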

Yorionice1 commented 1 year ago

Can you reproduce the performance in the paper without multi-scale training or multi-scale testing?

Yorionice1 commented 1 year ago

I train my model on 4 RTX 3090 GPUs with the same experiment settings as yours, but I can only reproduce AP50 of around 46. I would appreciate it if you could share your training experience.

onkarkris commented 1 year ago

> Can you reproduce the performance in the paper without multi-scale training or multi-scale testing?

@Yorionice1 What is multi-scale training? Do you mean ResNet + FPN?