Open Rurouni-z opened 6 months ago
I don't know why I always get OOM (out of memory) errors during training.
I train the model on a single A100, although the machine is a multi-GPU environment,
and my command selects a specific GPU:
CUDA_VISIBLE_DEVICES=2 python3 tools/train.py configs/stpls3d/isbnet_stpls3d.yaml --trainall --exp_name default
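To double-check the GPU pinning, here is a minimal sketch of how I understand `CUDA_VISIBLE_DEVICES` to remap devices. It uses only the standard library, and `visible_gpu_mapping` is a hypothetical helper of mine, not part of ISBNet. With the command above, the process should see physical GPU 2 as logical device 0, so the other GPUs in the machine should not be involved in the OOM.

```python
import os


def visible_gpu_mapping(env=None):
    """Map logical CUDA device index -> physical GPU id string.

    Simplified reading of CUDA_VISIBLE_DEVICES: a comma-separated list whose
    order defines the logical indices. This sketch does not validate the ids
    (the real CUDA runtime stops at the first invalid entry).
    """
    env = os.environ if env is None else env
    raw = env.get("CUDA_VISIBLE_DEVICES")
    if raw is None:
        return None  # variable unset: all GPUs visible in default order
    ids = [tok.strip() for tok in raw.split(",") if tok.strip()]
    return {logical: physical for logical, physical in enumerate(ids)}


if __name__ == "__main__":
    # Same setting as in the training command above.
    print(visible_gpu_mapping({"CUDA_VISIBLE_DEVICES": "2"}))  # {0: '2'}
```

Calling `visible_gpu_mapping()` with no argument reads the real environment of the current process, which could be printed at the start of a training run to confirm the setting took effect.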
Update:
I tried setting fp16=true in backbone.yaml and got worse results. But if I don't set it to true there, I have to set fp16=false in stpls3d.yaml, which causes recurring OOM:
https://github.com/VinAIResearch/ISBNet/blob/44beb835a25b91d98a7b8ad35b90969b941b8f3b/configs/stpls3d/isbnet_stpls3d.yaml#L81
If I set it to true and use the pretrained model from isbnet_backbone_stpls3d.yaml, then I get an error:
Otherwise I can train the whole network.