I have not encountered this! I in fact used multi-GPU training for the models I released. Can you share more about your setup, and your train.sh?
Thank you for your reply.

```bash
python train_nuscenes.py \
    --exp_name=${EXP_NAME} \
    --max_iters=25000 \
    --log_freq=1000 \
    --dset='trainval' \
    --batch_size=4 \
    --grad_acc=5 \
    --use_scheduler=True \
    --data_dir=$DATA_DIR \
    --log_dir='logs_nuscenes' \
    --ckpt_dir='checkpoints' \
    --res_scale=2 \
    --ncams=6 \
    --encoder_type='res50' \
    --do_rgbcompress=True \
    --device_ids=[0,1,2,3]
```
OK. So basically I don't know the answer here, but my strategy would be to try to simplify until it works.
For a start, how about commenting out the loss terms here: https://github.com/aharley/simple_bev/blob/main/train_nuscenes.py#L203-L208
and just using total_loss = loss_fn(seg_bev_e, seg_bev_g, valid_bev_g)?
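For concreteness, here is a minimal self-contained sketch of that simplification. The loss_fn below is a hypothetical masked-BCE stand-in for the repo's segmentation loss, and the tensors are dummies; the point is just that total_loss comes from this single term, with the auxiliary terms in the linked block left out:

```python
import torch
import torch.nn.functional as F

def loss_fn(seg_bev_e, seg_bev_g, valid_bev_g):
    # Hypothetical stand-in for the repo's segmentation loss: masked BCE on the
    # predicted BEV logits, averaged over valid cells only.
    bce = F.binary_cross_entropy_with_logits(seg_bev_e, seg_bev_g, reduction='none')
    return (bce * valid_bev_g).sum() / valid_bev_g.sum().clamp(min=1)

# Dummy tensors: batch of 4, single class, 200x200 BEV grid.
seg_bev_e = torch.randn(4, 1, 200, 200, requires_grad=True)   # predicted logits
seg_bev_g = torch.randint(0, 2, (4, 1, 200, 200)).float()     # ground-truth occupancy
valid_bev_g = torch.ones(4, 1, 200, 200)                      # validity mask

# Instead of summing several loss terms (the block linked above),
# keep only the main segmentation loss:
total_loss = loss_fn(seg_bev_e, seg_bev_g, valid_bev_g)
total_loss.backward()
```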
Hi Adam W., did you encounter this: single-card training is normal, but with multiple cards there are NaNs?
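For anyone chasing the same symptom, a minimal sketch (not from this thread) of how one might localize the first NaN, assuming a standard PyTorch training loop; the check_finite helper is a made-up name for illustration:

```python
import torch

# Ask autograd to report the operation that first produced a NaN/Inf during backward.
torch.autograd.set_detect_anomaly(True)

def check_finite(name, tensor):
    # Fail fast, naming the offending tensor, as soon as anything stops being finite.
    if not torch.isfinite(tensor).all():
        raise RuntimeError(f"{name} is not finite at this step")

# Inside the training loop one would call, e.g.:
#   check_finite("seg_bev_e", seg_bev_e)
#   check_finite("total_loss", total_loss)
```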