I have not encountered this! I in fact used multi-GPU training for the models I released. Can you share more about your setup, and your train.sh?
Thank you for your reply.

```bash
python train_nuscenes.py \
    --exp_name=${EXP_NAME} \
    --max_iters=25000 \
    --log_freq=1000 \
    --dset='trainval' \
    --batch_size=4 \
    --grad_acc=5 \
    --use_scheduler=True \
    --data_dir=$DATA_DIR \
    --log_dir='logs_nuscenes' \
    --ckpt_dir='checkpoints' \
    --res_scale=2 \
    --ncams=6 \
    --encoder_type='res50' \
    --do_rgbcompress=True \
    --device_ids=[0,1,2,3]
```
OK. So basically I don't know the answer here, but my strategy would be to try to simplify until it works.
For a start, how about commenting out the loss terms here: https://github.com/aharley/simple_bev/blob/main/train_nuscenes.py#L203-L208
and just using total_loss = loss_fn(seg_bev_e, seg_bev_g, valid_bev_g)?
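For concreteness, here is a minimal self-contained sketch of that simplification. The loss_fn below is a hypothetical masked-BCE stand-in for the repo's segmentation loss, and the tensors are dummies; the point is just that total_loss comes from this single term, with the auxiliary terms in the linked block left out:

```python
import torch
import torch.nn.functional as F

def loss_fn(seg_bev_e, seg_bev_g, valid_bev_g):
    # Hypothetical stand-in for the repo's segmentation loss: masked BCE on the
    # predicted BEV logits, averaged over valid cells only.
    bce = F.binary_cross_entropy_with_logits(seg_bev_e, seg_bev_g, reduction='none')
    return (bce * valid_bev_g).sum() / valid_bev_g.sum().clamp(min=1)

# Dummy tensors: batch of 4, single class, 200x200 BEV grid.
seg_bev_e = torch.randn(4, 1, 200, 200, requires_grad=True)   # predicted logits
seg_bev_g = torch.randint(0, 2, (4, 1, 200, 200)).float()     # ground-truth occupancy
valid_bev_g = torch.ones(4, 1, 200, 200)                      # validity mask

# Instead of summing several loss terms (the block linked above),
# keep only the main segmentation loss:
total_loss = loss_fn(seg_bev_e, seg_bev_g, valid_bev_g)
total_loss.backward()
```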
Hi Adam W., did you encounter this: single-card training is normal, but with multiple cards there are NaNs?
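For anyone chasing the same symptom, a minimal sketch (not from this thread) of how one might localize the first NaN, assuming a standard PyTorch training loop; the check_finite helper is a made-up name for illustration:

```python
import torch

# Ask autograd to report the operation that first produced a NaN/Inf during backward.
torch.autograd.set_detect_anomaly(True)

def check_finite(name, tensor):
    # Fail fast, naming the offending tensor, as soon as anything stops being finite.
    if not torch.isfinite(tensor).all():
        raise RuntimeError(f"{name} is not finite at this step")

# Inside the training loop one would call, e.g.:
#   check_finite("seg_bev_e", seg_bev_e)
#   check_finite("total_loss", total_loss)
```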