Questions about NaN - GitHub Issues

EcustBoy commented 1 year ago

Dear author:

You mentioned that direct end-to-end training may cause NaN errors. What are the possible causes of the NaN? (I actually tried end-to-end training with your original code on nuScenes before, and NaN did not appear.)

Then I tried training your code on my own dataset (I'm pretty sure my data format is exactly the same as the nuScenes dataloader format in your code, and there is no dirty data). I chose multi-stage training (perception -> prediction -> planning). The first two stages were normal, but in the last planning stage I hit a NaN error at iteration N. Gradient clipping was turned on, and there was no NaN in the forward loss value before the error appeared, so the NaN seems to arise in the backward pass of iteration N-1 rather than from an overflow in the forward loss. My guess is that it is caused by the limited precision of AMP training mode. Have you encountered the same weird error? Can you suggest how to fix it? Looking forward to discussing with you, many thanks!
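
To make the guess above concrete, here is a toy illustration (plain Python, not the ST-P3 code) of how a finite forward loss can still produce a NaN gradient once an intermediate value overflows float16. The helper `to_fp16` and the two-step computation are assumptions made up for this sketch; they only mimic half-precision rounding and are not how AMP is implemented.

```python
import math
import struct


def to_fp16(x: float) -> float:
    """Round x to the nearest float16 value; overflow saturates to +/-inf."""
    try:
        return struct.unpack("<e", struct.pack("<e", x))[0]
    except OverflowError:  # |x| is beyond the float16 maximum of 65504
        return math.copysign(math.inf, x)


# Forward pass of a toy computation: loss = exp(-(x * x) * w).
x, w = 300.0, 2.0
h = to_fp16(x * x)             # 90000 does not fit in float16, becomes inf
u = to_fp16(h * w)             # still inf
loss = to_fp16(math.exp(-u))   # exp(-inf) is 0.0, so the loss looks healthy

# Backward pass by the chain rule: dloss/dw = dloss/du * du/dw.
dloss_du = -loss               # derivative of exp(-u) is -exp(-u), here -0.0
du_dw = h                      # derivative of h * w with respect to w is h, here inf
grad_w = dloss_du * du_dw      # zero times inf is NaN under IEEE 754

print(loss, grad_w)
```

The loss stays finite because a later operation squashes the overflowed activation, while the gradient multiplies a zero by the stored inf. Gradient clipping cannot repair this, since a NaN gradient stays NaN after rescaling.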

BeautyCJ commented 1 year ago

I have the same problem when training the model on CARLA data. I chose multi-stage training, and the NaN error appears in the second stage (prediction) at iteration N.

EcustBoy commented 1 year ago

@BeautyCJ Hi, I added NaN detection code to find the cause, and found that some network layers inevitably output INF values during the forward pass. So in the end I used fp32 training in the planning stage and froze some modules pretrained in the prediction stage. This way the whole training can be completed.
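
For anyone who wants to try the same approach, a minimal PyTorch sketch of the detection and freezing steps might look like the following. It is not the code actually used in this thread: the helper names `add_nonfinite_hooks` and `freeze`, and the two-layer model with sub-modules named `prediction` and `planning`, are hypothetical stand-ins.

```python
import torch
import torch.nn as nn


def add_nonfinite_hooks(model: nn.Module, found: list) -> list:
    """Register forward hooks that record names of sub-modules emitting inf/NaN."""
    handles = []

    def make_hook(name):
        def hook(module, inputs, output):
            outs = output if isinstance(output, (tuple, list)) else (output,)
            for t in outs:
                if torch.is_tensor(t) and not torch.isfinite(t).all():
                    found.append(name)
                    break
        return hook

    for name, module in model.named_modules():
        if name:  # skip the root module, whose name is the empty string
            handles.append(module.register_forward_hook(make_hook(name)))
    return handles


def freeze(module: nn.Module) -> None:
    """Exclude a pretrained sub-module from gradient updates and put it in eval mode."""
    for p in module.parameters():
        p.requires_grad_(False)
    module.eval()


# Toy stand-in for a multi-stage model (hypothetical layer names).
model = nn.Sequential()
model.add_module("prediction", nn.Linear(4, 4))
model.add_module("planning", nn.Linear(4, 2))

freeze(model.prediction)  # keep the earlier stage fixed

found = []
handles = add_nonfinite_hooks(model, found)

x = torch.ones(1, 4)
x[0, 0] = float("inf")    # force a non-finite activation for the demo
model(x)                  # no autocast context here, so this runs in fp32

print(found)              # names of the layers whose output was not finite

for h in handles:         # remove the hooks once debugging is done
    h.remove()
```

The hooks only report where the first non-finite value shows up; whether to switch that stage to fp32 or freeze the offending module is still a judgment call.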

hli2020 commented 1 year ago

> @BeautyCJ Hi, I added NaN detection code to find the cause, and found that some network layers inevitably output INF values during the forward pass. So in the end I used fp32 training in the planning stage and froze some modules pretrained in the prediction stage. This way the whole training can be completed.

Thank you for the information. Hopefully it will shed some light on future attempts.

OpenDriveLab / ST-P3

questions about NaN #8