Train fp16 interrupt

5Yesterday commented 4 years ago

Defaults for this optimization level are:
enabled                : True
opt_level              : O1
cast_model_type        : None
patch_torch_functions  : True
keep_batchnorm_fp32    : None
master_weights         : None
loss_scale             : dynamic
Processing user overrides (additional kwargs that are not None)...
After processing overrides, optimization options are:
enabled                : True
opt_level              : O1
cast_model_type        : None
patch_torch_functions  : True
keep_batchnorm_fp32    : None
master_weights         : None
loss_scale             : dynamic
Selected optimization level O1: Insert automatic casts around Pytorch functions and Tensor methods.

Defaults for this optimization level are:
enabled                : True
opt_level              : O1
cast_model_type        : None
patch_torch_functions  : True
keep_batchnorm_fp32    : None
master_weights         : None
loss_scale             : dynamic
Processing user overrides (additional kwargs that are not None)...
After processing overrides, optimization options are:
enabled                : True
opt_level              : O1
cast_model_type        : None
patch_torch_functions  : True
keep_batchnorm_fp32    : None
master_weights         : None
loss_scale             : dynamic
1500 265 1500 265
Every epoch need 188 iterations
Note that dataloader may hang with too much nworkers.
DLoss: 6.0000 Reg: 0.0000

Process finished with exit code 139 (interrupted by signal 11: SIGSEGV)

I installed apex and used the fp16 config, and this is the output I got.

5Yesterday commented 4 years ago

When I run it in a terminal, I get "Segmentation fault (core dumped)". Maybe I should narrow it down step by step?
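One way to narrow down a segfault like this is Python's standard `faulthandler` module, which dumps the Python traceback when the process receives a fatal signal such as SIGSEGV. A minimal sketch; the helper name `enable_crash_traceback` and the log file name are illustrative assumptions, not part of DG-Net or apex:

```python
import faulthandler


def enable_crash_traceback(log_path):
    """Dump the Python traceback to log_path if the process hits a fatal signal.

    Hypothetical helper for illustration. The returned file object must stay
    open for as long as the handler is enabled.
    """
    log_file = open(log_path, "w")
    # all_threads=True also dumps the stacks of worker threads.
    faulthandler.enable(file=log_file, all_threads=True)
    return log_file


if __name__ == "__main__":
    # Assumed file name; call this before model/apex setup in the training script.
    crash_log = enable_crash_traceback("crash_traceback.log")
    print("faulthandler enabled:", faulthandler.is_enabled())
```

Calling this at the top of the training script should show which Python line was executing when the crash happened. The same handler can also be enabled without code changes by starting the interpreter with `python -X faulthandler`.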

layumi commented 4 years ago

Hi @5Yesterday, please check your system CUDA version against the CUDA version your PyTorch build uses. Do they match? Also, please check your apex installation. Did you successfully compile apex with gcc 5+?
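The checks suggested above can be sketched as a small script. It compares `torch.version.cuda` with the release reported by `nvcc --version` and reads the gcc major version from `gcc -dumpversion`. The function names are hypothetical, and the script assumes `nvcc` and `gcc` on `PATH` are the ones used to build apex:

```python
import re
import shutil
import subprocess


def run_tool(cmd):
    """Return the tool's stdout, or None if the executable is not on PATH."""
    if shutil.which(cmd[0]) is None:
        return None
    return subprocess.run(cmd, capture_output=True, text=True).stdout


def parse_nvcc_release(text):
    """Extract (major, minor) from `nvcc --version` output, e.g. 'release 10.1'."""
    m = re.search(r"release\s+(\d+)\.(\d+)", text or "")
    return (int(m.group(1)), int(m.group(2))) if m else None


def parse_gcc_major(text):
    """Extract the major version from `gcc -dumpversion` output, e.g. '5.4.0'."""
    m = re.match(r"\s*(\d+)", text or "")
    return int(m.group(1)) if m else None


def cuda_versions_match(torch_cuda, nvcc_release):
    """Compare PyTorch's CUDA version string with the parsed nvcc release."""
    m = re.match(r"(\d+)\.(\d+)", torch_cuda or "")
    if m is None or nvcc_release is None:
        return False
    return (int(m.group(1)), int(m.group(2))) == nvcc_release


if __name__ == "__main__":
    try:
        import torch
        torch_cuda = torch.version.cuda  # None for CPU-only builds
    except ImportError:
        torch_cuda = None

    nvcc_release = parse_nvcc_release(run_tool(["nvcc", "--version"]))
    gcc_major = parse_gcc_major(run_tool(["gcc", "-dumpversion"]))

    print("PyTorch CUDA:", torch_cuda, "| nvcc release:", nvcc_release)
    print("CUDA versions match:", cuda_versions_match(torch_cuda, nvcc_release))
    # The "gcc 5+" threshold comes from the advice in this thread.
    print("gcc major:", gcc_major, "| gcc 5+:", gcc_major is not None and gcc_major >= 5)
```

If either check fails, rebuilding apex with a matching CUDA toolkit and a gcc 5+ compiler is the likely next step.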

5Yesterday commented 4 years ago

@layumi Thanks, it works now. Training already ran fine without fp16, so it was the gcc problem; compiling with gcc-5 works.

NVlabs / DG-Net

Train fp16 interrupt #61