(base) root@e9f21ccb6520:/workspace/apex/examples/simple/distributed# bash run.sh
Selected optimization level O1: Insert automatic casts around Pytorch functions and Tensor methods.
Defaults for this optimization level are:
enabled : True
opt_level : O1
cast_model_type : None
patch_torch_functions : True
keep_batchnorm_fp32 : None
master_weights : None
loss_scale : dynamic
Processing user overrides (additional kwargs that are not None)...
After processing overrides, optimization options are:
enabled : True
opt_level : O1
cast_model_type : None
patch_torch_functions : True
keep_batchnorm_fp32 : None
master_weights : None
loss_scale : dynamic
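For context, the banner above is printed by the amp.initialize call in the example. A minimal sketch of that call (the Linear model and SGD optimizer here are placeholders, not the example's actual code):

    import torch
    from apex import amp

    model = torch.nn.Linear(10, 10).cuda()
    optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)

    # O1 patches torch functions and Tensor methods to cast automatically;
    # this call prints the "Selected optimization level O1" banner above.
    model, optimizer = amp.initialize(model, optimizer, opt_level="O1")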
——————————————————————
The cursor just hangs there waiting for results indefinitely. But when I set --nproc_per_node=1 in run.sh and rerun it, it works fine.
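For reference, run.sh wraps torch.distributed.launch, which spawns one worker process per GPU. A sketch of the worker-side setup involved (the launch flags and script name below are illustrative, not the script's exact contents):

    # run.sh launches something like:
    #   python -m torch.distributed.launch --nproc_per_node=6 main.py
    import argparse
    import torch

    parser = argparse.ArgumentParser()
    parser.add_argument("--local_rank", type=int, default=0)  # injected by the launcher
    args = parser.parse_args()

    torch.cuda.set_device(args.local_rank)
    # With the nccl backend this call blocks until all --nproc_per_node
    # workers have joined the process group, which is where a hang like
    # the one above typically shows up.
    torch.distributed.init_process_group(backend="nccl", init_method="env://")

With --nproc_per_node=1 only a single worker starts, so the multi-process rendezvous never happens, which matches the single-process case working.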
My machine has 6 GPUs.
CUDA version: 9.0.176
PyTorch 1.1.0
Python 3.7.3
Do you see this issue only in the apex examples, or also with plain PyTorch code?
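A minimal plain-PyTorch check, independent of apex, could look like this (a sketch; the file name ddp_check.py is hypothetical, launched with python -m torch.distributed.launch --nproc_per_node=6 ddp_check.py):

    import argparse
    import torch
    import torch.distributed as dist
    from torch.nn.parallel import DistributedDataParallel

    parser = argparse.ArgumentParser()
    parser.add_argument("--local_rank", type=int, default=0)
    args = parser.parse_args()

    torch.cuda.set_device(args.local_rank)
    dist.init_process_group(backend="nccl", init_method="env://")

    # Tiny model wrapped in DDP; backward() exercises the gradient all-reduce.
    model = DistributedDataParallel(
        torch.nn.Linear(10, 10).cuda(), device_ids=[args.local_rank]
    )
    loss = model(torch.randn(4, 10).cuda()).sum()
    loss.backward()
    print("rank", dist.get_rank(), "finished all-reduce")

If this also hangs with more than one process, the problem is likely in the distributed setup (NCCL/CUDA) rather than in apex.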
I just reran our example and it's working fine on our system (8x P100).