pretraining hangs on multiple GPUs

Hi! When I run the modified pretraining script on one GPU, training runs fine. However, when I set NUM_GPUS_PER_WORKERS=2, the script freezes after the second "--Start training loop--" message.

Script used for pretraining: a modified "ru-gpts/scripts/deepspeed_gpt3_large.sh"

Tail of the log:

[2022-01-04 17:51:52,650] [INFO] [config.py:751:print]   fp16_enabled ................. True
[2022-01-04 17:51:52,650] [INFO] [config.py:751:print]   global_rank .................. 0
[2022-01-04 17:51:52,650] [INFO] [config.py:751:print]   gradient_accumulation_steps .. 1
[2022-01-04 17:51:52,650] [INFO] [config.py:751:print]   gradient_clipping ............ 0.0
[2022-01-04 17:51:52,650] [INFO] [config.py:751:print]   gradient_predivide_factor .... 1.0
[2022-01-04 17:51:52,650] [INFO] [config.py:751:print]   initial_dynamic_scale ........ 4294967296
[2022-01-04 17:51:52,650] [INFO] [config.py:751:print]   loss_scale ................... 128
[2022-01-04 17:51:52,650] [INFO] [config.py:751:print]   memory_breakdown ............. False
[2022-01-04 17:51:52,650] [INFO] [config.py:751:print]   optimizer_legacy_fusion ...... False
[2022-01-04 17:51:52,650] [INFO] [config.py:751:print]   optimizer_name ............... None
[2022-01-04 17:51:52,650] [INFO] [config.py:751:print]   optimizer_params ............. None
[2022-01-04 17:51:52,650] [INFO] [config.py:751:print]   pipeline ..................... {'stages': 'auto', 'partition': 'best', 'seed_layers': False, 'activation_checkpoint_interval': 0}
[2022-01-04 17:51:52,650] [INFO] [config.py:751:print]   pld_enabled .................. False
[2022-01-04 17:51:52,651] [INFO] [config.py:751:print]   pld_params ................... False
[2022-01-04 17:51:52,651] [INFO] [config.py:751:print]   prescale_gradients ........... False
[2022-01-04 17:51:52,651] [INFO] [config.py:751:print]   scheduler_name ............... None
[2022-01-04 17:51:52,651] [INFO] [config.py:751:print]   scheduler_params ............. None
[2022-01-04 17:51:52,651] [INFO] [config.py:751:print]   sparse_attention ............. None
[2022-01-04 17:51:52,651] [INFO] [config.py:751:print]   sparse_gradients_enabled ..... False
[2022-01-04 17:51:52,651] [INFO] [config.py:751:print]   steps_per_print .............. 10
[2022-01-04 17:51:52,651] [INFO] [config.py:751:print]   tensorboard_enabled .......... False
[2022-01-04 17:51:52,651] [INFO] [config.py:751:print]   tensorboard_job_name ......... DeepSpeedJobName
[2022-01-04 17:51:52,651] [INFO] [config.py:751:print]   tensorboard_output_path ......
[2022-01-04 17:51:52,651] [INFO] [config.py:751:print]   train_batch_size ............. 1
[2022-01-04 17:51:52,651] [INFO] [config.py:751:print]   train_micro_batch_size_per_gpu  1
[2022-01-04 17:51:52,651] [INFO] [config.py:751:print]   wall_clock_breakdown ......... False
[2022-01-04 17:51:52,651] [INFO] [config.py:751:print]   world_size ................... 1
[2022-01-04 17:51:52,651] [INFO] [config.py:751:print]   zero_allow_untested_optimizer  False
[2022-01-04 17:51:52,651] [INFO] [config.py:751:print]   zero_config .................. {
    "stage": 0,
    "contiguous_gradients": false,
    "reduce_scatter": false,
    "reduce_bucket_size": 5.000000e+07,
    "allgather_partitions": true,
    "allgather_bucket_size": 5.000000e+08,
    "overlap_comm": false,
    "load_from_fp32_weights": true,
    "elastic_checkpoint": true,
    "offload_param": null,
    "offload_optimizer": null,
    "sub_group_size": 1.000000e+12,
    "prefetch_bucket_size": 5.000000e+07,
    "param_persistence_threshold": 1.000000e+05,
    "max_live_parameters": 1.000000e+09,
    "max_reuse_distance": 1.000000e+09,
    "gather_fp16_weights_on_model_save": false,
    "find_unused_parameters": false
}
[2022-01-04 17:51:52,651] [INFO] [config.py:751:print]   zero_enabled ................. False
[2022-01-04 17:51:52,651] [INFO] [config.py:751:print]   zero_optimization_stage ...... 0
[2022-01-04 17:51:52,651] [INFO] [config.py:758:print]   json = {
    "train_micro_batch_size_per_gpu": 1,
    "fp16": {
        "enabled": true,
        "loss_scale": 128,
        "loss_scale_window": 2.000000e+03,
        "min_loss_scale": 0.5
    },
    "zero_optimization": {
        "stage": 0,
        "reduce_bucket_size": 5.000000e+07
    }
}
Using /root/.cache/torch_extensions as PyTorch extensions root...
Emitting ninja build file /root/.cache/torch_extensions/utils/build.ninja...
Building extension module utils...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
/mnt/work/miniconda3/envs/rugpt/lib/python3.7/site-packages/torch/utils/cpp_extension.py:269: UserWarning:

                               !! WARNING !!

!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
Your compiler (c++) is not compatible with the compiler Pytorch was
built with for this platform, which is g++ on linux. Please
use g++ to to compile your extension. Alternatively, you may
compile PyTorch from source using c++, and then you can also use
c++ to compile your extension.

See https://github.com/pytorch/pytorch/blob/master/CONTRIBUTING.md for help
with compiling PyTorch from source.
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!

                              !! WARNING !!

  platform=sys.platform))
ninja: no work to do.
Loading extension module utils...
Time to load utils op: 0.292543888092041 seconds
Resume train set from iteration 0
--Start training loop--
[2022-01-04 17:51:53,528] [INFO] [checkpointing.py:400:forward] Activation Checkpointing Information
[2022-01-04 17:51:53,528] [INFO] [checkpointing.py:402:forward] ----Partition Activations False, CPU CHECKPOINTING False
[2022-01-04 17:51:53,528] [INFO] [checkpointing.py:405:forward] ----contiguous Memory Checkpointing False with 24 total layers
[2022-01-04 17:51:53,528] [INFO] [checkpointing.py:407:forward] ----Synchronization False
[2022-01-04 17:51:53,528] [INFO] [checkpointing.py:408:forward] ----Profiling False
Using /root/.cache/torch_extensions as PyTorch extensions root...
Emitting ninja build file /root/.cache/torch_extensions/utils/build.ninja...
Building extension module utils...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
/mnt/work/miniconda3/envs/rugpt/lib/python3.7/site-packages/torch/utils/cpp_extension.py:269: UserWarning:

                               !! WARNING !!

!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
Your compiler (c++) is not compatible with the compiler Pytorch was
built with for this platform, which is g++ on linux. Please
use g++ to to compile your extension. Alternatively, you may
compile PyTorch from source using c++, and then you can also use
c++ to compile your extension.

See https://github.com/pytorch/pytorch/blob/master/CONTRIBUTING.md for help
with compiling PyTorch from source.
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!

                              !! WARNING !!

  platform=sys.platform))
ninja: no work to do.
Loading extension module utils...
Time to load utils op: 0.25350284576416016 seconds
--Start training loop--

SYSTEM AND ENV SPECS:

CPU model: Intel(R) Xeon(R) Gold 6234 CPU @ 3.30GHz
OS: CentOS 8
nvcc: command not found :)
pytorch: 1.7.1+cu101
deepspeed: 0.3.16 (tried last)
apex: 0.1 (installed from GitHub)
transformers: 3.5.0

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

Where should I look, and where should debugging start?
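One possible starting point for narrowing this down, as a minimal sketch and not something taken from the ru-gpts scripts: make each process print its Python stack if it is still running after a timeout, and turn on NCCL's own logging before the process group is created. The helper names `arm_hang_watchdog`, `disarm_hang_watchdog`, and `enable_nccl_debug_logging` are hypothetical; only the standard-library `faulthandler` module and the `NCCL_DEBUG` environment variable are real. Whether the freeze is in a collective call at all is an assumption this would help confirm or rule out.

```python
import faulthandler
import os
import sys


def enable_nccl_debug_logging():
    """Ask NCCL to log its setup and collective calls.

    Must run before the distributed process group is initialized.
    setdefault keeps any value already exported in the launch script.
    """
    os.environ.setdefault("NCCL_DEBUG", "INFO")


def arm_hang_watchdog(timeout_s=300.0, repeat=True, file=sys.stderr):
    """Dump the stack of every Python thread to `file` once `timeout_s`
    seconds pass without the watchdog being disarmed.

    With repeat=True the dump is written again every `timeout_s` seconds,
    which shows whether the process stays stuck on the same line.
    `file` must be a real file object (it needs a file descriptor).
    """
    faulthandler.dump_traceback_later(timeout_s, repeat=repeat, file=file)


def disarm_hang_watchdog():
    """Cancel a pending dump, e.g. after a training step finished in time."""
    faulthandler.cancel_dump_traceback_later()
```

A hedged usage idea: call `enable_nccl_debug_logging()` at the very top of the training entry point, call `arm_hang_watchdog(timeout_s=120)` right before the training loop starts, and compare the stack dumps of the two ranks. If one rank is waiting inside a collective operation while the other is somewhere else, that difference is the place to look.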

ai-forever / ru-gpts

pretraining hangs on multiple GPUs #85