Hi!
When I run the modified pretraining script with one GPU, training works fine.
However, when I set NUM_GPUS_PER_WORKERS=2, the script freezes after the second "--Start training loop--" message.
script for pretraining:
modified "ru-gpts/scripts/deepspeed_gpt3_large.sh"
tail of the log:
[2022-01-04 17:51:52,650] [INFO] [config.py:751:print] fp16_enabled ................. True
[2022-01-04 17:51:52,650] [INFO] [config.py:751:print] global_rank .................. 0
[2022-01-04 17:51:52,650] [INFO] [config.py:751:print] gradient_accumulation_steps .. 1
[2022-01-04 17:51:52,650] [INFO] [config.py:751:print] gradient_clipping ............ 0.0
[2022-01-04 17:51:52,650] [INFO] [config.py:751:print] gradient_predivide_factor .... 1.0
[2022-01-04 17:51:52,650] [INFO] [config.py:751:print] initial_dynamic_scale ........ 4294967296
[2022-01-04 17:51:52,650] [INFO] [config.py:751:print] loss_scale ................... 128
[2022-01-04 17:51:52,650] [INFO] [config.py:751:print] memory_breakdown ............. False
[2022-01-04 17:51:52,650] [INFO] [config.py:751:print] optimizer_legacy_fusion ...... False
[2022-01-04 17:51:52,650] [INFO] [config.py:751:print] optimizer_name ............... None
[2022-01-04 17:51:52,650] [INFO] [config.py:751:print] optimizer_params ............. None
[2022-01-04 17:51:52,650] [INFO] [config.py:751:print] pipeline ..................... {'stages': 'auto', 'partition': 'best', 'seed_layers': False, 'activation_checkpoint_interval': 0}
[2022-01-04 17:51:52,650] [INFO] [config.py:751:print] pld_enabled .................. False
[2022-01-04 17:51:52,651] [INFO] [config.py:751:print] pld_params ................... False
[2022-01-04 17:51:52,651] [INFO] [config.py:751:print] prescale_gradients ........... False
[2022-01-04 17:51:52,651] [INFO] [config.py:751:print] scheduler_name ............... None
[2022-01-04 17:51:52,651] [INFO] [config.py:751:print] scheduler_params ............. None
[2022-01-04 17:51:52,651] [INFO] [config.py:751:print] sparse_attention ............. None
[2022-01-04 17:51:52,651] [INFO] [config.py:751:print] sparse_gradients_enabled ..... False
[2022-01-04 17:51:52,651] [INFO] [config.py:751:print] steps_per_print .............. 10
[2022-01-04 17:51:52,651] [INFO] [config.py:751:print] tensorboard_enabled .......... False
[2022-01-04 17:51:52,651] [INFO] [config.py:751:print] tensorboard_job_name ......... DeepSpeedJobName
[2022-01-04 17:51:52,651] [INFO] [config.py:751:print] tensorboard_output_path ......
[2022-01-04 17:51:52,651] [INFO] [config.py:751:print] train_batch_size ............. 1
[2022-01-04 17:51:52,651] [INFO] [config.py:751:print] train_micro_batch_size_per_gpu 1
[2022-01-04 17:51:52,651] [INFO] [config.py:751:print] wall_clock_breakdown ......... False
[2022-01-04 17:51:52,651] [INFO] [config.py:751:print] world_size ................... 1
[2022-01-04 17:51:52,651] [INFO] [config.py:751:print] zero_allow_untested_optimizer False
[2022-01-04 17:51:52,651] [INFO] [config.py:751:print] zero_config .................. {
"stage": 0,
"contiguous_gradients": false,
"reduce_scatter": false,
"reduce_bucket_size": 5.000000e+07,
"allgather_partitions": true,
"allgather_bucket_size": 5.000000e+08,
"overlap_comm": false,
"load_from_fp32_weights": true,
"elastic_checkpoint": true,
"offload_param": null,
"offload_optimizer": null,
"sub_group_size": 1.000000e+12,
"prefetch_bucket_size": 5.000000e+07,
"param_persistence_threshold": 1.000000e+05,
"max_live_parameters": 1.000000e+09,
"max_reuse_distance": 1.000000e+09,
"gather_fp16_weights_on_model_save": false,
"find_unused_parameters": false
}
[2022-01-04 17:51:52,651] [INFO] [config.py:751:print] zero_enabled ................. False
[2022-01-04 17:51:52,651] [INFO] [config.py:751:print] zero_optimization_stage ...... 0
[2022-01-04 17:51:52,651] [INFO] [config.py:758:print] json = {
"train_micro_batch_size_per_gpu": 1,
"fp16": {
"enabled": true,
"loss_scale": 128,
"loss_scale_window": 2.000000e+03,
"min_loss_scale": 0.5
},
"zero_optimization": {
"stage": 0,
"reduce_bucket_size": 5.000000e+07
}
}
Using /root/.cache/torch_extensions as PyTorch extensions root...
Emitting ninja build file /root/.cache/torch_extensions/utils/build.ninja...
Building extension module utils...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
/mnt/work/miniconda3/envs/rugpt/lib/python3.7/site-packages/torch/utils/cpp_extension.py:269: UserWarning:
!! WARNING !!
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
Your compiler (c++) is not compatible with the compiler Pytorch was
built with for this platform, which is g++ on linux. Please
use g++ to to compile your extension. Alternatively, you may
compile PyTorch from source using c++, and then you can also use
c++ to compile your extension.
See https://github.com/pytorch/pytorch/blob/master/CONTRIBUTING.md for help
with compiling PyTorch from source.
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
!! WARNING !!
platform=sys.platform))
ninja: no work to do.
Loading extension module utils...
Time to load utils op: 0.292543888092041 seconds
Resume train set from iteration 0
--Start training loop--
[2022-01-04 17:51:53,528] [INFO] [checkpointing.py:400:forward] Activation Checkpointing Information
[2022-01-04 17:51:53,528] [INFO] [checkpointing.py:402:forward] ----Partition Activations False, CPU CHECKPOINTING False
[2022-01-04 17:51:53,528] [INFO] [checkpointing.py:405:forward] ----contiguous Memory Checkpointing False with 24 total layers
[2022-01-04 17:51:53,528] [INFO] [checkpointing.py:407:forward] ----Synchronization False
[2022-01-04 17:51:53,528] [INFO] [checkpointing.py:408:forward] ----Profiling False
Using /root/.cache/torch_extensions as PyTorch extensions root...
Emitting ninja build file /root/.cache/torch_extensions/utils/build.ninja...
Building extension module utils...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
/mnt/work/miniconda3/envs/rugpt/lib/python3.7/site-packages/torch/utils/cpp_extension.py:269: UserWarning:
!! WARNING !!
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
Your compiler (c++) is not compatible with the compiler Pytorch was
built with for this platform, which is g++ on linux. Please
use g++ to to compile your extension. Alternatively, you may
compile PyTorch from source using c++, and then you can also use
c++ to compile your extension.
See https://github.com/pytorch/pytorch/blob/master/CONTRIBUTING.md for help
with compiling PyTorch from source.
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
!! WARNING !!
platform=sys.platform))
ninja: no work to do.
Loading extension module utils...
Time to load utils op: 0.25350284576416016 seconds
--Start training loop--
SYSTEM AND ENV SPECS:
CPU model: Intel(R) Xeon(R) Gold 6234 CPU @ 3.30GHz
OS: CentOS 8
nvcc: command not found :)
pytorch: 1.7.1+cu101
deepspeed: 0.3.16 (also tried the latest release)
apex: 0.1 (installed from GitHub)
transformers: 3.5.0
nvidia-smi:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 495.29.05 Driver Version: 495.29.05 CUDA Version: 11.5 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 Tesla V100S-PCI... Off | 00000000:18:00.0 Off | 0 |
| N/A 41C P0 28W / 250W | 4MiB / 32510MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 1 Tesla V100S-PCI... Off | 00000000:AF:00.0 Off | 0 |
| N/A 38C P0 25W / 250W | 4MiB / 32510MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
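For completeness, these are the commands I can run to double-check the environment and the GPU topology on request (assuming ds_report is available in this DeepSpeed version):

# Environment / topology checks (assumed useful, not from the original run)
python -c "import torch; print(torch.__version__, torch.version.cuda, torch.cuda.nccl.version())"
ds_report              # DeepSpeed's own environment report
nvidia-smi topo -m     # how the two V100S cards are connected (PIX/PHB/NODE/SYS)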
Where should I look, and how should I approach debugging this?
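If it helps, this is roughly what I can do to gather more information once it hangs (assuming the NCCL backend; py-spy and the NCCL variables below are my guesses at useful diagnostics, not something already in the log):

# Assumed diagnostics for the 2-GPU hang, not part of the original run
# 1) verbose NCCL logging to see which collective the ranks get stuck in
export NCCL_DEBUG=INFO
export NCCL_DEBUG_SUBSYS=INIT,COLL
bash ru-gpts/scripts/deepspeed_gpt3_large.sh 2>&1 | tee run_2gpu.log

# 2) once it hangs, dump the Python stacks of both worker processes
#    (pip install py-spy; PIDs taken from nvidia-smi or ps)
py-spy dump --pid <PID_RANK0>
py-spy dump --pid <PID_RANK1>

# 3) quick test whether peer-to-peer / shared-memory transport is the culprit
NCCL_P2P_DISABLE=1 NCCL_SHM_DISABLE=1 bash ru-gpts/scripts/deepspeed_gpt3_large.sh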