I am using ZeRO stage 2 for training. The process gets stuck while initializing optimizer states. Training runs fine when I use tensor parallelism (tp) instead.
Here is the package info:
```
python==3.10.13
deepspeed==0.12.6
transformers==4.30.2
```
Collie config:
```python
config = CollieConfig.from_pretrained("meta-llama/Llama-2-7b-hf")
config.dp_size = 1
config.pp_size = 1
config.tp_size = 1
config.train_epochs = args.train_epochs
config.train_micro_batch_size = 1
config.gradient_accumulation_steps = 2
config.eval_batch_size = 1
config.eval_per_n_epochs = 1
config.use_flash = False
config.ds_config = {
    "fp16": {
        "enabled": True
    },
    "monitor_config": {
        "enabled": True,
        "tag": f"{args.tag}_ep{args.train_epochs}",
        "wandb": {
            "enabled": False,
        }
    },
    "zero_optimization": {
        "stage": 2,
    },
}
```
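To check whether the hang comes from CoLLiE or from DeepSpeed itself, one option is to feed the same ZeRO-2 settings straight to `deepspeed.initialize` and see if it stalls at the same point. The sketch below is my assumption of a minimal repro, not the CoLLiE code path; the learning rate and launch command are placeholders.

```python
# Minimal repro sketch: bypass CoLLiE and drive DeepSpeed directly with the
# same ZeRO-2 config. If this also hangs right after the
# "Before initializing optimizer states" log line, the problem is in
# DeepSpeed/NCCL rather than in CoLLiE.
# Launch with the DeepSpeed launcher, e.g.: deepspeed --num_gpus=8 repro.py
import deepspeed
import torch
from transformers import AutoModelForCausalLM

ds_config = {
    "train_micro_batch_size_per_gpu": 1,
    "gradient_accumulation_steps": 2,
    "fp16": {"enabled": True},
    "zero_optimization": {"stage": 2},
}

model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)  # lr is a placeholder

# deepspeed.initialize wraps the client optimizer in the ZeRO stage-2
# optimizer; the hang reported above happens inside this call.
engine, optimizer, _, _ = deepspeed.initialize(
    model=model,
    optimizer=optimizer,
    config=ds_config,
)
```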
The process gets stuck in the following state:
```
[2024-01-13 09:24:02,401] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed Flops Profiler Enabled: False
[2024-01-13 09:24:02,401] [INFO] [logging.py:96:log_dist] [Rank 0] Using client Optimizer as basic optimizer
[2024-01-13 09:24:02,401] [INFO] [logging.py:96:log_dist] [Rank 0] Removing param_group that has no 'params' in the basic Optimizer
[2024-01-13 09:24:02,418] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed Basic Optimizer = AdamW
[2024-01-13 09:24:02,418] [INFO] [utils.py:56:is_zero_supported_optimizer] Checking ZeRO support for optimizer=AdamW type=<class 'torch.optim.adamw.AdamW'>
[2024-01-13 09:24:02,418] [INFO] [logging.py:96:log_dist] [Rank 0] Creating torch.float16 ZeRO stage 2 optimizer
[2024-01-13 09:24:02,418] [INFO] [stage_1_and_2.py:148:__init__] Reduce bucket size 500,000,000
[2024-01-13 09:24:02,418] [INFO] [stage_1_and_2.py:149:__init__] Allgather bucket size 500,000,000
[2024-01-13 09:24:02,419] [INFO] [stage_1_and_2.py:150:__init__] CPU Offload: False
[2024-01-13 09:24:02,419] [INFO] [stage_1_and_2.py:151:__init__] Round robin gradient partitioning: False
[2024-01-13 09:24:28,621] [INFO] [utils.py:791:see_memory_usage] Before initializing optimizer states
[2024-01-13 09:24:28,622] [INFO] [utils.py:792:see_memory_usage] MA 15.69 GB  Max_MA 17.26 GB  CA 17.26 GB  Max_CA 17 GB
[2024-01-13 09:24:28,622] [INFO] [utils.py:799:see_memory_usage] CPU Virtual Memory: used = 23.05 GB, percent = 3.1%
```
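If useful for triage: hangs at this point are often a collective call waiting on a peer rank, and two generic diagnostics (not specific to CoLLiE) can show where each rank is blocked. The script name, GPU count, and `<PID>` below are placeholders.

```
# Verbose NCCL logging (set before launching) shows whether a collective
# such as broadcast/all_reduce is waiting on a missing rank:
NCCL_DEBUG=INFO deepspeed --num_gpus=8 train.py

# While the job is hung, dump a rank's Python stack to find the exact
# blocked line (requires: pip install py-spy):
py-spy dump --pid <PID>
```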
Closing because I can't reproduce this.