Open SeekPoint opened 1 year ago
I got GPU OOM
(gh_llama-deepspeed) amd00@asus00:~/llm_dev/llama-deepspeed$
(gh_llama-deepspeed) amd00@asus00:~/llm_dev/llama-deepspeed$ deepspeed --include localhost:0 --master_port 22384 train.py --output_dir out_dir --init_ckpt llama-7b-init-ckpt/ --data_path ./data/alpaca_data_sample_oneline_format.json --max_seq_len 8 --train_steps 1000 --eval_steps 10 --save_steps 200 --log_steps 1 --pipe_parallel_size 1 --model_parallel_size 1 --use_flash_attn false --deepspeed_config ./configs/ds_config_zero1.json
[2023-05-31 17:15:04,883] [WARNING] [runner.py:191:fetch_hostfile] Unable to find hostfile, will proceed with training with local resources only.
[2023-05-31 17:15:04,892] [INFO] [runner.py:541:main] cmd = /usr/bin/python3 -u -m deepspeed.launcher.launch --world_info=eyJsb2NhbGhvc3QiOiBbMF19 --master_addr=127.0.0.1 --master_port=22384 --enable_each_rank_log=None train.py --output_dir out_dir --init_ckpt llama-7b-init-ckpt/ --data_path ./data/alpaca_data_sample_oneline_format.json --max_seq_len 8 --train_steps 1000 --eval_steps 10 --save_steps 200 --log_steps 1 --pipe_parallel_size 1 --model_parallel_size 1 --use_flash_attn false --deepspeed_config ./configs/ds_config_zero1.json
[2023-05-31 17:15:06,134] [INFO] [launch.py:229:main] WORLD INFO DICT: {'localhost': [0]}
[2023-05-31 17:15:06,134] [INFO] [launch.py:235:main] nnodes=1, num_local_procs=1, node_rank=0
[2023-05-31 17:15:06,134] [INFO] [launch.py:246:main] global_rank_mapping=defaultdict(<class 'list'>, {'localhost': [0]})
[2023-05-31 17:15:06,134] [INFO] [launch.py:247:main] dist_world_size=1
[2023-05-31 17:15:06,134] [INFO] [launch.py:249:main] Setting CUDA_VISIBLE_DEVICES=0
[2023-05-31 17:15:07,635] [INFO] [comm.py:622:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
100%|██████████| 1/1 [00:00<00:00, 3358.13it/s]
total samples num: 50
Traceback (most recent call last):
  File "train.py", line 130, in <module>
    main()
  File "train.py", line 99, in main
    model = get_model(model_config, ds_args, activation_checkpointing_config)
  File "/home/amd00/llm_dev/llama-deepspeed/models/llama_pipeline_model.py", line 167, in get_model
    print("pp is %d, mp is %d, world_size is:", pp, mp, args.world_size)
UnboundLocalError: local variable 'pp' referenced before assignment
[2023-05-31 17:15:08,142] [INFO] [launch.py:428:sigkill_handler] Killing subprocess 26374
[2023-05-31 17:15:08,143] [ERROR] [launch.py:434:sigkill_handler] ['/usr/bin/python3', '-u', 'train.py', '--local_rank=0', '--output_dir', 'out_dir', '--init_ckpt', 'llama-7b-init-ckpt/', '--data_path', './data/alpaca_data_sample_oneline_format.json', '--max_seq_len', '8', '--train_steps', '1000', '--eval_steps', '10', '--save_steps', '200', '--log_steps', '1', '--pipe_parallel_size', '1', '--model_parallel_size', '1', '--use_flash_attn', 'false', '--deepspeed_config', './configs/ds_config_zero1.json'] exits with return code = 1
(gh_llama-deepspeed) amd00@asus00:~/llm_dev/llama-deepspeed$ vim train.py
(gh_llama-deepspeed) amd00@asus00:~/llm_dev/llama-deepspeed$ vim models/llama_pipeline_model.py
(gh_llama-deepspeed) amd00@asus00:~/llm_dev/llama-deepspeed$ deepspeed --include localhost:0 --master_port 22384 train.py --output_dir out_dir --init_ckpt llama-7b-init-ckpt/ --data_path ./data/alpaca_data_sample_oneline_format.json --max_seq_len 8 --train_steps 1000 --eval_steps 10 --save_steps 200 --log_steps 1 --pipe_parallel_size 1 --model_parallel_size 1 --use_flash_attn false --deepspeed_config ./configs/ds_config_zero1.json
[2023-05-31 17:16:32,333] [WARNING] [runner.py:191:fetch_hostfile] Unable to find hostfile, will proceed with training with local resources only.
[2023-05-31 17:16:32,342] [INFO] [runner.py:541:main] cmd = /usr/bin/python3 -u -m deepspeed.launcher.launch --world_info=eyJsb2NhbGhvc3QiOiBbMF19 --master_addr=127.0.0.1 --master_port=22384 --enable_each_rank_log=None train.py --output_dir out_dir --init_ckpt llama-7b-init-ckpt/ --data_path ./data/alpaca_data_sample_oneline_format.json --max_seq_len 8 --train_steps 1000 --eval_steps 10 --save_steps 200 --log_steps 1 --pipe_parallel_size 1 --model_parallel_size 1 --use_flash_attn false --deepspeed_config ./configs/ds_config_zero1.json
[2023-05-31 17:16:33,582] [INFO] [launch.py:229:main] WORLD INFO DICT: {'localhost': [0]}
[2023-05-31 17:16:33,582] [INFO] [launch.py:235:main] nnodes=1, num_local_procs=1, node_rank=0
[2023-05-31 17:16:33,582] [INFO] [launch.py:246:main] global_rank_mapping=defaultdict(<class 'list'>, {'localhost': [0]})
[2023-05-31 17:16:33,582] [INFO] [launch.py:247:main] dist_world_size=1
[2023-05-31 17:16:33,582] [INFO] [launch.py:249:main] Setting CUDA_VISIBLE_DEVICES=0
[2023-05-31 17:16:35,093] [INFO] [comm.py:622:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
100%|██████████| 1/1 [00:00<00:00, 3368.92it/s]
total samples num: 50
pp is %d, mp is %d, world_size is: 1 1 1
SEED_LAYERS=False BASE_SEED=42 SEED_FN=None
Using topology: {ProcessCoord(pipe=0, data=0, model=0): 0}
[2023-05-31 17:16:35,204] [INFO] [module.py:358:_partition_layers] Partitioning pipeline stages with method parameters
stage=0 layers=35
0: EmbeddingPipe
1: ParallelTransformerLayerPipe
2: ParallelTransformerLayerPipe
3: ParallelTransformerLayerPipe
4: ParallelTransformerLayerPipe
5: ParallelTransformerLayerPipe
6: ParallelTransformerLayerPipe
7: ParallelTransformerLayerPipe
8: ParallelTransformerLayerPipe
9: ParallelTransformerLayerPipe
10: ParallelTransformerLayerPipe
11: ParallelTransformerLayerPipe
12: ParallelTransformerLayerPipe
13: ParallelTransformerLayerPipe
14: ParallelTransformerLayerPipe
15: ParallelTransformerLayerPipe
16: ParallelTransformerLayerPipe
17: ParallelTransformerLayerPipe
18: ParallelTransformerLayerPipe
19: ParallelTransformerLayerPipe
20: ParallelTransformerLayerPipe
21: ParallelTransformerLayerPipe
22: ParallelTransformerLayerPipe
23: ParallelTransformerLayerPipe
24: ParallelTransformerLayerPipe
25: ParallelTransformerLayerPipe
26: ParallelTransformerLayerPipe
27: ParallelTransformerLayerPipe
28: ParallelTransformerLayerPipe
29: ParallelTransformerLayerPipe
30: ParallelTransformerLayerPipe
31: ParallelTransformerLayerPipe
32: ParallelTransformerLayerPipe
33: LayerNormPipe
34: LMLayerPipe
loss: loss_fn
Traceback (most recent call last):
  File "train.py", line 130, in <module>
    main()
  File "train.py", line 99, in main
    model = get_model(model_config, ds_args, activation_checkpointing_config)
  File "/home/amd00/llm_dev/llama-deepspeed/models/llama_pipeline_model.py", line 182, in get_model
    return GPT2ModelPipe(model_config,
  File "/home/amd00/llm_dev/llama-deepspeed/models/llama_pipeline_model.py", line 157, in __init__
    super().__init__(
  File "/home/amd00/.local/lib/python3.8/site-packages/deepspeed/runtime/pipe/module.py", line 200, in __init__
    self.to(get_accelerator().device_name(self.local_rank))
  File "/home/amd00/.local/lib/python3.8/site-packages/torch/nn/modules/module.py", line 987, in to
    return self._apply(convert)
  File "/home/amd00/.local/lib/python3.8/site-packages/torch/nn/modules/module.py", line 639, in _apply
    module._apply(fn)
  File "/home/amd00/.local/lib/python3.8/site-packages/torch/nn/modules/module.py", line 639, in _apply
    module._apply(fn)
  File "/home/amd00/.local/lib/python3.8/site-packages/torch/nn/modules/module.py", line 639, in _apply
    module._apply(fn)
  File "/home/amd00/.local/lib/python3.8/site-packages/torch/nn/modules/module.py", line 662, in _apply
    param_applied = fn(param)
  File "/home/amd00/.local/lib/python3.8/site-packages/torch/nn/modules/module.py", line 985, in convert
    return t.to(device, dtype if t.is_floating_point() or t.is_complex() else None, non_blocking)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 172.00 MiB (GPU 0; 23.70 GiB total capacity; 22.83 GiB already allocated; 97.88 MiB free; 22.83 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
[2023-05-31 17:17:30,649] [INFO] [launch.py:428:sigkill_handler] Killing subprocess 26532
[2023-05-31 17:17:30,650] [ERROR] [launch.py:434:sigkill_handler] ['/usr/bin/python3', '-u', 'train.py', '--local_rank=0', '--output_dir', 'out_dir', '--init_ckpt', 'llama-7b-init-ckpt/', '--data_path', './data/alpaca_data_sample_oneline_format.json', '--max_seq_len', '8', '--train_steps', '1000', '--eval_steps', '10', '--save_steps', '200', '--log_steps', '1', '--pipe_parallel_size', '1', '--model_parallel_size', '1', '--use_flash_attn', 'false', '--deepspeed_config', './configs/ds_config_zero1.json'] exits with return code = 1
(gh_llama-deepspeed) amd00@asus00:~/llm_dev/llama-deepspeed$
@SeekPoint Hi, I am fine-tuning the model on a 3090 card and also hit the "CUDA out of memory" problem. Have you solved it?
not yet
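For anyone hitting the same error, a rough back-of-the-envelope estimate may help explain it. The traceback shows the failure inside `self.to(...)`, i.e. while the weights are still being moved to the GPU, with 22.83 GiB already allocated out of 23.70 GiB. The sketch below assumes about 6.74e9 parameters for LLaMA-7B and the usual mixed-precision Adam accounting (2 bytes of fp16 weights, 2 bytes of fp16 gradients, 12 bytes of fp32 optimizer state per parameter). Both are assumptions, not values read from this repo, and the function names are made up for illustration.

```python
GIB = 1024 ** 3
N_PARAMS = 6.74e9          # assumed approximate parameter count for LLaMA-7B
GPU_CAPACITY_GIB = 23.70   # "total capacity" reported in the error message


def weights_gib(n_params, bytes_per_param):
    """Memory needed just to hold the weights at the given precision."""
    return n_params * bytes_per_param / GIB


def zero1_per_gpu_gib(n_params, dp_world_size):
    """Rough per-GPU footprint with ZeRO stage 1 and mixed-precision Adam.

    fp16 weights (2 bytes) and fp16 gradients (2 bytes) stay replicated;
    only the 12 bytes of fp32 optimizer state are split across
    data-parallel ranks. Activations are ignored.
    """
    return n_params * (2 + 2 + 12 / dp_world_size) / GIB


estimates = {
    "fp32 weights only": weights_gib(N_PARAMS, 4),
    "fp16 weights only": weights_gib(N_PARAMS, 2),
    "ZeRO-1, 1 GPU": zero1_per_gpu_gib(N_PARAMS, 1),
    "ZeRO-1, 8 GPUs": zero1_per_gpu_gib(N_PARAMS, 8),
}

for name, size in estimates.items():
    verdict = "fits" if size < GPU_CAPACITY_GIB else "does not fit"
    print(f"{name:>18}: {size:6.1f} GiB ({verdict} in {GPU_CAPACITY_GIB} GiB)")
```

Under these assumptions, fp32 weights alone already exceed a 24 GB card, which would be consistent with the OOM during `self.to(...)`. With `dist_world_size=1`, ZeRO stage 1 has nothing to shard across, so the optimizer state would not fit on a single GPU either. The error message also reports reserved memory equal to allocated memory, so `max_split_size_mb` is unlikely to help here.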
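A side note on the debug print in the log: the second run prints `pp is %d, mp is %d, world_size is: 1 1 1`, because `print()` does not apply `%` formatting to extra positional arguments; it just prints them separated by spaces. A minimal sketch of the formatted version, with a hypothetical helper name and the values taken from the log:

```python
def describe_topology(pp, mp, world_size):
    """Format the parallelism settings into one readable line."""
    # The % operator takes a tuple of values; passing them as separate
    # arguments to print() leaves the %d placeholders untouched.
    return "pp is %d, mp is %d, world_size is %d" % (pp, mp, world_size)


# Values from the log: pipe_parallel_size=1, model_parallel_size=1, world size 1.
print(describe_topology(1, 1, 1))
```

The earlier `UnboundLocalError` only says that `pp` had not been assigned yet when the print ran inside `get_model`; the print has to come after the lines that set `pp` and `mp`.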