Full parameter fine-tuning bugs

I got this bug. bash run.sh [2024-06-20 23:47:57,121] [INFO] [real_accelerator.py:133:get_accelerator] Setting ds_accelerator to cuda (auto detect) [2024-06-20 23:48:00,497] [WARNING] [runner.py:196:fetch_hostfile] Unable to find hostfile, will proceed with training with local resources only. Detected CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7: setting --include=localhost:0,1,2,3,4,5,6,7 [2024-06-20 23:48:00,497] [INFO] [runner.py:555:main] cmd = usr/anaconda3/envs/lmflow/bin/python -u -m deepspeed.launcher.launch --world_info=eyJsb2NhbGhvc3QiOiBbMCwgMSwgMiwgMywgNCwgNSwgNiwgN119 --master_addr=127.0.0.1 --master_port=11000 --enable_each_rank_log=None usr/LMFlow/examples/finetune.py --model_name_or_path usr/huggingface/hub/LLM-Research/Meta-Llama-3-70B-Instruct --trust_remote_code True --dataset_path usr/LMFlow/data/mbpp/train_conversation --output_dir output_models/mbpp_full --overwrite_output_dir --conversation_template llama3 --num_train_epochs 3 --learning_rate 2e-5 --disable_group_texts 1 --block_size 1024 --per_device_train_batch_size 1 --deepspeed configs/ds_config_zero3.json --bf16 --run_name mbpp_full --validation_split_percentage 0 --logging_steps 20 --do_train --ddp_timeout 72000 --save_steps 5000 --dataloader_num_workers 32 [2024-06-20 23:48:01,792] [INFO] [real_accelerator.py:133:get_accelerator] Setting ds_accelerator to cuda (auto detect) [2024-06-20 23:48:03,210] [INFO] [launch.py:138:main] 0 NCCL_SOCKET_IFNAME=eth2 [2024-06-20 23:48:03,210] [INFO] [launch.py:145:main] WORLD INFO DICT: {'localhost': [0, 1, 2, 3, 4, 5, 6, 7]} [2024-06-20 23:48:03,210] [INFO] [launch.py:151:main] nnodes=1, num_local_procs=8, node_rank=0 [2024-06-20 23:48:03,210] [INFO] [launch.py:162:main] global_rank_mapping=defaultdict(<class 'list'>, {'localhost': [0, 1, 2, 3, 4, 5, 6, 7]}) [2024-06-20 23:48:03,210] [INFO] [launch.py:163:main] dist_world_size=8 [2024-06-20 23:48:03,210] [INFO] [launch.py:165:main] Setting CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 [2024-06-20 23:48:09,949] [INFO] [real_accelerator.py:133:get_accelerator] Setting ds_accelerator to cuda (auto detect) [2024-06-20 23:48:09,954] [INFO] [real_accelerator.py:133:get_accelerator] Setting ds_accelerator to cuda (auto detect) [2024-06-20 23:48:09,954] [INFO] [real_accelerator.py:133:get_accelerator] Setting ds_accelerator to cuda (auto detect) [2024-06-20 23:48:09,959] [INFO] [real_accelerator.py:133:get_accelerator] Setting ds_accelerator to cuda (auto detect) [2024-06-20 23:48:09,959] [INFO] [real_accelerator.py:133:get_accelerator] Setting ds_accelerator to cuda (auto detect) [2024-06-20 23:48:09,959] [INFO] [real_accelerator.py:133:get_accelerator] Setting ds_accelerator to cuda (auto detect) [2024-06-20 23:48:09,959] [INFO] [real_accelerator.py:133:get_accelerator] Setting ds_accelerator to cuda (auto detect) [2024-06-20 23:48:09,960] [INFO] [real_accelerator.py:133:get_accelerator] Setting ds_accelerator to cuda (auto detect) usr/anaconda3/envs/lmflow/lib/python3.9/site-packages/transformers/deepspeed.py:23: FutureWarning: transformers.deepspeed module is deprecated and will be removed in a future version. Please import deepspeed modules directly from transformers.integrations warnings.warn( usr/anaconda3/envs/lmflow/lib/python3.9/site-packages/transformers/deepspeed.py:23: FutureWarning: transformers.deepspeed module is deprecated and will be removed in a future version. Please import deepspeed modules directly from transformers.integrations warnings.warn( usr/anaconda3/envs/lmflow/lib/python3.9/site-packages/transformers/deepspeed.py:23: FutureWarning: transformers.deepspeed module is deprecated and will be removed in a future version. Please import deepspeed modules directly from transformers.integrations warnings.warn( usr/anaconda3/envs/lmflow/lib/python3.9/site-packages/transformers/deepspeed.py:23: FutureWarning: transformers.deepspeed module is deprecated and will be removed in a future version. Please import deepspeed modules directly from transformers.integrations warnings.warn( usr/anaconda3/envs/lmflow/lib/python3.9/site-packages/transformers/deepspeed.py:23: FutureWarning: transformers.deepspeed module is deprecated and will be removed in a future version. Please import deepspeed modules directly from transformers.integrations warnings.warn( usr/anaconda3/envs/lmflow/lib/python3.9/site-packages/transformers/deepspeed.py:23: FutureWarning: transformers.deepspeed module is deprecated and will be removed in a future version. Please import deepspeed modules directly from transformers.integrations warnings.warn( usr/anaconda3/envs/lmflow/lib/python3.9/site-packages/transformers/deepspeed.py:23: FutureWarning: transformers.deepspeed module is deprecated and will be removed in a future version. Please import deepspeed modules directly from transformers.integrations warnings.warn( usr/anaconda3/envs/lmflow/lib/python3.9/site-packages/transformers/deepspeed.py:23: FutureWarning: transformers.deepspeed module is deprecated and will be removed in a future version. Please import deepspeed modules directly from transformers.integrations warnings.warn( [2024-06-20 23:48:17,325] [WARNING] [comm.py:152:init_deepspeed_backend] NCCL backend in DeepSpeed not yet implemented [2024-06-20 23:48:17,325] [INFO] [comm.py:616:init_distributed] cdb=None [2024-06-20 23:48:17,325] [WARNING] [comm.py:152:init_deepspeed_backend] NCCL backend in DeepSpeed not yet implemented [2024-06-20 23:48:17,325] [INFO] [comm.py:616:init_distributed] cdb=None [2024-06-20 23:48:17,325] [WARNING] [comm.py:152:init_deepspeed_backend] NCCL backend in DeepSpeed not yet implemented [2024-06-20 23:48:17,326] [INFO] [comm.py:616:init_distributed] cdb=None [2024-06-20 23:48:17,326] [WARNING] [comm.py:152:init_deepspeed_backend] NCCL backend in DeepSpeed not yet implemented [2024-06-20 23:48:17,326] [INFO] [comm.py:616:init_distributed] cdb=None [2024-06-20 23:48:17,326] [WARNING] [comm.py:152:init_deepspeed_backend] NCCL backend in DeepSpeed not yet implemented [2024-06-20 23:48:17,326] [WARNING] [comm.py:152:init_deepspeed_backend] NCCL backend in DeepSpeed not yet implemented [2024-06-20 23:48:17,326] [INFO] [comm.py:616:init_distributed] cdb=None [2024-06-20 23:48:17,326] [INFO] [comm.py:616:init_distributed] cdb=None [2024-06-20 23:48:17,326] [WARNING] [comm.py:152:init_deepspeed_backend] NCCL backend in DeepSpeed not yet implemented [2024-06-20 23:48:17,326] [WARNING] [comm.py:152:init_deepspeed_backend] NCCL backend in DeepSpeed not yet implemented [2024-06-20 23:48:17,326] [INFO] [comm.py:616:init_distributed] cdb=None [2024-06-20 23:48:17,326] [INFO] [comm.py:616:init_distributed] cdb=None [W CUDAAllocatorConfig.h:28] Warning: expandable_segments not supported on this platform (function operator()) [2024-06-20 23:48:17,326] [INFO] [comm.py:643:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl [W CUDAAllocatorConfig.h:28] Warning: expandable_segments not supported on this platform (function operator()) [W CUDAAllocatorConfig.h:28] Warning: expandable_segments not supported on this platform (function operator()) [W CUDAAllocatorConfig.h:28] Warning: expandable_segments not supported on this platform (function operator()) [W CUDAAllocatorConfig.h:28] Warning: expandable_segments not supported on this platform (function operator()) [W CUDAAllocatorConfig.h:28] Warning: expandable_segments not supported on this platform (function operator()) [W CUDAAllocatorConfig.h:28] Warning: expandable_segments not supported on this platform (function operator()) [W CUDAAllocatorConfig.h:28] Warning: expandable_segments not supported on this platform (function operator()) 06/20/2024 23:48:17 - WARNING - lmflow.pipeline.finetuner - Process rank: 7, device: cuda:7, n_gpu: 1,distributed training: True, 16-bits training: False 06/20/2024 23:48:17 - WARNING - lmflow.pipeline.finetuner - Process rank: 1, device: cuda:1, n_gpu: 1,distributed training: True, 16-bits training: False 06/20/2024 23:48:17 - WARNING - lmflow.pipeline.finetuner - Process rank: 3, device: cuda:3, n_gpu: 1,distributed training: True, 16-bits training: False 06/20/2024 23:48:17 - WARNING - lmflow.pipeline.finetuner - Process rank: 5, device: cuda:5, n_gpu: 1,distributed training: True, 16-bits training: False 06/20/2024 23:48:17 - WARNING - lmflow.pipeline.finetuner - Process rank: 2, device: cuda:2, n_gpu: 1,distributed training: True, 16-bits training: False 06/20/2024 23:48:17 - WARNING - lmflow.pipeline.finetuner - Process rank: 4, device: cuda:4, n_gpu: 1,distributed training: True, 16-bits training: False 06/20/2024 23:48:17 - WARNING - lmflow.pipeline.finetuner - Process rank: 0, device: cuda:0, n_gpu: 1,distributed training: True, 16-bits training: False 06/20/2024 23:48:17 - WARNING - lmflow.pipeline.finetuner - Process rank: 6, device: cuda:6, n_gpu: 1,distributed training: True, 16-bits training: False usr/anaconda3/envs/lmflow/lib/python3.9/site-packages/datasets/load.py:2089: FutureWarning: 'use_auth_token' was deprecated in favor of 'token' in version 2.14.0 and will be removed in 3.0.0. You can remove this warning by passing 'token=None' instead. warnings.warn( usr/anaconda3/envs/lmflow/lib/python3.9/site-packages/datasets/load.py:2089: FutureWarning: 'use_auth_token' was deprecated in favor of 'token' in version 2.14.0 and will be removed in 3.0.0. You can remove this warning by passing 'token=None' instead. warnings.warn( usr/anaconda3/envs/lmflow/lib/python3.9/site-packages/datasets/load.py:2089: FutureWarning: 'use_auth_token' was deprecated in favor of 'token' in version 2.14.0 and will be removed in 3.0.0. You can remove this warning by passing 'token=None' instead. warnings.warn( usr/anaconda3/envs/lmflow/lib/python3.9/site-packages/datasets/load.py:2089: FutureWarning: 'use_auth_token' was deprecated in favor of 'token' in version 2.14.0 and will be removed in 3.0.0. You can remove this warning by passing 'token=None' instead. warnings.warn( usr/anaconda3/envs/lmflow/lib/python3.9/site-packages/datasets/load.py:2089: FutureWarning: 'use_auth_token' was deprecated in favor of 'token' in version 2.14.0 and will be removed in 3.0.0. You can remove this warning by passing 'token=None' instead. warnings.warn( usr/anaconda3/envs/lmflow/lib/python3.9/site-packages/datasets/load.py:2089: FutureWarning: 'use_auth_token' was deprecated in favor of 'token' in version 2.14.0 and will be removed in 3.0.0. You can remove this warning by passing 'token=None' instead. warnings.warn( usr/anaconda3/envs/lmflow/lib/python3.9/site-packages/datasets/load.py:2089: FutureWarning: 'use_auth_token' was deprecated in favor of 'token' in version 2.14.0 and will be removed in 3.0.0. You can remove this warning by passing 'token=None' instead. warnings.warn( usr/anaconda3/envs/lmflow/lib/python3.9/site-packages/datasets/load.py:2089: FutureWarning: 'use_auth_token' was deprecated in favor of 'token' in version 2.14.0 and will be removed in 3.0.0. You can remove this warning by passing 'token=None' instead. warnings.warn( [WARNING|logging.py:329] 2024-06-20 23:48:18,533 >> You are using the default legacy behaviour of the <class 'transformers.models.llama.tokenization_llama_fast.LlamaTokenizerFast'>. This is expected, and simply means that thelegacy(previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, setlegacy=False. This should only be set if you understand what it means, and thoroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565 [WARNING|logging.py:329] 2024-06-20 23:48:18,533 >> You are using the default legacy behaviour of the <class 'transformers.models.llama.tokenization_llama_fast.LlamaTokenizerFast'>. This is expected, and simply means that thelegacy(previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, setlegacy=False. This should only be set if you understand what it means, and thoroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565 [WARNING|logging.py:329] 2024-06-20 23:48:18,535 >> You are using the default legacy behaviour of the <class 'transformers.models.llama.tokenization_llama_fast.LlamaTokenizerFast'>. This is expected, and simply means that thelegacy(previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, setlegacy=False. This should only be set if you understand what it means, and thoroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565 [WARNING|logging.py:329] 2024-06-20 23:48:18,535 >> You are using the default legacy behaviour of the <class 'transformers.models.llama.tokenization_llama_fast.LlamaTokenizerFast'>. This is expected, and simply means that thelegacy(previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, setlegacy=False. This should only be set if you understand what it means, and thoroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565 [WARNING|logging.py:329] 2024-06-20 23:48:18,537 >> You are using the default legacy behaviour of the <class 'transformers.models.llama.tokenization_llama_fast.LlamaTokenizerFast'>. This is expected, and simply means that thelegacy(previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, setlegacy=False. This should only be set if you understand what it means, and thoroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565 [WARNING|logging.py:329] 2024-06-20 23:48:18,540 >> You are using the default legacy behaviour of the <class 'transformers.models.llama.tokenization_llama_fast.LlamaTokenizerFast'>. This is expected, and simply means that thelegacy(previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, setlegacy=False. This should only be set if you understand what it means, and thoroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565 [WARNING|logging.py:329] 2024-06-20 23:48:18,544 >> You are using the default legacy behaviour of the <class 'transformers.models.llama.tokenization_llama_fast.LlamaTokenizerFast'>. This is expected, and simply means that thelegacy(previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, setlegacy=False. This should only be set if you understand what it means, and thoroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565 [WARNING|logging.py:329] 2024-06-20 23:48:18,588 >> You are using the default legacy behaviour of the <class 'transformers.models.llama.tokenization_llama_fast.LlamaTokenizerFast'>. This is expected, and simply means that thelegacy(previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, setlegacy=False`. This should only be set if you understand what it means, and thoroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565 [WARNING|logging.py:314] 2024-06-20 23:48:18,808 >> Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained. [WARNING|logging.py:314] 2024-06-20 23:48:18,811 >> Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained. [WARNING|logging.py:314] 2024-06-20 23:48:18,818 >> Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained. [WARNING|logging.py:314] 2024-06-20 23:48:18,819 >> Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained. [WARNING|logging.py:314] 2024-06-20 23:48:18,825 >> Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained. [WARNING|logging.py:314] 2024-06-20 23:48:18,833 >> Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained. [WARNING|logging.py:314] 2024-06-20 23:48:18,833 >> Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained. [WARNING|logging.py:314] 2024-06-20 23:48:18,860 >> Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained. [2024-06-20 23:48:22,245] [INFO] [partition_parameters.py:326:exit] finished initializing model with 70.55B parameters Loading checkpoint shards: 100%|██████████████████████████████████████████████| 30/30 [03:46<00:00, 7.54s/it] Loading checkpoint shards: 100%|██████████████████████████████████████████████| 30/30 [03:46<00:00, 7.54s/it] Loading checkpoint shards: 100%|██████████████████████████████████████████████| 30/30 [03:46<00:00, 7.54s/it] Loading checkpoint shards: 100%|██████████████████████████████████████████████| 30/30 [03:46<00:00, 7.54s/it] Loading checkpoint shards: 100%|██████████████████████████████████████████████| 30/30 [03:46<00:00, 7.54s/it] Loading checkpoint shards: 100%|██████████████████████████████████████████████| 30/30 [03:46<00:00, 7.54s/it] Loading checkpoint shards: 100%|██████████████████████████████████████████████| 30/30 [03:46<00:00, 7.54s/it] Loading checkpoint shards: 100%|██████████████████████████████████████████████| 30/30 [04:24<00:00, 8.80s/it] 06/20/2024 23:52:48 - WARNING - lmflow.models.hf_decoder_model - Conversation template: ConversationTemplate(user_formatter=StringFormatter(template=[TemplateComponent(type=string, content=<|start_header_id|>user<|end_header_id|>\n\n{{content}}<|eot_id|>)]), assistant_formatter=StringFormatter(template=[TemplateComponent(type=string, content=<|start_header_id|>assistant<|end_header_id|>\n\n{{content}}<|eot_id|>)]), system_formatter=StringFormatter(template=[TemplateComponent(type=string, content=<|start_header_id|>system<|end_header_id|>\n\n{{content}}<|eot_id|>)]), tools_formatter=None, separator=None, special_starter=TemplateComponent(type=token, content=bos_token), special_stopper=None, template_name='llama3') 06/20/2024 23:52:48 - WARNING - lmflow.models.hf_decoder_model - Conversation template: ConversationTemplate(user_formatter=StringFormatter(template=[TemplateComponent(type=string, content=<|start_header_id|>user<|end_header_id|>\n\n{{content}}<|eot_id|>)]), assistant_formatter=StringFormatter(template=[TemplateComponent(type=string, content=<|start_header_id|>assistant<|end_header_id|>\n\n{{content}}<|eot_id|>)]), system_formatter=StringFormatter(template=[TemplateComponent(type=string, content=<|start_header_id|>system<|end_header_id|>\n\n{{content}}<|eot_id|>)]), tools_formatter=None, separator=None, special_starter=TemplateComponent(type=token, content=bos_token), special_stopper=None, template_name='llama3') 06/20/2024 23:52:48 - WARNING - lmflow.models.hf_decoder_model - Conversation template: ConversationTemplate(user_formatter=StringFormatter(template=[TemplateComponent(type=string, content=<|start_header_id|>user<|end_header_id|>\n\n{{content}}<|eot_id|>)]), assistant_formatter=StringFormatter(template=[TemplateComponent(type=string, content=<|start_header_id|>assistant<|end_header_id|>\n\n{{content}}<|eot_id|>)]), system_formatter=StringFormatter(template=[TemplateComponent(type=string, content=<|start_header_id|>system<|end_header_id|>\n\n{{content}}<|eot_id|>)]), tools_formatter=None, separator=None, special_starter=TemplateComponent(type=token, content=bos_token), special_stopper=None, template_name='llama3') 06/20/2024 23:52:48 - WARNING - lmflow.models.hf_decoder_model - Conversation template: ConversationTemplate(user_formatter=StringFormatter(template=[TemplateComponent(type=string, content=<|start_header_id|>user<|end_header_id|>\n\n{{content}}<|eot_id|>)]), assistant_formatter=StringFormatter(template=[TemplateComponent(type=string, content=<|start_header_id|>assistant<|end_header_id|>\n\n{{content}}<|eot_id|>)]), system_formatter=StringFormatter(template=[TemplateComponent(type=string, content=<|start_header_id|>system<|end_header_id|>\n\n{{content}}<|eot_id|>)]), tools_formatter=None, separator=None, special_starter=TemplateComponent(type=token, content=bos_token), special_stopper=None, template_name='llama3') 06/20/2024 23:52:48 - WARNING - lmflow.models.hf_decoder_model - Conversation template: ConversationTemplate(user_formatter=StringFormatter(template=[TemplateComponent(type=string, content=<|start_header_id|>user<|end_header_id|>\n\n{{content}}<|eot_id|>)]), assistant_formatter=StringFormatter(template=[TemplateComponent(type=string, content=<|start_header_id|>assistant<|end_header_id|>\n\n{{content}}<|eot_id|>)]), system_formatter=StringFormatter(template=[TemplateComponent(type=string, content=<|start_header_id|>system<|end_header_id|>\n\n{{content}}<|eot_id|>)]), tools_formatter=None, separator=None, special_starter=TemplateComponent(type=token, content=bos_token), special_stopper=None, template_name='llama3') 06/20/2024 23:52:48 - WARNING - lmflow.models.hf_decoder_model - Conversation template: ConversationTemplate(user_formatter=StringFormatter(template=[TemplateComponent(type=string, content=<|start_header_id|>user<|end_header_id|>\n\n{{content}}<|eot_id|>)]), assistant_formatter=StringFormatter(template=[TemplateComponent(type=string, content=<|start_header_id|>assistant<|end_header_id|>\n\n{{content}}<|eot_id|>)]), system_formatter=StringFormatter(template=[TemplateComponent(type=string, content=<|start_header_id|>system<|end_header_id|>\n\n{{content}}<|eot_id|>)]), tools_formatter=None, separator=None, special_starter=TemplateComponent(type=token, content=bos_token), special_stopper=None, template_name='llama3') 06/20/2024 23:52:48 - WARNING - lmflow.models.hf_decoder_model - Conversation template: ConversationTemplate(user_formatter=StringFormatter(template=[TemplateComponent(type=string, content=<|start_header_id|>user<|end_header_id|>\n\n{{content}}<|eot_id|>)]), assistant_formatter=StringFormatter(template=[TemplateComponent(type=string, content=<|start_header_id|>assistant<|end_header_id|>\n\n{{content}}<|eot_id|>)]), system_formatter=StringFormatter(template=[TemplateComponent(type=string, content=<|start_header_id|>system<|end_header_id|>\n\n{{content}}<|eot_id|>)]), tools_formatter=None, separator=None, special_starter=TemplateComponent(type=token, content=bos_token), special_stopper=None, template_name='llama3') 06/20/2024 23:52:48 - WARNING - lmflow.models.hf_decoder_model - Conversation template: ConversationTemplate(user_formatter=StringFormatter(template=[TemplateComponent(type=string, content=<|start_header_id|>user<|end_header_id|>\n\n{{content}}<|eot_id|>)]), assistant_formatter=StringFormatter(template=[TemplateComponent(type=string, content=<|start_header_id|>assistant<|end_header_id|>\n\n{{content}}<|eot_id|>)]), system_formatter=StringFormatter(template=[TemplateComponent(type=string, content=<|start_header_id|>system<|end_header_id|>\n\n{{content}}<|eot_id|>)]), tools_formatter=None, separator=None, special_starter=TemplateComponent(type=token, content=bos_token), special_stopper=None, template_name='llama3') 06/20/2024 23:52:48 - WARNING - accelerate.utils.other - Detected kernel version 4.18.0, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher. Installed CUDA version 11.2 does not match the version torch was compiled with 11.8 but since the APIs are compatible, accepting this combination Installed CUDA version 11.2 does not match the version torch was compiled with 11.8 but since the APIs are compatible, accepting this combination Installed CUDA version 11.2 does not match the version torch was compiled with 11.8 but since the APIs are compatible, accepting this combination Installed CUDA version 11.2 does not match the version torch was compiled with 11.8 but since the APIs are compatible, accepting this combinationInstalled CUDA version 11.2 does not match the version torch was compiled with 11.8 but since the APIs are compatible, accepting this combination

Installed CUDA version 11.2 does not match the version torch was compiled with 11.8 but since the APIs are compatible, accepting this combination Installed CUDA version 11.2 does not match the version torch was compiled with 11.8 but since the APIs are compatible, accepting this combination Installed CUDA version 11.2 does not match the version torch was compiled with 11.8 but since the APIs are compatible, accepting this combination Installed CUDA version 11.2 does not match the version torch was compiled with 11.8 but since the APIs are compatible, accepting this combination Using usr/torch_extensions/py39_cu118 as PyTorch extensions root... Installed CUDA version 11.2 does not match the version torch was compiled with 11.8 but since the APIs are compatible, accepting this combination Using usr/torch_extensions/py39_cu118 as PyTorch extensions root... Detected CUDA files, patching ldflags Emitting ninja build file usr/torch_extensions/py39_cu118/cpu_adam/build.ninja... usr/anaconda3/envs/lmflow/lib/python3.9/site-packages/torch/utils/cpp_extension.py:1967: UserWarning: TORCH_CUDA_ARCH_LIST is not set, all archs for visible cards are included for compilation. If this is not desired, please set os.environ['TORCH_CUDA_ARCH_LIST']. warnings.warn( Building extension module cpu_adam... Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N) ninja: no work to do. Loading extension module cpu_adam... Time to load cpu_adam op: 2.8547587394714355 seconds Loading extension module cpu_adam... Time to load cpu_adam op: 2.794377326965332 seconds Installed CUDA version 11.2 does not match the version torch was compiled with 11.8 but since the APIs are compatible, accepting this combination Using usr/torch_extensions/py39_cu118 as PyTorch extensions root... Detected CUDA files, patching ldflags Emitting ninja build file usr/torch_extensions/py39_cu118/cpu_adam/build.ninja... usr/anaconda3/envs/lmflow/lib/python3.9/site-packages/torch/utils/cpp_extension.py:1967: UserWarning: TORCH_CUDA_ARCH_LIST is not set, all archs for visible cards are included for compilation. If this is not desired, please set os.environ['TORCH_CUDA_ARCH_LIST']. warnings.warn( Building extension module cpu_adam... Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N) ninja: no work to do. Loading extension module cpu_adam... Time to load cpu_adam op: 2.8901524543762207 seconds Installed CUDA version 11.2 does not match the version torch was compiled with 11.8 but since the APIs are compatible, accepting this combination Using usr/torch_extensions/py39_cu118 as PyTorch extensions root... Installed CUDA version 11.2 does not match the version torch was compiled with 11.8 but since the APIs are compatible, accepting this combination Installed CUDA version 11.2 does not match the version torch was compiled with 11.8 but since the APIs are compatible, accepting this combination Installed CUDA version 11.2 does not match the version torch was compiled with 11.8 but since the APIs are compatible, accepting this combination Installed CUDA version 11.2 does not match the version torch was compiled with 11.8 but since the APIs are compatible, accepting this combination Using usr/torch_extensions/py39_cu118 as PyTorch extensions root... Using usr/torch_extensions/py39_cu118 as PyTorch extensions root... Using usr/torch_extensions/py39_cu118 as PyTorch extensions root... Using usr/torch_extensions/py39_cu118 as PyTorch extensions root... Detected CUDA files, patching ldflags Emitting ninja build file usr/torch_extensions/py39_cu118/cpu_adam/build.ninja... usr/anaconda3/envs/lmflow/lib/python3.9/site-packages/torch/utils/cpp_extension.py:1967: UserWarning: TORCH_CUDA_ARCH_LIST is not set, all archs for visible cards are included for compilation. If this is not desired, please set os.environ['TORCH_CUDA_ARCH_LIST']. warnings.warn( Building extension module cpu_adam... Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N) ninja: no work to do. Loading extension module cpu_adam... Time to load cpu_adam op: 3.5710041522979736 seconds Loading extension module cpu_adam...Loading extension module cpu_adam... Loading extension module cpu_adam...

Loading extension module cpu_adam... Time to load cpu_adam op: 3.6291885375976562 seconds Time to load cpu_adam op: 3.6292471885681152 seconds Time to load cpu_adam op: 3.62994384765625 seconds Time to load cpu_adam op: 3.630335807800293 seconds Parameter Offload: Total persistent parameters: 1318912 in 161 params wandb: Currently logged in as: · (·). Use wandb login --relogin to force relogin wandb: wandb version 0.17.2 is available! To upgrade, please run: wandb: $ pip install wandb --upgrade wandb: Tracking run with wandb version 0.14.0 wandb: Run data is saved locally in usr/LMFlow/wandb/run-20240620_235342-btcnpc4h wandb: Run wandb offline to turn off syncing. wandb: Syncing run mbpp_full wandb: ⭐️ View project at https://wandb.ai/·/huggingface wandb: 🚀 View run at https://wandb.ai/·/huggingface/runs/btcnpc4h 0%| | 0/120 [00:00<?, ?it/s][2024-06-20 23:53:53,238] [WARNING] [parameter_offload.py:86:_apply_to_tensors_only] A module has unknown inputs or outputs type (<class 'transformers.cacheutils.DynamicCache'>) and the tensors embedded in it cannot be detected. The ZeRO-3 hooks designed to trigger before or after backward pass of the module relies on knowing the input and output tensors and therefore may not get triggered properly. usr/anaconda3/envs/lmflow/lib/python3.9/site-packages/torch/autograd/graph.py:744: UserWarning: c10d::broadcast: an autograd kernel was not registered to the Autograd key(s) but we are trying to backprop through it. This may lead to silently incorrect behavior. This behavior is deprecated and will be removed in a future version of PyTorch. If your operator is differentiable, please ensure you have registered an autograd kernel to the correct Autograd key (e.g. DispatchKey::Autograd, DispatchKey::CompositeImplicitAutograd). If your operator is not differentiable, or to squash this warning and use the previous behavior, please register torch::CppFunction::makeFallthrough() to DispatchKey::Autograd. (Triggered internally at /opt/conda/conda-bld/pytorch_1716905971093/work/torch/csrc/autograd/autograd_not_implemented_fallback.cpp:63.) return Variable._execution_engine.runbackward( # Calls into the C++ engine to run the backward pass usr/anaconda3/envs/lmflow/lib/python3.9/site-packages/torch/autograd/graph.py:744: UserWarning: c10d::broadcast: an autograd kernel was not registered to the Autograd key(s) but we are trying to backprop through it. This may lead to silently incorrect behavior. This behavior is deprecated and will be removed in a future version of PyTorch. If your operator is differentiable, please ensure you have registered an autograd kernel to the correct Autograd key (e.g. DispatchKey::Autograd, DispatchKey::CompositeImplicitAutograd). If your operator is not differentiable, or to squash this warning and use the previous behavior, please register torch::CppFunction::makeFallthrough() to DispatchKey::Autograd. (Triggered internally at /opt/conda/conda-bld/pytorch_1716905971093/work/torch/csrc/autograd/autograd_not_implemented_fallback.cpp:63.) return Variable._execution_engine.runbackward( # Calls into the C++ engine to run the backward pass usr/anaconda3/envs/lmflow/lib/python3.9/site-packages/torch/autograd/graph.py:744: UserWarning: c10d::broadcast: an autograd kernel was not registered to the Autograd key(s) but we are trying to backprop through it. This may lead to silently incorrect behavior. This behavior is deprecated and will be removed in a future version of PyTorch. If your operator is differentiable, please ensure you have registered an autograd kernel to the correct Autograd key (e.g. DispatchKey::Autograd, DispatchKey::CompositeImplicitAutograd). If your operator is not differentiable, or to squash this warning and use the previous behavior, please register torch::CppFunction::makeFallthrough() to DispatchKey::Autograd. (Triggered internally at /opt/conda/conda-bld/pytorch_1716905971093/work/torch/csrc/autograd/autograd_not_implemented_fallback.cpp:63.) return Variable._execution_engine.runbackward( # Calls into the C++ engine to run the backward pass usr/anaconda3/envs/lmflow/lib/python3.9/site-packages/torch/autograd/graph.py:744: UserWarning: c10d::broadcast: an autograd kernel was not registered to the Autograd key(s) but we are trying to backprop through it. This may lead to silently incorrect behavior. This behavior is deprecated and will be removed in a future version of PyTorch. If your operator is differentiable, please ensure you have registered an autograd kernel to the correct Autograd key (e.g. DispatchKey::Autograd, DispatchKey::CompositeImplicitAutograd). If your operator is not differentiable, or to squash this warning and use the previous behavior, please register torch::CppFunction::makeFallthrough() to DispatchKey::Autograd. (Triggered internally at /opt/conda/conda-bld/pytorch_1716905971093/work/torch/csrc/autograd/autograd_not_implemented_fallback.cpp:63.) return Variable._execution_engine.runbackward( # Calls into the C++ engine to run the backward pass usr/anaconda3/envs/lmflow/lib/python3.9/site-packages/torch/autograd/graph.py:744: UserWarning: c10d::broadcast: an autograd kernel was not registered to the Autograd key(s) but we are trying to backprop through it. This may lead to silently incorrect behavior. This behavior is deprecated and will be removed in a future version of PyTorch. If your operator is differentiable, please ensure you have registered an autograd kernel to the correct Autograd key (e.g. DispatchKey::Autograd, DispatchKey::CompositeImplicitAutograd). If your operator is not differentiable, or to squash this warning and use the previous behavior, please register torch::CppFunction::makeFallthrough() to DispatchKey::Autograd. (Triggered internally at /opt/conda/conda-bld/pytorch_1716905971093/work/torch/csrc/autograd/autograd_not_implemented_fallback.cpp:63.) return Variable._execution_engine.runbackward( # Calls into the C++ engine to run the backward pass usr/anaconda3/envs/lmflow/lib/python3.9/site-packages/torch/autograd/graph.py:744: UserWarning: c10d::broadcast: an autograd kernel was not registered to the Autograd key(s) but we are trying to backprop through it. This may lead to silently incorrect behavior. This behavior is deprecated and will be removed in a future version of PyTorch. If your operator is differentiable, please ensure you have registered an autograd kernel to the correct Autograd key (e.g. DispatchKey::Autograd, DispatchKey::CompositeImplicitAutograd). If your operator is not differentiable, or to squash this warning and use the previous behavior, please register torch::CppFunction::makeFallthrough() to DispatchKey::Autograd. (Triggered internally at /opt/conda/conda-bld/pytorch_1716905971093/work/torch/csrc/autograd/autograd_not_implemented_fallback.cpp:63.) return Variable._execution_engine.runbackward( # Calls into the C++ engine to run the backward pass usr/anaconda3/envs/lmflow/lib/python3.9/site-packages/torch/autograd/graph.py:744: UserWarning: c10d::broadcast: an autograd kernel was not registered to the Autograd key(s) but we are trying to backprop through it. This may lead to silently incorrect behavior. This behavior is deprecated and will be removed in a future version of PyTorch. If your operator is differentiable, please ensure you have registered an autograd kernel to the correct Autograd key (e.g. DispatchKey::Autograd, DispatchKey::CompositeImplicitAutograd). If your operator is not differentiable, or to squash this warning and use the previous behavior, please register torch::CppFunction::makeFallthrough() to DispatchKey::Autograd. (Triggered internally at /opt/conda/conda-bld/pytorch_1716905971093/work/torch/csrc/autograd/autograd_not_implemented_fallback.cpp:63.) return Variable._execution_engine.runbackward( # Calls into the C++ engine to run the backward pass usr/anaconda3/envs/lmflow/lib/python3.9/site-packages/torch/autograd/graph.py:744: UserWarning: c10d::broadcast: an autograd kernel was not registered to the Autograd key(s) but we are trying to backprop through it. This may lead to silently incorrect behavior. This behavior is deprecated and will be removed in a future version of PyTorch. If your operator is differentiable, please ensure you have registered an autograd kernel to the correct Autograd key (e.g. DispatchKey::Autograd, DispatchKey::CompositeImplicitAutograd). If your operator is not differentiable, or to squash this warning and use the previous behavior, please register torch::CppFunction::makeFallthrough() to DispatchKey::Autograd. (Triggered internally at /opt/conda/conda-bld/pytorch_1716905971093/work/torch/csrc/autograd/autograd_not_implemented_fallback.cpp:63.) return Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass usr/anaconda3/envs/lmflow/lib/python3.9/site-packages/deepspeed/runtime/zero/stage3.py:1252: UserWarning: The torch.cuda.DtypeTensor constructors are no longer recommended. It's best to use methods such as torch.tensor(data, dtype=, device='cuda') to create tensors. (Triggered internally at /opt/conda/conda-bld/pytorch_1716905971093/work/torch/csrc/tensor/python_tensor.cpp:78.) total_norm_cuda = get_accelerator().FloatTensor([float(total_norm)]) usr/anaconda3/envs/lmflow/lib/python3.9/site-packages/deepspeed/runtime/zero/stage3.py:1252: UserWarning: The torch.cuda.DtypeTensor constructors are no longer recommended. It's best to use methods such as torch.tensor(data, dtype=, device='cuda') to create tensors. (Triggered internally at /opt/conda/conda-bld/pytorch_1716905971093/work/torch/csrc/tensor/python_tensor.cpp:78.) total_norm_cuda = get_accelerator().FloatTensor([float(total_norm)]) usr/anaconda3/envs/lmflow/lib/python3.9/site-packages/deepspeed/runtime/zero/stage3.py:1252: UserWarning: The torch.cuda.DtypeTensor constructors are no longer recommended. It's best to use methods such as torch.tensor(data, dtype=, device='cuda') to create tensors. (Triggered internally at /opt/conda/conda-bld/pytorch_1716905971093/work/torch/csrc/tensor/python_tensor.cpp:78.) total_norm_cuda = get_accelerator().FloatTensor([float(total_norm)]) usr/anaconda3/envs/lmflow/lib/python3.9/site-packages/deepspeed/runtime/zero/stage3.py:1252: UserWarning: The torch.cuda.DtypeTensor constructors are no longer recommended. It's best to use methods such as torch.tensor(data, dtype=, device='cuda') to create tensors. (Triggered internally at /opt/conda/conda-bld/pytorch_1716905971093/work/torch/csrc/tensor/python_tensor.cpp:78.) total_norm_cuda = get_accelerator().FloatTensor([float(total_norm)]) usr/anaconda3/envs/lmflow/lib/python3.9/site-packages/deepspeed/runtime/zero/stage3.py:1252: UserWarning: The torch.cuda.DtypeTensor constructors are no longer recommended. It's best to use methods such as torch.tensor(data, dtype=, device='cuda') to create tensors. (Triggered internally at /opt/conda/conda-bld/pytorch_1716905971093/work/torch/csrc/tensor/python_tensor.cpp:78.) total_norm_cuda = get_accelerator().FloatTensor([float(total_norm)]) usr/anaconda3/envs/lmflow/lib/python3.9/site-packages/deepspeed/runtime/zero/stage3.py:1252: UserWarning: The torch.cuda.DtypeTensor constructors are no longer recommended. It's best to use methods such as torch.tensor(data, dtype=, device='cuda') to create tensors. (Triggered internally at /opt/conda/conda-bld/pytorch_1716905971093/work/torch/csrc/tensor/python_tensor.cpp:78.) total_norm_cuda = get_accelerator().FloatTensor([float(total_norm)]) usr/anaconda3/envs/lmflow/lib/python3.9/site-packages/deepspeed/runtime/zero/stage3.py:1252: UserWarning: The torch.cuda.DtypeTensor constructors are no longer recommended. It's best to use methods such as torch.tensor(data, dtype=, device='cuda') to create tensors. (Triggered internally at /opt/conda/conda-bld/pytorch_1716905971093/work/torch/csrc/tensor/python_tensor.cpp:78.) total_norm_cuda = get_accelerator().FloatTensor([float(total_norm)]) usr/anaconda3/envs/lmflow/lib/python3.9/site-packages/deepspeed/runtime/zero/stage3.py:1252: UserWarning: The torch.cuda.DtypeTensor constructors are no longer recommended. It's best to use methods such as torch.tensor(data, dtype=, device='cuda') to create tensors. (Triggered internally at /opt/conda/conda-bld/pytorch_1716905971093/work/torch/csrc/tensor/python_tensor.cpp:78.) total_norm_cuda = get_accelerator().FloatTensor([float(total_norm)]) usr/anaconda3/envs/lmflow/lib/python3.9/site-packages/psutil/init.py:2008: RuntimeWarning: available memory stats couldn't be determined and was set to 0 ret = _psplatform.virtual_memory() [2024-06-20 23:57:07,420] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 29901 [2024-06-20 23:57:19,311] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 29902 [2024-06-20 23:57:38,331] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 29903 [2024-06-20 23:57:52,665] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 29904 [2024-06-20 23:57:52,665] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 29905 [2024-06-20 23:58:06,209] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 29906 [2024-06-20 23:58:19,260] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 29907 [2024-06-20 23:58:33,505] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 29971 [2024-06-20 23:58:49,679] [ERROR] [launch.py:321:sigkill_handler] ['usr/anaconda3/envs/lmflow/bin/python', '-u', 'usr/LMFlow/examples/finetune.py', '--local_rank=7', '--model_name_or_path', 'usr/huggingface/hub/LLM-Research/Meta-Llama-3-70B-Instruct', '--trust_remote_code', 'True', '--dataset_path', 'usr/LMFlow/data/mbpp/train_conversation', '--output_dir', 'output_models/mbpp_full', '--overwrite_output_dir', '--conversation_template', 'llama3', '--num_train_epochs', '3', '--learning_rate', '2e-5', '--disable_group_texts', '1', '--block_size', '1024', '--per_device_train_batch_size', '1', '--deepspeed', 'configs/ds_config_zero3.json', '--bf16', '--run_name', 'mbpp_full', '--validation_split_percentage', '0', '--logging_steps', '20', '--do_train', '--ddp_timeout', '72000', '--save_steps', '5000', '--dataloader_num_workers', '32'] exits with return code = -9 `

And my run_finetune.sh is:

#!/bin/bash
# Please run this script under ${project_id} in project directory of
#   https://github.com/shizhediao/llm-ft
#     COMMIT: d5fecf30ba8011067b10cf51fede53a5ab6574e4
export NCCL_SOCKET_IFNAME=eth2
export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True

# Parses arguments
model_name_or_path=usr/huggingface/hub/LLM-Research/Meta-Llama-3-70B-Instruct
dataset_path=usr/LMFlow/data/mbpp/train_conversation
# dataset_path=usr/LMFlow/data/alpaca/train_conversation
output_dir=output_models/mbpp_full
deepspeed_args="--master_port=11000"
conversation_template=llama3

# Safety related arguments
trust_remote_code=True #0

while [[ $# -ge 1 ]]; do
  key="$1"
  case ${key} in
    -m|--model_name_or_path)
      model_name_or_path="$2"
      shift
      ;;
    -d|--dataset_path)
      dataset_path="$2"
      shift
      ;;
    -o|--output_model_path)
      output_dir="$2"
      shift
      ;;
    --conversation_template)
      conversation_template="$2"
      shift
      ;;
    --deepspeed_args)
      deepspeed_args="$2"
      shift
      ;;
    --trust_remote_code)
      trust_remote_code="$2"
      shift
      ;;
    *)
      echo "error: unknown option \"${key}\"" 1>&2
      exit 1
  esac
  shift
done

# Finetune
exp_id=finetune
project_dir=$(cd "$(dirname $0)"/..; pwd)
log_dir=${project_dir}/log/${exp_id}
mkdir -p ${output_dir} ${log_dir}

deepspeed ${deepspeed_args} \
  usr/LMFlow/examples/finetune.py \
    --model_name_or_path ${model_name_or_path} \
    --trust_remote_code ${trust_remote_code} \
    --dataset_path ${dataset_path} \
    --output_dir ${output_dir} --overwrite_output_dir \
    --conversation_template ${conversation_template} \
    --num_train_epochs 3 \
    --learning_rate 2e-5 \
    --disable_group_texts 1 \
    --block_size 1024 \
    --per_device_train_batch_size 1 \
    --deepspeed configs/ds_config_zero3.json \
    --bf16 \
    --run_name mbpp_full \
    --validation_split_percentage 0 \
    --logging_steps 20 \
    --do_train \
    --ddp_timeout 72000 \
    --save_steps 5000 \
    --dataloader_num_workers 32 \
    | tee ${log_dir}/train.log \
    2> ${log_dir}/train.err

I have two questions：

Why "16-bits training: False"? What else do I need to set up? (Currently only changing 'fp16' to 'bf16')
Why is it that there seems to be a network error at the end that caused kill?

OptimalScale / LMFlow

Full parameter fine-tuning bugs #864