Open Chopinxb opened 1 year ago
Is there a problem with scripts/lora/lora.sh?
Error logs:
[INFO] date:2023-08-14 21:09:52
[W socket.cpp:426] [c10d] The server socket cannot be initialized on [::]:29500 (errno: 97 - Address family not supported by protocol).
[W socket.cpp:601] [c10d] The client socket cannot be initialized to connect to [localhost]:29500 (errno: 97 - Address family not supported by protocol).
[W socket.cpp:601] [c10d] The client socket cannot be initialized to connect to [localhost]:29500 (errno: 97 - Address family not supported by protocol).
[2023-08-14 21:09:57,184] [INFO] [real_accelerator.py:133:get_accelerator] Setting ds_accelerator to cuda (auto detect)
/models/WizardCoder-15B-V1.0
[2023-08-14 21:09:59,612] [WARNING] [comm.py:152:init_deepspeed_backend] NCCL backend in DeepSpeed not yet implemented
[2023-08-14 21:09:59,612] [INFO] [comm.py:616:init_distributed] cdb=None
[2023-08-14 21:09:59,612] [INFO] [comm.py:643:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
[W socket.cpp:601] [c10d] The client socket cannot be initialized to connect to [localhost]:29500 (errno: 97 - Address family not supported by protocol).
[W socket.cpp:601] [c10d] The client socket cannot be initialized to connect to [localhost]:29500 (errno: 97 - Address family not supported by protocol).
WARNING:root:Process rank: 0, device: cuda:0, n_gpu: 1
WARNING:root:distributed training: True, 16-bits training: False
WARNING:root:Training parameters TrainingArguments( _n_gpu=1, adafactor=False, adam_beta1=0.9, adam_beta2=0.999, adam_epsilon=1e-08, auto_find_batch_size=False, bf16=False, bf16_full_eval=False, cache_dir=None, data_seed=None, dataloader_drop_last=False, dataloader_num_workers=0, dataloader_pin_memory=True, ddp_backend=None, ddp_broadcast_buffers=None, ddp_bucket_cap_mb=None, ddp_find_unused_parameters=None, ddp_timeout=1800, debug=[], deepspeed=scripts/ds_config/zero3_auto.json, disable_tqdm=False, do_eval=True, do_predict=False, do_train=True, eval_accumulation_steps=None, eval_delay=0, eval_steps=None, evaluation_strategy=no, fp16=False, fp16_backend=auto, fp16_full_eval=False, fp16_opt_level=O1, fsdp=[], fsdp_config={'fsdp_min_num_params': 0, 'xla': False, 'xla_fsdp_grad_ckpt': False}, fsdp_min_num_params=0, fsdp_transformer_layer_cls_to_wrap=None, full_determinism=False, full_finetune=False, generation_config=None, generation_max_length=None, generation_num_beams=None, gradient_accumulation_steps=8, gradient_checkpointing=True, greater_is_better=None, group_by_length=True, half_precision_backend=auto, hub_model_id=None, hub_private_repo=False, hub_strategy=every_save, hub_token=<HUB_TOKEN>, ignore_data_skip=False, include_inputs_for_metrics=False, jit_mode_eval=False, label_names=None, label_smoothing_factor=0.0, learning_rate=2e-05, length_column_name=length, load_best_model_at_end=False, local_rank=0, log_level=passive, log_level_replica=warning, log_on_each_node=True, logging_dir=adapter/runs/Aug14_21-09-59_vipdata-gpu-108-236.serving.ai.paas, logging_first_step=False, logging_nan_inf_filter=True, logging_steps=1, logging_strategy=steps, lr_scheduler_type=cosine, max_grad_norm=0.3, max_steps=10000, metric_for_best_model=None, model_max_length=2048, mp_parameters=, no_cuda=False, num_train_epochs=3.0, optim=adamw_torch, optim_args=None, output_dir=adapter, overwrite_output_dir=False, past_index=-1, per_device_eval_batch_size=4, per_device_train_batch_size=4, predict_with_generate=False, prediction_loss_only=False, push_to_hub=False, push_to_hub_model_id=None, push_to_hub_organization=None, push_to_hub_token=<PUSH_TO_HUB_TOKEN>, ray_scope=last, remove_unused_columns=False, report_to=['wandb'], resume_from_checkpoint=None, run_name=adapter, sample_generate=False, save_on_each_node=False, save_safetensors=False, save_steps=500, save_strategy=steps, save_total_limit=5, seed=42, sharded_ddp=[], skip_memory_metrics=True, sortish_sampler=False, tf32=None, torch_compile=False, torch_compile_backend=None, torch_compile_mode=None, torchdynamo=None, tpu_metrics_debug=False, tpu_num_cores=None, train_on_source=False, use_ipex=False, use_legacy_prediction_loop=False, use_mps_device=False, warmup_ratio=0.03, warmup_steps=0, weight_decay=0.0, xpu_backend=None, )
device_map: {'': 0}
Loading Model from /models/Baichuan-13B-Chat...
/home/chopin/miniconda3/envs/ft/lib/python3.10/site-packages/transformers/configuration_utils.py:483: FutureWarning: The `use_auth_token` argument is deprecated and will be removed in v5 of Transformers.
warnings.warn(
/home/chopin/miniconda3/envs/ft/lib/python3.10/site-packages/transformers/modeling_utils.py:2193: FutureWarning: The `use_auth_token` argument is deprecated and will be removed in v5 of Transformers.
warnings.warn(
Traceback (most recent call last):
  File "/home/chopin/code/DB-GPT-Hub/train_lora.py", line 310, in <module>
    train()
  File "/home/chopin/code/DB-GPT-Hub/train_lora.py", line 261, in train
    model, tokenizer = load_model_tokenizer(args=args)
  File "/home/chopin/code/DB-GPT-Hub/train_lora.py", line 169, in load_model_tokenizer
    model = AutoModelForCausalLM.from_pretrained(
  File "/home/chopin/miniconda3/envs/ft/lib/python3.10/site-packages/transformers/models/auto/auto_factory.py", line 488, in from_pretrained
    return model_class.from_pretrained(
  File "/home/chopin/miniconda3/envs/ft/lib/python3.10/site-packages/transformers/modeling_utils.py", line 2247, in from_pretrained
    raise ValueError(
ValueError: DeepSpeed Zero-3 is not compatible with `low_cpu_mem_usage=True` or with passing a `device_map`.
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 881893) of binary: /usr/local/bin/python3.10
Traceback (most recent call last):
  File "/home/chopin/.local/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/home/chopin/.local/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "/home/chopin/.local/lib/python3.10/site-packages/torch/distributed/run.py", line 794, in main
    run(args)
  File "/home/chopin/.local/lib/python3.10/site-packages/torch/distributed/run.py", line 785, in run
    elastic_launch(
  File "/home/chopin/.local/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/chopin/.local/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 250, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
train_lora.py FAILED
------------------------------------------------------------
Failures:
<NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2023-08-14_21:10:04
  host      : vipdata-gpu-108-236.serving.ai.paas
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 881893)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
finished
Script:
CUDA_VISIBLE_DEVICES=3,4,5 torchrun --nproc_per_node=1 train_lora.py \
    --model_name_or_path /models/Baichuan-13B-Chat \
    --dataset_name spider \
    --output_dir adapter \
    --lora_target_modules W_pack \
    --num_train_epochs 3 \
    --per_device_train_batch_size 4 \
    --per_device_eval_batch_size 4 \
    --gradient_accumulation_steps 8 \
    --evaluation_strategy "no" \
    --save_strategy "steps" \
    --save_steps 500 \
    --save_total_limit 5 \
    --learning_rate 2e-5 \
    --weight_decay 0. \
    --warmup_ratio 0.03 \
    --optim "adamw_torch" \
    --lr_scheduler_type "cosine" \
    --model_max_length 2048 \
    --logging_steps 1 \
    --do_train \
    --do_eval \
    --trust_remote_code \
    --gradient_checkpointing True \
    --deepspeed "scripts/ds_config/zero3_auto.json"
Please help identify where the issue lies. I encountered some parameter-related issues when using the original script, so I modified these parameters: [--trust_remote_code, --dataset_name spider].
I tried modifying the 'train_lora.py' script: after commenting out the following line, the error went away.
AutoModelForCausalLM.from_pretrained(
    args.model_name_or_path,
    # device_map=device_map,
    quantization_config=BitsAndBytesConfig(
        load_in_4bit=True,
        llm_int8_threshold=6.0,
        llm_int8_has_fp16_weight=False,
        bnb_4bit_use_double_quant=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=compute_dtype,
    )
    if args.q_lora
    else None,
    torch_dtype=compute_dtype,
    **config_kwargs,
)
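The underlying conflict is that DeepSpeed ZeRO-3 shards the model weights itself, so `from_pretrained` rejects an explicit `device_map` (which also implies `low_cpu_mem_usage=True`). Instead of commenting the argument out unconditionally, the loading kwargs could be assembled conditionally on whether ZeRO-3 is active. This is only a minimal sketch, not the repository's code: `build_load_kwargs` is a hypothetical helper, and the quantization config is reduced to a plain dict so the sketch runs without bitsandbytes installed.

```python
def build_load_kwargs(zero3_enabled, q_lora, device_map, compute_dtype):
    """Assemble kwargs for AutoModelForCausalLM.from_pretrained.

    Under DeepSpeed ZeRO-3 the weights are sharded by DeepSpeed itself,
    so neither device_map nor low_cpu_mem_usage may be passed.
    """
    kwargs = {"torch_dtype": compute_dtype}
    if not zero3_enabled:
        # Only place the model explicitly when ZeRO-3 is not managing it.
        kwargs["device_map"] = device_map
    if q_lora:
        # Stand-in for BitsAndBytesConfig(...); shown as a dict purely
        # for illustration.
        kwargs["quantization_config"] = {
            "load_in_4bit": True,
            "bnb_4bit_quant_type": "nf4",
            "bnb_4bit_compute_dtype": compute_dtype,
        }
    return kwargs


# With ZeRO-3 active, device_map is simply omitted, avoiding the ValueError.
zero3_kwargs = build_load_kwargs(True, False, {"": 0}, "bfloat16")
assert "device_map" not in zero3_kwargs
```

If I recall correctly, transformers of that era exposed a helper (`transformers.deepspeed.is_deepspeed_zero3_enabled()`) that could supply the `zero3_enabled` flag once the DeepSpeed config has been loaded; worth verifying against the installed version.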