QwenLM / Qwen

The official repo of Qwen (通义千问) chat & pretrained large language model proposed by Alibaba Cloud.

[BUG] How should I troubleshoot Trainer.train() hanging? #1150

Closed. sunmoon-1024 closed this issue 6 months ago.

sunmoon-1024 commented 6 months ago

Is there an existing issue / discussion for this?

Is there an existing answer for this in FAQ?

Current Behavior

The checkpoint loads normally, but after the data is read, training never actually starts. Could this be related to the kernel version, or is that just a warning? The training script is finetune_lora_single_gpu.sh.

2024-03-14T06:35:42.275070377Z [2024-03-14 14:35:42,274] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
2024-03-14T06:35:42.447084966Z /usr/local/anaconda3/lib/python3.9/site-packages/torch/cuda/__init__.py:141: UserWarning: CUDA initialization: The NVIDIA driver on your system is too old (found version 11000). Please update your GPU driver by downloading and installing a new version from the URL: http://www.nvidia.com/Download/index.aspx Alternatively, go to: https://pytorch.org/ to install a PyTorch version that has been compiled with your version of the CUDA driver. (Triggered internally at ../c10/cuda/CUDAFunctions.cpp:108.)
2024-03-14T06:35:42.447121925Z   return torch._C._cuda_getDeviceCount() > 0
2024-03-14T06:35:42.682857922Z /usr/local/anaconda3/lib/python3.9/site-packages/transformers/utils/generic.py:260: UserWarning: torch.utils._pytree._register_pytree_node is deprecated. Please use torch.utils._pytree.register_pytree_node instead.
2024-03-14T06:35:42.682909278Z   torch.utils._pytree._register_pytree_node(
2024-03-14T06:35:43.031156965Z /usr/local/anaconda3/lib/python3.9/site-packages/transformers/utils/generic.py:260: UserWarning: torch.utils._pytree._register_pytree_node is deprecated. Please use torch.utils._pytree.register_pytree_node instead.
2024-03-14T06:35:43.031182253Z   torch.utils._pytree._register_pytree_node(
2024-03-14T06:36:54.492090863Z Loading checkpoint shards:   0%|          | 0/10 [00:00<?, ?it/s]
Loading checkpoint shards:  10%|█         | 1/10 [00:01<00:13,  1.52s/it]
Loading checkpoint shards:  20%|██        | 2/10 [00:02<00:11,  1.45s/it]
Loading checkpoint shards:  30%|███       | 3/10 [00:04<00:10,  1.47s/it]
Loading checkpoint shards:  40%|████      | 4/10 [00:05<00:08,  1.48s/it]
Loading checkpoint shards:  50%|█████     | 5/10 [00:07<00:07,  1.49s/it]
Loading checkpoint shards:  60%|██████    | 6/10 [00:08<00:05,  1.50s/it]
Loading checkpoint shards:  70%|███████   | 7/10 [00:10<00:04,  1.50s/it]
Loading checkpoint shards:  80%|████████  | 8/10 [00:12<00:03,  1.53s/it]
Loading checkpoint shards:  90%|█████████ | 9/10 [00:14<00:01,  1.68s/it]
Loading checkpoint shards: 100%|██████████| 10/10 [00:15<00:00,  1.58s/it]
Loading checkpoint shards: 100%|██████████| 10/10 [00:15<00:00,  1.54s/it]
2024-03-14T06:37:02.118424786Z /usr/local/anaconda3/lib/python3.9/site-packages/accelerate/accelerator.py:432: FutureWarning: Passing the following arguments to Accelerator is deprecated and will be removed in version 1.0 of Accelerate: dict_keys(['dispatch_batches']). Please pass an accelerate.DataLoaderConfiguration instead:
2024-03-14T06:37:02.118469350Z dataloader_config = DataLoaderConfiguration(dispatch_batches=None)
2024-03-14T06:37:02.118472125Z   warnings.warn(
2024-03-14T06:37:02.118848645Z Detected kernel version 4.14.105, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher.
2024-03-14T06:37:02.501121965Z Loading data...
2024-03-14T06:37:02.501156470Z finetune/finetune_data_input.json
2024-03-14T06:37:02.501159295Z Formatting inputs...Skip in lazy mode
2024-03-14T06:37:02.501161189Z Load data done
2024-03-14T06:37:02.501163182Z Start trainner init ...
2024-03-14T06:37:02.501164855Z training_args is ...
2024-03-14T06:37:02.501166619Z TrainingArguments(
  _n_gpu=1,
  adafactor=False,
  adam_beta1=0.9,
  adam_beta2=0.95,
  adam_epsilon=1e-08,
  auto_find_batch_size=False,
  bf16=True,
  bf16_full_eval=False,
  cache_dir=/data/finetune/deepspeed_cache/cache,
  data_seed=None,
  dataloader_drop_last=False,
  dataloader_num_workers=0,
  dataloader_pin_memory=True,
  ddp_backend=None,
  ddp_broadcast_buffers=None,
  ddp_bucket_cap_mb=None,
  ddp_find_unused_parameters=None,
  ddp_timeout=1800,
  debug=[],
  deepspeed=None,
  disable_tqdm=False,
  dispatch_batches=None,
  do_eval=False,
  do_predict=False,
  do_train=False,
  eval_accumulation_steps=None,
  eval_delay=0,
  eval_steps=None,
  evaluation_strategy=no,
  fix_vit=True,
  fp16=False,
  fp16_backend=auto,
  fp16_full_eval=False,
  fp16_opt_level=O1,
  fsdp=[],
  fsdp_config={'min_num_params': 0, 'xla': False, 'xla_fsdp_grad_ckpt': False},
  fsdp_min_num_params=0,
  fsdp_transformer_layer_cls_to_wrap=None,
  full_determinism=False,
  gradient_accumulation_steps=8,
  gradient_checkpointing=True,
  greater_is_better=None,
  group_by_length=False,
  half_precision_backend=auto,
  hub_always_push=False,
  hub_model_id=None,
  hub_private_repo=False,
  hub_strategy=every_save,
  hub_token=<HUB_TOKEN>,
  ignore_data_skip=False,
  include_inputs_for_metrics=False,
  jit_mode_eval=False,
  label_names=None,
  label_smoothing_factor=0.0,
  learning_rate=1e-05,
  length_column_name=length,
  load_best_model_at_end=False,
  local_rank=0,
  log_level=passive,
  log_level_replica=warning,
  log_on_each_node=True,
  logging_dir=output_qwen/runs/Mar14_14-35-43_pytorch-404379769-master-0,
  logging_first_step=False,
  logging_nan_inf_filter=True,
  logging_steps=1.0,
  logging_strategy=steps,
  lr_scheduler_type=cosine,
  max_grad_norm=1.0,
  max_steps=-1,
  metric_for_best_model=None,
  model_max_length=2048,
  mp_parameters=,
  no_cuda=False,
  num_train_epochs=5.0,
  optim=adamw_torch,
  optim_args=None,
  output_dir=output_qwen,
  overwrite_output_dir=False,
  past_index=-1,
  per_device_eval_batch_size=1,
  per_device_train_batch_size=1,
  prediction_loss_only=False,
  push_to_hub=False,
  push_to_hub_model_id=None,
  push_to_hub_organization=None,
  push_to_hub_token=<PUSH_TO_HUB_TOKEN>,
  ray_scope=last,
  remove_unused_columns=True,
  report_to=[],
  resume_from_checkpoint=None,
  run_name=output_qwen,
  save_on_each_node=False,
  save_safetensors=False,
  save_steps=1000,
  save_strategy=steps,
  save_total_limit=10,
  seed=42,
  sharded_ddp=[],
  skip_memory_metrics=True,
  tf32=None,
  torch_compile=False,
  torch_compile_backend=None,
  torch_compile_mode=None,
  torchdynamo=None,
  tpu_metrics_debug=False,
  tpu_num_cores=None,
  use_cpu=False,
  use_ipex=False,
  use_legacy_prediction_loop=False,
  use_lora=True,
  use_mps_device=False,
  warmup_ratio=0.01,
  warmup_steps=0,
  weight_decay=0.1,
)
2024-03-14T06:43:08.101088952Z   0%|          | 0/625 [00:00<?, ?it/s]/usr/local/anaconda3/lib/python3.9/site-packages/torch/utils/checkpoint.py:460: UserWarning: torch.utils.checkpoint: please pass in use_reentrant=True or use_reentrant=False explicitly. The default value of use_reentrant will be updated to be False in the future. To maintain current behavior, pass use_reentrant=True. It is recommended that you use use_reentrant=False. Refer to docs for more details on the differences between the two variants.
2024-03-14T06:43:08.101143535Z   warnings.warn(
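As a rough sketch of how to see where the process is actually stuck (assuming the training entry script can be edited and the process stays alive; the 300-second interval is arbitrary, and none of this is part of the shipped finetune scripts):

```python
# Hypothetical debugging aid, not part of finetune.py as shipped: add these lines near
# the top of the training entry script to find out what a hung Trainer.train() is doing.
import sys
import platform
import faulthandler

# The kernel-version warning above is relevant: report what kernel the container runs on.
print("kernel version:", platform.release(), file=sys.stderr)

# Every 300 seconds, dump the Python stack of every thread to stderr. If training hangs,
# the repeated traceback shows the blocking call (e.g. a DataLoader worker or device sync).
faulthandler.dump_traceback_later(timeout=300, repeat=True)

# ... the rest of the script then builds the model and calls trainer.train() as usual.
```

If every dump shows the same frame inside the forward/backward pass, the process is still computing (just very slowly); if it shows a wait or a device call, that is where it is blocked.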

Expected Behavior

No response

Steps To Reproduce

No response

Environment

Python Version: 3.9.12
PyTorch Version: 2.2.1+cu121
PyTorch CUDA Version: 12.1
transformers: 4.32.0

Anything else?

No response

jklj077 commented 6 months ago

Try upgrading the NVIDIA driver first. Your current driver only supports CUDA up to version 11.x, not version 12, while your PyTorch build (2.2.1+cu121) expects CUDA 12.1.
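A minimal sketch to confirm the mismatch locally, assuming the same conda environment shown in the log (run it before launching finetune_lora_single_gpu.sh):

```python
# Sanity check for the driver/runtime mismatch described above; the environment and
# interpretation in the comments are assumptions based on this issue's log.
import torch

print("torch version:", torch.__version__)        # e.g. 2.2.1+cu121
print("built against CUDA:", torch.version.cuda)  # 12.1 for the +cu121 wheel
print("CUDA available:", torch.cuda.is_available())

if torch.cuda.is_available():
    print("device:", torch.cuda.get_device_name(0))
else:
    # With a driver that only supports CUDA 11.x (the "found version 11000" warning),
    # is_available() returns False and training silently falls back to CPU, which for a
    # model of this size can look like a hang. Either upgrade the driver or install a
    # PyTorch wheel built for CUDA 11.x (e.g. +cu118) that matches the driver.
    print("CUDA not usable with the current driver")
```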