QwenLM / Qwen2

Qwen2 is the large language model series developed by the Qwen team at Alibaba Cloud.

After fine-tuning with the official JSON example, the output directory is empty and contains no files. What is going on? Thanks. #744

Closed heiheiheibj closed 2 days ago

heiheiheibj commented 2 months ago

python finetune.py \
    --model_name_or_path /mnt/k/Qwen2/Qwen2-1.5B-Instruct \
    --data_path example_data.jsonl \
    --bf16 False \
    --output_dir output_qwen \
    --num_train_epochs 5 \
    --per_device_train_batch_size 2 \
    --per_device_eval_batch_size 1 \
    --gradient_accumulation_steps 8 \
    --evaluation_strategy "no" \
    --save_strategy "steps" \
    --save_steps 10 \
    --save_total_limit 10 \
    --learning_rate 3e-4 \
    --weight_decay 0.01 \
    --adam_beta2 0.95 \
    --warmup_ratio 0.01 \
    --lr_scheduler_type "cosine" \
    --logging_steps 1 \
    --report_to "none" \
    --model_max_length 512 \
    --lazy_preprocess True \
    --use_lora False \
    --q_lora False \
    --gradient_checkpointing \
    --deepspeed ds_config_zero3.json

[2024-07-04 02:10:17,055] [WARNING] [real_accelerator.py:162:get_accelerator] Setting accelerator to CPU. If you have GPU or other accelerator, we were unable to detect it.
[2024-07-04 02:10:17,071] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cpu (auto detect)
/root/anaconda3/envs/qanything/lib/python3.10/site-packages/transformers/deepspeed.py:24: FutureWarning: transformers.deepspeed module is deprecated and will be removed in a future version. Please import deepspeed modules directly from transformers.integrations
  warnings.warn(
/root/anaconda3/envs/qanything/lib/python3.10/site-packages/transformers/training_args.py:1494: FutureWarning: evaluation_strategy is deprecated and will be removed in version 4.46 of 🤗 Transformers. Use eval_strategy instead
  warnings.warn(
model_args= /mnt/k/Qwen2/Qwen2-1.5B-Instruct
[2024-07-04 02:10:25,681] [INFO] [comm.py:637:init_distributed] cdb=None
[2024-07-04 02:10:25,682] [INFO] [comm.py:652:init_distributed] Not using the DeepSpeed or dist launchers, attempting to detect MPI environment...
[2024-07-04 02:10:25,749] [INFO] [comm.py:702:mpi_discovery] Discovered MPI settings of world_rank=0, local_rank=0, world_size=1, master_addr=172.24.252.210, master_port=29500
[2024-07-04 02:10:25,749] [INFO] [comm.py:668:init_distributed] Initializing TorchBackend in DeepSpeed with backend gloo
Using /root/.cache/torch_extensions/py310_cu121 as PyTorch extensions root...
Emitting ninja build file /root/.cache/torch_extensions/py310_cu121/deepspeed_shm_comm/build.ninja...
Building extension module deepspeed_shm_comm...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module deepspeed_shm_comm...
Time to load deepspeed_shm_comm op: 0.11154341697692871 seconds
DeepSpeed deepspeed.ops.comm.deepspeed_shm_comm_op built successfully
[2024-07-04 02:10:27,630] [INFO] [partition_parameters.py:345:__exit__] finished initializing model - num_params = 339, num_elems = 1.78B
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Loading data...
Formatting inputs...Skip in lazy mode
Using /root/.cache/torch_extensions/py310_cu121 as PyTorch extensions root...
Emitting ninja build file /root/.cache/torch_extensions/py310_cu121/fused_adam/build.ninja...
Building extension module fused_adam...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module fused_adam...
Time to load fused_adam op: 2.466430902481079 seconds
Parameter Offload: Total persistent parameters: 144896 in 141 params
Killed

The output_qwen directory was created, but there is nothing inside it. Thanks.

jklj077 commented 2 months ago

The log you've shared indicates that your training job encountered an issue and was terminated before completion. Here's an analysis of the log and potential solutions for the issues identified:

  1. The log shows that the training is set to use the CPU rather than a GPU or other accelerator:

    Setting accelerator to CPU. If you have GPU or other accelerator, we were unable to detect it.

    Ensure that your environment has access to GPUs and that they are properly configured. Check your CUDA version and confirm that PyTorch can see your GPU(s): run nvidia-smi to check whether your GPUs are recognized, and python -c "import torch; print(torch.cuda.is_available())" to verify that PyTorch can access them (a combined check sketch follows this list).

    The provided bash script should help you set up the distributed environment.

  2. The last line of the log indicates that the process was killed:

    Killed

    This could be due to insufficient memory (on Linux, an abrupt "Killed" with no traceback usually means the out-of-memory killer terminated the process), resource limits, or other system constraints. Monitor your system resources during training. Increase the available memory if possible, or reduce the batch size and other resource-intensive settings (an illustrative reduced-memory command follows the summary below).
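As a quick reference, here is a minimal check sketch combining the two commands mentioned in point 1. It assumes a standard NVIDIA driver and PyTorch installation and only reports what it finds:

    # Does the NVIDIA driver see any GPUs? (prints a device table if so)
    nvidia-smi

    # Was PyTorch built with CUDA support, and how many GPUs can it reach?
    python -c "import torch; print(torch.cuda.is_available(), torch.cuda.device_count())"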

To summarize, address the issues mentioned above, particularly focusing on ensuring that your GPU is properly configured and that your system has enough resources to handle the training job.
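Regarding point 2, below is a hedged sketch of the kind of memory-reducing adjustments meant above. It reuses only flags that already appear in your original command; the specific values (batch size 1, more gradient accumulation, shorter sequences, LoRA enabled) are illustrative assumptions rather than verified settings:

    # Illustrative lower-memory variant of the original command (flag values are assumptions):
    # a smaller micro-batch with more gradient accumulation, shorter sequences, and LoRA
    # adapters instead of full-parameter training. Other flags are omitted for brevity.
    python finetune.py \
        --model_name_or_path /mnt/k/Qwen2/Qwen2-1.5B-Instruct \
        --data_path example_data.jsonl \
        --output_dir output_qwen \
        --per_device_train_batch_size 1 \
        --gradient_accumulation_steps 16 \
        --model_max_length 256 \
        --use_lora True \
        --gradient_checkpointing \
        --deepspeed ds_config_zero3.json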

heiheiheibj commented 2 months ago

Thank you very much for your reply. I am on CPU only, without a GPU, and have 32 GB of RAM. Can fine-tuning run on CPU?

jklj077 commented 2 months ago

I don't think so. Even if it could, it is not recommended.

github-actions[bot] commented 1 week ago

This issue has been automatically marked as inactive due to lack of recent activity. Should you believe it remains unresolved and warrants attention, kindly leave a comment on this thread.