ykallan opened this issue 4 months ago
I then adjusted the fine-tuning arguments:
from transformers import DataCollatorForSeq2Seq, Trainer, TrainingArguments

# model, tokenizer, and tokenized_id are prepared earlier in the script
args = TrainingArguments(
    output_dir="./output/llama3",
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
    logging_steps=10,
    num_train_epochs=16,
    save_steps=300,
    learning_rate=1e-4,
    save_on_each_node=True,
    gradient_checkpointing=True,
    fp16=True,  # enabled this line
)
trainer = Trainer(
    model=model,
    args=args,
    train_dataset=tokenized_id,
    data_collator=DataCollatorForSeq2Seq(tokenizer=tokenizer, padding=True),
)
trainer.train()
Training then fails with:
C:\ProgramData\miniconda3\envs\llama\lib\site-packages\accelerate\accelerator.py:446: FutureWarning: Passing the following arguments to `Accelerator` is deprecated and will be removed in version 1.0 of Accelerate: dict_keys(['dispatch_batches', 'split_batches']). Please pass an `accelerate.DataLoaderConfiguration` instead:
dataloader_config = DataLoaderConfiguration(dispatch_batches=None, split_batches=False)
warnings.warn(
0%| | 0/10000 [00:00<?, ?it/s]`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
C:\ProgramData\miniconda3\envs\llama\lib\site-packages\torch\utils\checkpoint.py:464: UserWarning: torch.utils.checkpoint: the use_reentrant parameter should be passed explicitly. In version 2.4 we will raise an exception if use_reentrant is not passed. use_reentrant=False is recommended, but if you need to preserve the current default behavior, you can pass use_reentrant=True. Refer to docs for more details on the differences between the two variants.
warnings.warn(
C:\ProgramData\miniconda3\envs\llama\lib\site-packages\transformers\models\llama\modeling_llama.py:728: UserWarning: 1Torch was not compiled with flash attention. (Triggered internally at C:\cb\pytorch_1000000000000\work\aten\src\ATen\native\transformers\cuda\sdp_utils.cpp:455.)
attn_output = torch.nn.functional.scaled_dot_product_attention(
Traceback (most recent call last):
File "D:\codes\llm_about\self-llm\zzzzz_train\llama38B\finetune_llama3_8b.py", line 95, in <module>
trainer.train()
File "C:\ProgramData\miniconda3\envs\llama\lib\site-packages\transformers\trainer.py", line 1539, in train
return inner_training_loop(
File "C:\ProgramData\miniconda3\envs\llama\lib\site-packages\transformers\trainer.py", line 1911, in _inner_training_loop
self.accelerator.clip_grad_norm_(
File "C:\ProgramData\miniconda3\envs\llama\lib\site-packages\accelerate\accelerator.py", line 2269, in clip_grad_norm_
self.unscale_gradients()
File "C:\ProgramData\miniconda3\envs\llama\lib\site-packages\accelerate\accelerator.py", line 2219, in unscale_gradients
self.scaler.unscale_(opt)
File "C:\ProgramData\miniconda3\envs\llama\lib\site-packages\torch\amp\grad_scaler.py", line 337, in unscale_
optimizer_state["found_inf_per_device"] = self._unscale_grads_(
File "C:\ProgramData\miniconda3\envs\llama\lib\site-packages\torch\amp\grad_scaler.py", line 278, in _unscale_grads_
torch._amp_foreach_non_finite_check_and_unscale_(
RuntimeError: "_amp_foreach_non_finite_check_and_unscale_cuda" not implemented for 'BFloat16'
0%| | 0/10000 [00:02<?, ?it/s]
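The RuntimeError at the bottom points at a precision mismatch rather than a CUDA installation problem: fp16=True makes the Trainer drive the optimizer through a torch.amp GradScaler, and the fused kernel behind scaler.unscale_ has no bfloat16 implementation, so it aborts as soon as any trainable parameter carries bf16 gradients (typical when the base model is loaded with torch_dtype=torch.bfloat16, as in the tutorial). A quick diagnostic sketch, assuming model is the PEFT-wrapped model from the script:

# List the dtype of every trainable parameter; any torch.bfloat16 entry
# here will crash GradScaler.unscale_ once fp16=True is enabled.
for name, param in model.named_parameters():
    if param.requires_grad:
        print(name, param.dtype)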
The GPU is an RTX 3090; CUDA and cuDNN have been updated to the latest version, 12.1.
nvcc -V:
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Mon_Apr__3_17:36:15_Pacific_Daylight_Time_2023
Cuda compilation tools, release 12.1, V12.1.105
Build cuda_12.1.r12.1/compiler.32688072_0
nvidia-smi (Sun May 26 00:17:50 2024):
NVIDIA-SMI 555.85, Driver Version: 555.85, CUDA Version: 12.5
GPU 0: NVIDIA GeForce RTX 3090 (WDDM), 38C, P8, 19W / 350W, 1076MiB / 24576MiB, 9% util
The zero loss may be a version mismatch between the peft and transformers packages; try upgrading both to the latest releases. As for the final RuntimeError, it may be that the 3090 cannot train in bf16 even though it can load the model in bf16, or it may be a Windows issue.
If none of those attempts work, try the image provided in the AutoDL section of the tutorial and see whether it runs there.
Problems on Windows are hard to pin down~
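One concrete version of that suggestion (a sketch, not verified on this exact setup): the 3090 is an Ampere card and does support bf16 math, so switching the Trainer to bf16 mixed precision avoids creating the fp16 GradScaler at all; if you stay on fp16, also load the model with torch_dtype=torch.float16 so no bfloat16 gradients ever reach the scaler.

import torch
from transformers import TrainingArguments

use_bf16 = torch.cuda.is_bf16_supported()  # True on Ampere cards such as the 3090

args = TrainingArguments(
    output_dir="./output/llama3",
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
    logging_steps=10,
    num_train_epochs=16,
    save_steps=300,
    learning_rate=1e-4,
    save_on_each_node=True,
    gradient_checkpointing=True,
    bf16=use_bf16,      # bf16 mode uses no GradScaler, so the failing kernel is never called
    fp16=not use_bf16,  # fp16 fallback: then load the model in float16, not bfloat16
)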
I hit this problem too, with the qw7 (Qwen 7B) example, on a V100 with CUDA 11.8 on CentOS.
I hit this too, with the DeepSeek example: after 10 steps the loss is always 0, and running inference after fine-tuning outputs nothing but "!!!!!!!!!!!!!!!!!!!!!!!!".
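Both of these reports show the same zero-loss symptom that the version-mismatch suggestion above targets, so it is worth recording the exact environment before and after upgrading. A minimal sketch:

import peft
import torch
import transformers

# Capture the combination in play; known-good pairings of peft and
# transformers change quickly, so include this in any bug report.
print("torch:", torch.__version__)
print("transformers:", transformers.__version__)
print("peft:", peft.__version__)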
Why is tokenizer.pad_token = tokenizer.eos_token needed?
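For context on that line: the stock Llama 3 tokenizer ships without a pad token, and DataCollatorForSeq2Seq(padding=True) cannot batch variable-length sequences without one ("Asking to pad but the tokenizer does not have a padding token."), so the sample reuses eos as pad. A minimal illustration, assuming the tutorial's model id:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")
print(tokenizer.pad_token)  # None -> padding a batch would raise a ValueError

# Reuse eos as pad; padded label positions are replaced with -100 by
# DataCollatorForSeq2Seq, so they are ignored by the loss anyway.
tokenizer.pad_token = tokenizer.eos_token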
Problem description:
Fine-tuning llama3 8b with peft; the training code essentially follows the sample with minor changes. For the first 10 steps the loss is somewhat high, and after that every logged loss stays at 0.0.
Fine-tuning code:
Package versions:
Log output: