Closed: vgaraujov closed this issue 1 month ago.
Yes, we fine-tune the full LLM. For the template, we use vanilla. Here is the code in detail:
_register_template(
    name="vanilla",
    format_separator=EmptyFormatter(slots=["\n"]),
    efficient_eos=True,
)
Note, however, that the vanilla template has been removed in the latest version of LLaMA-Factory. You can add this template back yourself or use another template depending on your base model.
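For intuition, the vanilla template amounts to joining the prompt and the response with a bare newline and no role markers or system prompt. A minimal Python sketch (the function name here is mine, not part of LLaMA-Factory):

```python
# Sketch of what the "vanilla" template effectively does: no system
# prompt, no role markers, prompt and response joined by one newline,
# mirroring format_separator=EmptyFormatter(slots=["\n"]).
def vanilla_format(prompt: str, response: str) -> str:
    return prompt + "\n" + response

print(vanilla_format("Question: What is 1+1?", "Answer: 2"))
```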
Hey @Chen-GX, thanks for your prompt answer.
I have an additional question. I'm trying to replicate your training process. I can already train the model, but I can't fit more than a batch size of 10 on an 80GB A100 GPU, so I can't reach the reported batch size of 1024 with 8×A100.
I'm already using bf16, deepspeed, and flash attention. Is 1024 reached with gradient accumulation? Did you decrease cutoff_len?
For full fine-tuning of a 7B LLM, you should train the model on at least 4×A100 (80GB).
Do you use stage 2 of DeepSpeed?
Here are some parameters for your reference:
GPU_NUM=8
per_device_train_batch_size=8
gradient_accumulation_steps=16
--max_length 1024
--cutoff_len 1024
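As a sanity check, these numbers multiply out to exactly the reported global batch size of 1024:

```python
# Effective global batch size = GPUs x per-device batch x grad accumulation,
# using the reference parameters above.
gpu_num = 8
per_device_train_batch_size = 8
gradient_accumulation_steps = 16

effective_batch_size = gpu_num * per_device_train_batch_size * gradient_accumulation_steps
print(effective_batch_size)  # 1024
```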
Thanks @Chen-GX. Yeah, I think I have to use gradient accumulation. I am using stage 3 for DeepSpeed; with stage 2 I run out of memory, and I have not checked why. I suppose using stage 3 should lead to similar results, right?
I have not tried stage 3, but I think the results should be similar. I am curious why stage 2 results in OOM for you; I have always used stage 2 for training without any problem.
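For anyone comparing setups, a minimal ZeRO stage-2 DeepSpeed config with bf16 looks roughly like this; the values are illustrative assumptions, not the exact file used by either participant in this thread:

```json
{
  "bf16": { "enabled": true },
  "zero_optimization": {
    "stage": 2,
    "overlap_comm": true,
    "contiguous_gradients": true
  },
  "train_micro_batch_size_per_gpu": "auto",
  "gradient_accumulation_steps": "auto"
}
```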
@Chen-GX I can confirm that stage 3 leads to similar results :). I haven't explored stage 2 and the OOM issue yet.
On the other hand, I wonder whether for your Llama 3 experiments you used the llama template or also the vanilla one?
Congratulations 👏. I use vanilla for Llama 3 as well.
Hello!
Thanks for sharing the details of your implementation. I'm wondering which LLaMA-Factory template you used for your fine-tuning: alpaca, deepseek, or maybe a custom one? Also, did you fine-tune the full LLM?