MARIO-Math-Reasoning / Super_MARIO

MIT License

type of template for training #17

Closed vgaraujov closed 1 month ago

vgaraujov commented 1 month ago

Hello!

Thanks for sharing the details of your implementation. I'm wondering which LLaMA-Factory template you used for your fine-tuning: alpaca, deepseek, or maybe a custom one?

Also, did you fine-tune the full LLM?

Chen-GX commented 1 month ago

Yes, we fine-tune the full LLM.

For the template, we use vanilla. Here is the code in detail.

_register_template(
    name="vanilla",
    format_separator=EmptyFormatter(slots=["\n"]),
    efficient_eos=True,
)

However, I have noticed that the vanilla template has been removed in the latest version of LLaMA-Factory. You can add this template back yourself or use another template, depending on your base model.
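
If you want to add it back, a minimal sketch of the edit might look like the following. The module path and the exact _register_template signature vary between LLaMA-Factory releases, so treat the paths below as assumptions and adapt them to your checkout:

# Sketch: re-registering the removed "vanilla" template inside LLaMA-Factory's
# template module (e.g. src/llmtuner/data/template.py in older releases or
# src/llamafactory/data/template.py in newer ones), next to the other
# _register_template(...) calls.
from .formatter import EmptyFormatter  # EmptyFormatter lives in the sibling formatter module

_register_template(
    name="vanilla",
    format_separator=EmptyFormatter(slots=["\n"]),
    efficient_eos=True,
)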

vgaraujov commented 1 month ago

Hey @Chen-GX, thanks for your prompt answer.

I have an additional question. I'm trying to replicate your training process. I can already train the model, but I can't fit more than a batch size of 10 on an 80GB A100 GPU, so I can't reach the reported batch size of 1024 with 8xA100.

I'm already using bf16, deepspeed, and flash attention. Is 1024 reached with gradient accumulation? Did you decrease cutoff_len?

Chen-GX commented 1 month ago

For full fine-tuning of a 7B LLM, you should train the model on at least 4 A100 (80GB) GPUs.

Do you use stage 2 for DeepSpeed?

Here are some parameters for your reference:

GPU_NUM=8
per_device_train_batch_size=8
gradient_accumulation_steps=16

--max_length 1024 
--cutoff_len 1024 
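
With these settings, the effective batch size works out to the reported 1024: 8 GPUs × per_device_train_batch_size 8 × gradient_accumulation_steps 16 = 1024.
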
vgaraujov commented 1 month ago

Thanks @Chen-GX. Yeah, I think I have to use gradient accumulation. I am using stage 3 for DeepSpeed; with stage 2 I run out of memory, and I have not checked why yet. I suppose using stage 3 should lead to similar results, right?

Chen-GX commented 1 month ago

I have not tried stage 3, but I think the results should be similar. I am curious why stage 2 results in OOM; I have always used stage 2 for training without any problem.
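
For reference, the stage here refers to the ZeRO stage in the DeepSpeed config passed to the trainer via --deepspeed. A minimal sketch of such a config is below; this is not the exact file used for this repo, the ds_config_zero2.json filename is just an example, and the "auto" values are resolved by the HuggingFace Trainer:

import json

# Minimal sketch of a DeepSpeed ZeRO stage-2 config for bf16 training.
# "auto" values are filled in by the HuggingFace Trainer from its own training arguments.
ds_config = {
    "train_micro_batch_size_per_gpu": "auto",
    "gradient_accumulation_steps": "auto",
    "gradient_clipping": "auto",
    "bf16": {"enabled": "auto"},
    "zero_optimization": {
        "stage": 2,  # change to 3 for the ZeRO-3 setup discussed above
        "overlap_comm": True,
        "contiguous_gradients": True,
        "reduce_scatter": True,
    },
}

# Write the config to disk so it can be passed to the training script via --deepspeed.
with open("ds_config_zero2.json", "w") as f:
    json.dump(ds_config, f, indent=2)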

vgaraujov commented 1 month ago

@Chen-GX I can confirm that stage 3 leads to similar results :). I haven't explored stage 2 and the OOM issue yet.

On another note, I wonder whether you used the llama template for your Llama3 experiments, or the vanilla one as well?

Chen-GX commented 1 month ago

Congratulations👏. I use vanilla for llama3.