-
Just wanted to let you know that I have made a more generic implementation of gradient accumulation (GA), which wraps around the entire model without having to modify the optimizer itself. Very simple concept and easy to i…
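For readers landing here, a minimal sketch of how such a model/optimizer wrapper can look in plain PyTorch; this is an illustration of the idea, not the implementation referenced above, and every name and value is made up:

```python
import torch
from torch import nn

class GradientAccumulator:
    """Hypothetical wrapper illustrating the idea: scale each loss and defer
    optimizer.step() until accum_steps backward passes have accumulated."""

    def __init__(self, model, optimizer, accum_steps):
        self.model = model
        self.optimizer = optimizer
        self.accum_steps = accum_steps
        self._count = 0

    def backward(self, loss):
        # Scale so the accumulated gradient matches a full-batch average.
        (loss / self.accum_steps).backward()
        self._count += 1
        if self._count % self.accum_steps == 0:
            self.optimizer.step()
            self.optimizer.zero_grad()

# Usage sketch with placeholder values:
model = nn.Linear(10, 1)
ga = GradientAccumulator(model, torch.optim.SGD(model.parameters(), lr=0.1), accum_steps=4)
for _ in range(8):
    x, y = torch.randn(8, 10), torch.randn(8, 1)
    ga.backward(nn.functional.mse_loss(model(x), y))
```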
-
It would be good to have this implemented.
-
```python
import logging
import os
import json
import torch
from datasets import load_from_disk
from transformers import TrainingArguments
from trl import SFTTrainer
from unsloth import FastLanguageModel…
```
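For context on where gradient accumulation plugs into this stack: `TrainingArguments` exposes `gradient_accumulation_steps`, which `SFTTrainer` honors. A minimal sketch with placeholder values, not the original script's settings:

```python
from transformers import TrainingArguments

# Placeholder hyperparameters for illustration only.
args = TrainingArguments(
    output_dir="outputs",             # hypothetical path
    per_device_train_batch_size=2,    # micro batch per device (assumed)
    gradient_accumulation_steps=8,    # effective per-device batch = 2 * 8
    num_train_epochs=1,
    logging_steps=10,
)
# args is then passed to SFTTrainer(..., args=args).
```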
-
Is there currently support for gradient accumulation? If not, do you have any hints on how/where I can implement it in this project?
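In case it helps, the core pattern in a plain PyTorch loop is to scale the loss and only step the optimizer every N batches; a self-contained toy sketch (none of these names come from this project):

```python
import torch
from torch import nn

# Toy setup; all names and sizes are placeholders.
model = nn.Linear(10, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.MSELoss()
data = [(torch.randn(8, 10), torch.randn(8, 1)) for _ in range(16)]

accum_steps = 4  # assumed value
optimizer.zero_grad()
for step, (inputs, targets) in enumerate(data):
    loss = loss_fn(model(inputs), targets)
    (loss / accum_steps).backward()      # accumulate scaled gradients
    if (step + 1) % accum_steps == 0:    # update once every accum_steps batches
        optimizer.step()
        optimizer.zero_grad()
```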
-
### Feature request
🤗 Accelerate has a gradient accumulation wrapper, and the `no_trainer` scripts should be updated to include it!
An example can be seen [here](https://github.com/huggingface/…
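For reference, the wrapper follows the pattern below (a sketch based on the Accelerate docs; the toy model, optimizer, and data are assumptions):

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset
from accelerate import Accelerator

accelerator = Accelerator(gradient_accumulation_steps=4)

# Toy model and data; placeholders only.
model = nn.Linear(10, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
dataloader = DataLoader(TensorDataset(torch.randn(64, 10), torch.randn(64, 1)), batch_size=8)

model, optimizer, dataloader = accelerator.prepare(model, optimizer, dataloader)

for inputs, targets in dataloader:
    # accumulate() defers the real optimizer step (and, on multi-GPU,
    # the gradient sync) until enough micro-batches have been seen.
    with accelerator.accumulate(model):
        loss = nn.functional.mse_loss(model(inputs), targets)
        accelerator.backward(loss)
        optimizer.step()
        optimizer.zero_grad()
```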
-
Hi! While training on multiple GPUs with gradient accumulation steps > 1, there is no substantial speedup relative to a single GPU (there is a speedup if the value is equal to 1). I found the followin…
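In case it matches what you found: with DDP, every backward pass triggers a cross-GPU gradient all-reduce, so accumulation steps pay the communication cost without performing an optimizer step. A hedged sketch of the usual mitigation with `no_sync()`, runnable here as a single-process toy (the model, data, and accumulation value are assumptions):

```python
import os
import contextlib
import torch
import torch.distributed as dist
from torch import nn
from torch.nn.parallel import DistributedDataParallel as DDP

# Minimal single-process setup so the sketch runs as-is; in practice this
# is launched with torchrun across multiple ranks.
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")
dist.init_process_group("gloo", rank=0, world_size=1)

model = DDP(nn.Linear(10, 1))
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.MSELoss()
data = [(torch.randn(8, 10), torch.randn(8, 1)) for _ in range(16)]

accum_steps = 4  # assumed value
optimizer.zero_grad()
for step, (inputs, targets) in enumerate(data):
    is_update_step = (step + 1) % accum_steps == 0
    # Skip the cross-rank gradient all-reduce on pure accumulation steps;
    # without this, every backward pays the communication cost.
    ctx = contextlib.nullcontext() if is_update_step else model.no_sync()
    with ctx:
        loss = loss_fn(model(inputs), targets)
        (loss / accum_steps).backward()
    if is_update_step:
        optimizer.step()
        optimizer.zero_grad()

dist.destroy_process_group()
```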
-
I noticed in the supplementary material that the number of steps is 50,000, but in `main.py`, `steps_per_epoch=500`. Is this a mistake? Additionally, the `batch_size` and `gradi…
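For what it's worth, the two numbers can be reconciled arithmetically; a quick check where only the two quoted values come from the thread and the epoch count is inferred, not confirmed:

```python
steps_per_epoch = 500   # from main.py
total_steps = 50_000    # from the supplementary material

# The two agree only if training runs total_steps / steps_per_epoch epochs.
num_epochs = total_steps // steps_per_epoch
print(num_epochs)  # -> 100
```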
-
@awaelchli I found that in `pretrain.py`, the accumulation steps are calculated from the global batch size, the number of devices, and the micro batch size.
This works fine in a single-node setting, e.g. glo…
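A minimal sketch of that calculation and the divisibility constraint it implies (the variable names and values are mine, not the actual code in `pretrain.py`):

```python
# Assumed example values, not the script's defaults.
global_batch_size = 512
micro_batch_size = 4   # per-device batch
num_devices = 8        # e.g. one node with 8 GPUs

# devices * micro_batch * accumulation must reproduce the global batch.
assert global_batch_size % (num_devices * micro_batch_size) == 0, (
    "global batch must be divisible by devices * micro batch; "
    "multi-node setups can silently violate this"
)
gradient_accumulation_steps = global_batch_size // (num_devices * micro_batch_size)
print(gradient_accumulation_steps)  # -> 16
```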
-
I am trying to scale from single-GPU to multi-node distributed fine-tuning for the Llama3-70B and Llama3-8B models.
Below is my training configuration (a sketch of the batch-size arithmetic follows it):
SFT (Llama3 8B & 70B)
Epochs: 3
Gradient Accumulatio…
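As mentioned above, a sketch of the batch-size arithmetic when scaling from a single GPU to multiple nodes; every number below is a placeholder, not my actual configuration:

```python
micro_batch_size = 2          # per-device batch size (assumed)
target_effective_batch = 128  # global batch to keep constant (assumed)

for num_devices in (1, 8, 32):  # single GPU, one node, four nodes (assumed)
    grad_accum = target_effective_batch // (num_devices * micro_batch_size)
    print(f"{num_devices} device(s): gradient_accumulation_steps={grad_accum} "
          f"-> effective batch {num_devices * micro_batch_size * grad_accum}")
```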
-
Hey guys!
I am about to pretrain a monolingual model using T5X (thank you for this!).
The routine I'll be following is based on the ByT5 paper. However, I currently have access to a smaller TPU (v3-…