-
Thanks for the great work!
I have some questions about the training configuration.
For the training batch size, I assume that we will collect rollout_batch_size = 1024 trajectories into the repl…
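To make my assumption concrete, here is a rough sketch of how I currently think the two batch sizes relate (the `train_batch_size` value below is just an example, not taken from the repo):

```python
# Sketch of my assumption only; names and values other than
# rollout_batch_size = 1024 are illustrative, not taken from the repo.
rollout_batch_size = 1024   # trajectories collected per rollout phase
train_batch_size = 128      # example minibatch size for gradient updates

# If the collected trajectories are consumed in minibatches of
# train_batch_size, each rollout phase yields this many update steps:
updates_per_rollout = rollout_batch_size // train_batch_size
print(updates_per_rollout)  # -> 8
```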
-
From the Discourse thread https://discourse.julialang.org/t/zygote-gradient-accumulation/55654:
I have a DenseNet-inspired architecture implemented in PyTorch and ported it to Julia. Sadly, I get out of memory …
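For reference, the pattern I am trying to reproduce on the Julia side is the usual PyTorch gradient-accumulation loop, roughly like the sketch below (model, data, and the accumulation factor are made up for illustration):

```python
import torch
from torch import nn

# Minimal sketch of PyTorch-style gradient accumulation; model, data and
# accum_steps are illustrative only.
model = nn.Linear(64, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()
accum_steps = 4  # effective batch = accum_steps * micro-batch size

data = [(torch.randn(8, 64), torch.randn(8, 1)) for _ in range(16)]

optimizer.zero_grad()
for i, (x, y) in enumerate(data):
    loss = loss_fn(model(x), y) / accum_steps  # scale so gradients average
    loss.backward()                            # gradients accumulate in .grad
    if (i + 1) % accum_steps == 0:
        optimizer.step()
        optimizer.zero_grad()
```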
-
Hello GFSLT-VLP,
Thank you for sharing your work. I tried reproducing the results as reported in your paper, specifically by using the VLP Pretrain V2 command and the GFSLT-VLP command on a single …
-
### System Info
NVIDIA A100 80GB GPU
### Who can help?
_No response_
### Information
- [ ] The official example scripts
- [ ] My own modified scripts
### Tasks
- [ ] An officially supported task…
-
### Bug description
At the end of an epoch with `accumulate_grad_batches > 1`, the dataloader may run out of data before the required number of accumulation steps has been reached. The Lightning docs do not say what happens in this case. I…
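To make the edge case concrete, in plain PyTorch the situation looks like the sketch below (illustrative model, data, and numbers); the question is what Lightning does with the leftover batches:

```python
import torch
from torch import nn

# 10 batches with accumulation over 4 leaves 2 leftover batches at the end
# of the epoch; their gradients are computed but never stepped on here.
model = nn.Linear(16, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=1e-2)
loss_fn = nn.MSELoss()
accumulate_grad_batches = 4

batches = [(torch.randn(4, 16), torch.randn(4, 1)) for _ in range(10)]

optimizer.zero_grad()
for i, (x, y) in enumerate(batches):
    (loss_fn(model(x), y) / accumulate_grad_batches).backward()
    if (i + 1) % accumulate_grad_batches == 0:
        optimizer.step()
        optimizer.zero_grad()
# Gradients from batches 9 and 10 are left in .grad without an optimizer step.
```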
-
In my training script, I set **per_device_train_batch_size = 4** in the `TrainingArguments`.
But the **train_batch_size** in the **trainer_state.json** of each checkpoint is **2**.
When I tried …
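For context, this is the relationship I expected between the two values (just the arithmetic I had in mind, not the Trainer's actual implementation; the device count and accumulation setting are assumptions):

```python
# The arithmetic I expected (illustrative; not the Trainer's actual code):
per_device_train_batch_size = 4
num_devices = 1                   # assumption: single-GPU run
gradient_accumulation_steps = 1   # assumption: no accumulation

expected_train_batch_size = (
    per_device_train_batch_size * num_devices * gradient_accumulation_steps
)
print(expected_train_batch_size)  # -> 4, yet trainer_state.json records 2
```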
-
### Describe the bug
Hi,
I have been working on training scripts for multiple models (T2I, IP2P) and found that the logic to calculate `step` and `epoch` while resuming training is different acr…
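As an illustration of the kind of bookkeeping I mean (not copied from any one script; variable names and numbers are made up), the resume logic generally computes something like:

```python
import math

# Illustrative resume bookkeeping; each training script computes these
# quantities slightly differently, which is what this issue is about.
num_batches_per_epoch = 625          # len(train_dataloader), made up
gradient_accumulation_steps = 2
global_step_from_checkpoint = 750    # e.g. parsed from "checkpoint-750"

num_update_steps_per_epoch = math.ceil(
    num_batches_per_epoch / gradient_accumulation_steps
)
first_epoch = global_step_from_checkpoint // num_update_steps_per_epoch
resume_step = global_step_from_checkpoint % num_update_steps_per_epoch
print(first_epoch, resume_step)      # -> 2 124
```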
-
torchrun --nnodes=1 --nproc_per_node=8 --master_port=25001 \
llava/train/train_mem.py \
--model_name_or_path /path/to/checkpoint_llava_med \
--data_path /path/to/your_dental_dataset.jso…
-
Hello,
I saw your code on the internet, and it is very interesting. I cloned it and used it for my project. I tried changing batch_size from 1 to 4 and the backbone from vgg16 to resnet101, but I have a pr…
-
### 🐛 Describe the bug
I want to continue pretraining llama-7b with only 8 A100-80G GPUs, and I want to set the global batch size to 1024, but I can't find a gradient accumulation setting.
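For reference, the arithmetic I want to express is roughly the following (the per-GPU micro-batch size of 4 is just an example of what might fit in 80G):

```python
# Sketch of the relationship I want to configure (micro-batch of 4 is an example):
num_gpus = 8
micro_batch_per_gpu = 4            # whatever fits in 80G for llama-7b
target_global_batch_size = 1024

grad_accumulation_steps = target_global_batch_size // (num_gpus * micro_batch_per_gpu)
print(grad_accumulation_steps)     # -> 32
assert num_gpus * micro_batch_per_gpu * grad_accumulation_steps == 1024
```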
### Environment
_…