The README mentions:

> The SFTTrainer version has to run with a lower batch size (4 vs 8), so we only do 2 gradient accumulation steps vs 4 in the QLoRA+FSDP version.

Is this reversed? If the batch size is smaller with SFTTrainer, wouldn't you use *more* gradient accumulation steps, to keep the effective batch size the same?
Separately, I note that the SFTTrainer and FSDP runs take the same wall-clock time on the graph shown. I assume the SFTTrainer run is using DDP, so shouldn't it be quite a bit slower? Perhaps even close to 2x slower, since the smaller batch size means more forward passes are required?
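For concreteness, here is a minimal sketch of the arithmetic behind both questions. The GPU count and dataset size are hypothetical placeholders, not values from the repo; only the per-device batch sizes (4 vs 8) and accumulation steps (2 vs 4) come from the README.

```python
# Minimal sketch of the batch-size arithmetic behind the questions above.
# NUM_GPUS and DATASET_SIZE are hypothetical, not values from the repo.
NUM_GPUS = 2
DATASET_SIZE = 10_000

def effective_batch(per_device_batch: int, grad_accum_steps: int, num_gpus: int) -> int:
    """Global (effective) batch size per optimizer step."""
    return per_device_batch * grad_accum_steps * num_gpus

def micro_steps_per_epoch(per_device_batch: int, num_gpus: int, dataset_size: int) -> int:
    """Forward/backward micro-steps each GPU runs per epoch."""
    return dataset_size // (per_device_batch * num_gpus)

# Configs as described in the README:
print(effective_batch(4, 2, NUM_GPUS))  # SFTTrainer:  4 * 2 * 2 = 16
print(effective_batch(8, 4, NUM_GPUS))  # QLoRA+FSDP:  8 * 4 * 2 = 64

# To match the FSDP run's effective batch size of 64 at per-device batch 4,
# the SFTTrainer run would need 8 accumulation steps, not 2:
print(effective_batch(4, 8, NUM_GPUS))  # 4 * 8 * 2 = 64

# Halving the per-device batch also doubles the micro-steps per epoch,
# which is why one might expect the smaller-batch run to take longer:
print(micro_steps_per_epoch(4, NUM_GPUS, DATASET_SIZE))  # 1250
print(micro_steps_per_epoch(8, NUM_GPUS, DATASET_SIZE))  # 625
```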