The README mentions:

> The SFTTrainer version has to run with a lower batch size (4 vs 8), so we only do 2 gradient accumulation steps vs 4 in the QLoRA+FSDP version.

Is this reversed? If the batch size is smaller with SFTTrainer, wouldn't you use *more* gradient accumulation steps, to keep the effective batch size the same?
Separately, I note that the SFTTrainer and FSDP runs take the same wall-clock time on the graph shown. I assume the SFTTrainer run is using DDP, so shouldn't it be quite a bit slower? Perhaps even close to 2x slower, since the smaller batch size means more forward passes are required?
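For concreteness, here is a minimal sketch of the arithmetic behind both questions. The GPU count and dataset size are hypothetical placeholders, not values from the repo; only the per-device batch sizes (4 vs 8) and accumulation steps (2 vs 4) come from the README.

```python
# Minimal sketch of the batch-size arithmetic behind the questions above.
# NUM_GPUS and DATASET_SIZE are hypothetical, not values from the repo.
NUM_GPUS = 2
DATASET_SIZE = 10_000

def effective_batch(per_device_batch: int, grad_accum_steps: int, num_gpus: int) -> int:
    """Global (effective) batch size per optimizer step."""
    return per_device_batch * grad_accum_steps * num_gpus

def micro_steps_per_epoch(per_device_batch: int, num_gpus: int, dataset_size: int) -> int:
    """Forward/backward micro-steps each GPU runs per epoch."""
    return dataset_size // (per_device_batch * num_gpus)

# Configs as described in the README:
print(effective_batch(4, 2, NUM_GPUS))  # SFTTrainer:  4 * 2 * 2 = 16
print(effective_batch(8, 4, NUM_GPUS))  # QLoRA+FSDP:  8 * 4 * 2 = 64

# To match the FSDP run's effective batch size of 64 at per-device batch 4,
# the SFTTrainer run would need 8 accumulation steps, not 2:
print(effective_batch(4, 8, NUM_GPUS))  # 4 * 8 * 2 = 64

# Halving the per-device batch also doubles the micro-steps per epoch,
# which is why one might expect the smaller-batch run to take longer:
print(micro_steps_per_epoch(4, NUM_GPUS, DATASET_SIZE))  # 1250
print(micro_steps_per_epoch(8, NUM_GPUS, DATASET_SIZE))  # 625
```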