This PR adds FSDP no_sync support, which skips gradient synchronization on intermediate micro-batches and synchronizes only on the step that completes gradient accumulation. It also fixes gradient accumulation by truncating the dataset length and correcting the modulus comparison. Finally, it improves logging compatibility with tqdm and updates the README and arguments.
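For reviewers, the accumulation pattern described above can be sketched roughly as follows. This is a minimal illustration and not the code in this PR: the helper names (`truncated_length`, `is_sync_step`, `train_epoch`), the `loss_fn` callback, and the choice to truncate to a whole number of accumulation windows are assumptions. The only library behavior relied on is that an FSDP-wrapped model exposes a `no_sync()` context manager.

```python
from contextlib import nullcontext


def truncated_length(num_batches: int, accum_steps: int) -> int:
    """Assumed truncation rule: drop trailing batches that would not
    fill a whole accumulation window."""
    return (num_batches // accum_steps) * accum_steps


def is_sync_step(step: int, accum_steps: int) -> bool:
    """True on the last micro-batch of each window (step is 0-indexed)."""
    return (step + 1) % accum_steps == 0


def train_epoch(model, optimizer, batches, loss_fn, accum_steps):
    """Hypothetical loop; `model` is assumed to be FSDP-wrapped so that
    `model.no_sync()` is available. Returns the number of optimizer steps."""
    usable = truncated_length(len(batches), accum_steps)
    optimizer_steps = 0
    for step, batch in enumerate(batches[:usable]):
        sync = is_sync_step(step, accum_steps)
        # Skip gradient communication on intermediate micro-batches;
        # gradients are synchronized only on the final one of each window.
        ctx = nullcontext() if sync else model.no_sync()
        with ctx:
            # Scale the loss so the accumulated gradient is an average.
            loss = loss_fn(model, batch) / accum_steps
            loss.backward()
        if sync:
            optimizer.step()
            optimizer.zero_grad()
            optimizer_steps += 1
    return optimizer_steps
```

Comparing `(step + 1) % accum_steps == 0` rather than `step % accum_steps == 0` avoids triggering a sync and optimizer step on the very first micro-batch when `step` starts at 0. Whether this matches the exact comparison fixed in the PR should be checked against the diff.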