Hi there!
I am currently trying to implement the training recipe from your Snowflake report. I have access to the same hardware (8xH100s), but I am struggling to match the reported batch sizes.
The report never mentions gradient checkpointing, but it seems like it would be needed to make the extremely large batch sizes possible. Could you confirm whether gradient checkpointing was used?
Thank you.