Snowflake-Labs / snowflake-arctic

Apache License 2.0

Trouble replicating the training procedure -- Batch size #26

Closed: olivierr42 closed this issue 2 months ago

olivierr42 commented 2 months ago

Hi there!

I am currently trying to implement the training recipe from your Snowflake report. I have access to the same hardware (8xH100s); however, I am struggling to match the reported batch sizes.

The report never mentions gradient checkpointing, but it seems like this is what would make the extremely large batch sizes possible. Could you confirm whether gradient checkpointing was used?
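
To make the intuition concrete, here is a minimal sketch of why checkpointing could free enough memory for a larger batch. It is a simplified accounting model, not the Arctic training code: the function name `peak_stored_activations`, the 36-layer depth, and the assumption that every layer produces one equally sized activation are all hypothetical.

```python
import math
from typing import Optional


def peak_stored_activations(num_layers: int, segment_size: Optional[int] = None) -> int:
    """Count layer activations held in memory at the peak of the backward pass.

    Simplified model (an assumption): each layer yields one activation of equal size.
    Without checkpointing (segment_size=None), all activations are kept.
    With checkpointing, one activation is kept per segment boundary, and one
    segment at a time is recomputed and held during the backward pass.
    """
    if segment_size is None:
        return num_layers
    num_checkpoints = math.ceil(num_layers / segment_size)
    return num_checkpoints + segment_size


# Hypothetical 36-layer model, checkpointing every 6 layers.
baseline = peak_stored_activations(36)
checkpointed = peak_stored_activations(36, segment_size=6)
print(baseline, checkpointed)
```

Under these assumptions the checkpointed run holds 12 activations instead of 36, so roughly three times the per-device batch would fit in the same activation budget, at the cost of recomputing each segment's forward pass once during the backward pass.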

Thank you.