We conducted all experiments on servers with 8 A100 GPUs. Memory consumption ranges from 26 GB to 36 GB per GPU depending on the setting; less GPU memory is needed when torch.utils.checkpoint is applied to more modules. Training time ranges from 3 to 10 days, also depending on the use of torch.utils.checkpoint (more checkpointing means less memory but longer training).
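The memory/time trade-off above comes from gradient (activation) checkpointing: wrapped modules discard their intermediate activations in the forward pass and recompute them during backward. A minimal sketch of how this might look, using a hypothetical toy model (the module names and sizes here are illustrative, not the paper's architecture):

```python
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint

class CheckpointedStack(nn.Module):
    """Toy stack of linear blocks; each block can be checkpointed."""

    def __init__(self, width=64, depth=4, use_checkpoint=True):
        super().__init__()
        self.blocks = nn.ModuleList(
            nn.Sequential(nn.Linear(width, width), nn.ReLU())
            for _ in range(depth)
        )
        self.use_checkpoint = use_checkpoint

    def forward(self, x):
        for block in self.blocks:
            if self.use_checkpoint and self.training:
                # Activations inside `block` are not stored; they are
                # recomputed during backward, saving memory at the cost
                # of extra forward compute.
                x = checkpoint(block, x, use_reentrant=False)
            else:
                x = block(x)
        return x

model = CheckpointedStack()
x = torch.randn(8, 64, requires_grad=True)
loss = model(x).sum()
loss.backward()  # backward recomputes the checkpointed activations
```

Wrapping more blocks in checkpoint lowers peak memory further but lengthens each training step, which matches the 3-to-10-day range reported above.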