Closed azton closed 1 year ago
Short of copying in every HF model (no), we need to change the model initialization to wrap transformer blocks in activation checkpoints. This seems like it should work the same way as the examples for FSDP.
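A minimal sketch of what that wrapping could look like, assuming the same `apply_activation_checkpointing` utility the PyTorch FSDP examples use. Note that it lives under the private `torch.distributed.algorithms._checkpoint` path, so it may move between releases. `ToyModel` and `wrap_blocks` are hypothetical names for illustration, and `nn.TransformerEncoderLayer` stands in for whatever block class a given HF model uses; none of this is taken from the repo itself.

```python
import functools

import torch
import torch.nn as nn
from torch.distributed.algorithms._checkpoint.checkpoint_wrapper import (
    CheckpointImpl,
    CheckpointWrapper,
    apply_activation_checkpointing,
    checkpoint_wrapper,
)


class ToyModel(nn.Module):
    """Hypothetical stand-in for an HF model: a stack of transformer blocks."""

    def __init__(self, d_model=32, n_layers=2):
        super().__init__()
        self.blocks = nn.ModuleList(
            [
                nn.TransformerEncoderLayer(
                    d_model, nhead=4, dim_feedforward=64, batch_first=True
                )
                for _ in range(n_layers)
            ]
        )
        self.head = nn.Linear(d_model, 1)

    def forward(self, x):
        for block in self.blocks:
            x = block(x)
        return self.head(x)


def wrap_blocks(model, block_cls):
    """Wrap every submodule of type `block_cls` in an activation checkpoint.

    The model is modified in place after construction, so the model
    definition itself does not need to be copied or edited.
    """
    wrapper = functools.partial(
        checkpoint_wrapper, checkpoint_impl=CheckpointImpl.NO_REENTRANT
    )
    apply_activation_checkpointing(
        model,
        checkpoint_wrapper_fn=wrapper,
        check_fn=lambda module: isinstance(module, block_cls),
    )
    return model


model = wrap_blocks(ToyModel(), nn.TransformerEncoderLayer)

# Count how many blocks ended up wrapped.
n_wrapped = sum(isinstance(m, CheckpointWrapper) for m in model.modules())

# Forward and backward still work; block activations are recomputed in backward.
x = torch.randn(2, 5, 32)
loss = model(x).sum()
loss.backward()
```

For a real HF model, `block_cls` would be that architecture's own block class, and the call would presumably go right after the model is instantiated and before any FSDP wrapping, as in the FSDP examples.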