Closed by TheRootOf3 3 days ago
DeepSpeed + Transformers step-by-step guide: https://huggingface.co/docs/transformers/main/deepspeed?install=Transformers
Update as of 2024-09-12: the 8B model does not fit with the current method under full sharding without offloading. We identified the cause: we use 2 separate models, and Accelerate does not support sharding both of them. We will therefore proceed by pre-computing the normal (baseline) responses, so the baseline model is not needed at all during unlearning. Tracked in #137 and #138.
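A minimal sketch of the pre-computation idea: run the baseline model once, offline, cache its responses, and have the unlearning loop read from the cache so the second model never occupies GPU memory. `baseline_model`, `precompute_normal_responses`, and `unlearn_step` are hypothetical stand-ins, not the repo's actual code.

```python
def baseline_model(prompt: str) -> str:
    # Stand-in for an expensive forward pass of the frozen reference model.
    return f"normal response to: {prompt}"

def precompute_normal_responses(prompts):
    """Run the baseline model once, ahead of time, and persist its outputs."""
    return {p: baseline_model(p) for p in prompts}

def unlearn_step(prompt: str, cache: dict) -> str:
    # During unlearning, look up the cached response instead of loading
    # the baseline model at all; only the unlearned model stays on GPU.
    reference = cache[prompt]
    return reference  # would feed the KL / regularization term of the loss
```

In practice the cache would be written to disk (e.g. with `torch.save` or a dataset file) between the pre-compute pass and the unlearning run.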
The issue with offloading is very inefficient GPU utilization (more than 90% of the unlearning time is spent offloading and reloading memory). Instead, we could parallelize the unlearning across 2 or 4 GPUs by sharding the model parameters with an approach such as FSDP, or another method (e.g. DeepSpeed with ZeRO sharding).
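For reference, sharding across 2 GPUs via Accelerate's FSDP integration might look roughly like the config below. This is a sketch, not a verified config: exact key names vary between Accelerate versions, so it should be generated with `accelerate config` and checked against the Accelerate FSDP docs.

```yaml
compute_environment: LOCAL_MACHINE
distributed_type: FSDP
num_processes: 2                 # shard parameters across 2 GPUs
mixed_precision: bf16
fsdp_config:
  fsdp_sharding_strategy: FULL_SHARD
  fsdp_auto_wrap_policy: TRANSFORMER_BASED_WRAP
  fsdp_offload_params: false     # keep shards on-GPU to avoid the offloading bottleneck
```

The run would then be launched with something like `accelerate launch --config_file fsdp.yaml unlearn.py` (`fsdp.yaml` and `unlearn.py` are placeholder names).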