Closed by TheRootOf3 3 days ago
DeepSpeed + Transformers step-by-step guide: https://huggingface.co/docs/transformers/main/deepspeed?install=Transformers
Update as of 2024-09-12: the 8B model does not fit with the current method under full sharding without offloading. We identified the cause: we use 2 separate models, and Accelerate does not support sharding both of them. We will therefore proceed by pre-computing the normal (baseline) responses, so the baseline model is not needed at all during unlearning. Tracked in #137 and #138.
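A minimal sketch of the pre-computation idea: run the baseline model once, offline, cache its responses, and have the unlearning loop read from the cache so the second model never occupies GPU memory. `baseline_model`, `precompute_normal_responses`, and `unlearn_step` are hypothetical stand-ins, not the repo's actual code.

```python
def baseline_model(prompt: str) -> str:
    # Stand-in for an expensive forward pass of the frozen reference model.
    return f"normal response to: {prompt}"

def precompute_normal_responses(prompts):
    """Run the baseline model once, ahead of time, and persist its outputs."""
    return {p: baseline_model(p) for p in prompts}

def unlearn_step(prompt: str, cache: dict) -> str:
    # During unlearning, look up the cached response instead of loading
    # the baseline model at all; only the unlearned model stays on GPU.
    reference = cache[prompt]
    return reference  # would feed the KL / regularization term of the loss
```

In practice the cache would be written to disk (e.g. with `torch.save` or a dataset file) between the pre-compute pass and the unlearning run.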
The issue with offloading is very inefficient GPU utilization (more than 90% of the unlearning time is spent offloading and reloading memory). Instead, we could parallelize the unlearning across 2 or 4 GPUs by sharding the model parameters with an approach such as FSDP, or another method (e.g. DeepSpeed with ZeRO sharding).
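For reference, sharding across 2 GPUs via Accelerate's FSDP integration might look roughly like the config below. This is a sketch, not a verified config: exact key names vary between Accelerate versions, so it should be generated with `accelerate config` and checked against the Accelerate FSDP docs.

```yaml
compute_environment: LOCAL_MACHINE
distributed_type: FSDP
num_processes: 2                 # shard parameters across 2 GPUs
mixed_precision: bf16
fsdp_config:
  fsdp_sharding_strategy: FULL_SHARD
  fsdp_auto_wrap_policy: TRANSFORMER_BASED_WRAP
  fsdp_offload_params: false     # keep shards on-GPU to avoid the offloading bottleneck
```

The run would then be launched with something like `accelerate launch --config_file fsdp.yaml unlearn.py` (`fsdp.yaml` and `unlearn.py` are placeholder names).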