I use dmoe with DeepSpeed or FSDP. I find that at the beginning, the memory usage is about 33 GB. As training progresses, the occupied GPU memory creeps up until it finally exceeds the 80 GB of GPU memory and an OOM error occurs.
Do you know what the reason might be?
My MoE config is:
I use FSDP to train a Phi model in a multi-GPU, multi-node environment, with PyTorch 2.3.1 and Python 3.9.
After every 500 iterations, the memory usage increases by about 1 GB.
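One way to narrow this down is to log allocated vs. reserved CUDA memory every N steps, which separates growth in live tensors from growth in PyTorch's caching allocator. Below is a minimal sketch, assuming a standard PyTorch training loop; `train_step` and `data_loader` are hypothetical placeholders, not part of the original setup:

```python
import torch

def log_cuda_memory(step: int, device: int = 0) -> None:
    """Print allocated vs. reserved CUDA memory so that growth in live
    tensors can be told apart from growth in the allocator's cache."""
    gib = 2 ** 30
    allocated = torch.cuda.memory_allocated(device) / gib
    reserved = torch.cuda.memory_reserved(device) / gib
    peak = torch.cuda.max_memory_allocated(device) / gib
    print(f"step {step}: allocated={allocated:.2f} GiB, "
          f"reserved={reserved:.2f} GiB, peak={peak:.2f} GiB")

# Hypothetical usage inside the training loop:
# for step, batch in enumerate(data_loader):
#     loss = train_step(batch)
#     if step % 100 == 0:
#         log_cuda_memory(step)
```

If `allocated` stays flat while `reserved` climbs, the growth is inside the caching allocator (fragmentation); if `allocated` itself climbs, some tensor is being retained across iterations, e.g. accumulating a `loss` tensor for logging without calling `.item()` keeps its whole autograd graph alive.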