databricks / megablocks


Memory cost increases gradually until OOM #154

Open maobenz opened 3 days ago

maobenz commented 3 days ago

I use the dMoE layer with DeepSpeed or FSDP. At the beginning of training the memory cost is about 33 GB, but as the number of training iterations increases the occupied GPU memory grows little by little until it exceeds 80 GB and OOM occurs. Do you know what the reason is? My MoE config is:

kwargs = {
    "activation_fn": F.silu,
    "mlp_type": "mlp",
    "mlp_impl": "sparse",
    "hidden_size": 2048,
    "ffn_hidden_size": 2048,
    "moe_num_experts": self.moe_experts_num,
    "num_layers": 16,
    "moe_weight_parallelism": False,
    "moe_expert_model_parallelism": False,
    "moe_top_k": self.moe_topk,
    "moe_capacity_factor": 1.25,
    "moe_loss_weight": 0.01,
    "device": "cuda",
    # Handled by FSDP
    "bf16": False,
    "fp16": False,
    "bias": False,
    "return_bias": False,
    "shared_expert": False,
    "moe_lbl_in_fp32": False,
}
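[Editor's note] For context, a minimal sketch of how a config like this is typically turned into a layer, assuming the standard megablocks entry points (megablocks.layers.arguments.Arguments and megablocks.layers.dmoe.dMoE); the literal expert count and top-k values below are placeholders for self.moe_experts_num and self.moe_topk:

import torch
import torch.nn.functional as F

from megablocks.layers.arguments import Arguments
from megablocks.layers.dmoe import dMoE

# Placeholder values standing in for self.moe_experts_num / self.moe_topk above.
moe_experts_num = 8
moe_topk = 2

args = Arguments(
    activation_fn=F.silu,
    mlp_type="mlp",
    mlp_impl="sparse",
    hidden_size=2048,
    ffn_hidden_size=2048,
    moe_num_experts=moe_experts_num,
    num_layers=16,
    moe_weight_parallelism=False,
    moe_expert_model_parallelism=False,
    moe_top_k=moe_topk,
    moe_capacity_factor=1.25,
    moe_loss_weight=0.01,
    device="cuda",
    bf16=False,
    fp16=False,
    bias=False,
    return_bias=False,
    shared_expert=False,
    moe_lbl_in_fp32=False,
)

# One dMoE block; in a full model this replaces the FFN in each transformer layer.
layer = dMoE(args)
x = torch.randn(1, 512, 2048, device="cuda")
y = layer(x)  # with return_bias=False the forward pass returns a single tensor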

I use FSDP to train a Phi model in a multi-GPU, multi-node environment, with PyTorch 2.3.1 and Python 3.9. After every 500 iterations the memory cost increases by about 1 GB.

mvpatel2000 commented 2 hours ago

This is likely because of uneven routing. What does your load-balancing loss look like?
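[Editor's note] A hedged sketch of how the load-balancing loss could be logged each step, assuming megablocks' batched_load_balancing_loss and clear_load_balancing_loss helpers in megablocks.layers.moe; the model, optimizer, data loader, and log interval below are placeholders, not part of the original report:

import torch
import torch.nn.functional as F

from megablocks.layers.moe import batched_load_balancing_loss, clear_load_balancing_loss


def train(model, args, optimizer, loader, log_every=100):
    """Training loop sketch that logs the MoE load-balancing loss; all arguments are placeholders."""
    for step, (inputs, targets) in enumerate(loader):
        optimizer.zero_grad()
        logits = model(inputs)
        task_loss = F.cross_entropy(logits.view(-1, logits.size(-1)), targets.view(-1))

        # Aggregates the auxiliary load-balancing loss saved by each dMoE layer
        # during the forward pass, using the same Arguments object passed to the layers.
        lbl = batched_load_balancing_loss(args)
        (task_loss + lbl).backward()
        optimizer.step()

        # Drop the per-step router statistics so they do not carry over to the next step.
        clear_load_balancing_loss()

        if step % log_every == 0:
            print(f"step {step}: task_loss={task_loss.item():.4f} lbl={lbl.item():.4f}")

Watching the lbl value over time would show whether routing is collapsing onto a few experts, which is the "uneven routing" hypothesis raised above.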