Is there an existing issue for this?
Current Behavior
I have wrapped the inference code in a gradient-free (no-grad) context, but GPU memory usage of the chatglm2 model keeps growing during long-running inference. How can this be resolved?
Expected Behavior
The model should be able to run inference for a long time without GPU memory continuing to grow.
Steps To Reproduce
1. Run in the environment listed below.
2. Use the model's default parameters.
3. Run the model inference code for a long time (a minimal sketch of the loop is shown after these steps).
4. Observe in the logs that GPU memory keeps increasing.
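For context, this is a minimal sketch of what such a long-running, gradient-free inference loop might look like with the standard chatglm2-6b usage; the model id, prompts, and memory logging are illustrative placeholders, not the original code:

```python
# Minimal sketch (assumed setup, not the original code): long-running,
# gradient-free inference with chatglm2 via transformers.
import torch
from transformers import AutoModel, AutoTokenizer

MODEL_PATH = "THUDM/chatglm2-6b"  # placeholder model id/path
tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH, trust_remote_code=True)
model = AutoModel.from_pretrained(MODEL_PATH, trust_remote_code=True).half().cuda()
model.eval()

prompts = ["你好", "Explain the attention mechanism."]  # placeholder inputs

with torch.no_grad():  # gradient-free calculation around the inference module
    while True:  # run for a long time with default generation parameters
        for prompt in prompts:
            response, _ = model.chat(tokenizer, prompt, history=[])
            # GPU memory allocation keeps growing over time in this setup
            print(f"{torch.cuda.memory_allocated() / 1024 ** 2:.1f} MiB allocated")
```

The memory readout at the end of each iteration is only there to make the growth described above visible in the log.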
Environment
- OS: Ubuntu 20.04
- Python: 3.8
- Transformers: 4.26.1
- PyTorch: 1.13 or later
- CUDA Support (`python -c "import torch; print(torch.cuda.is_available())"`): True
Anything else?
Nothing else.