Is there an existing issue for this?
Current Behavior
I have wrapped the inference code in a gradient-free (no-grad) context, but GPU memory usage of the chatglm2 model keeps growing during long-running inference. How can this be resolved?
Expected Behavior
The model should be able to run inference for a long time without GPU memory continuing to grow.
Steps To Reproduce
1. Run in the environment listed below.
2. Use the model's default parameters.
3. Run the model inference code for a long time (a minimal sketch of the loop is shown after these steps).
4. Observe in the logs that GPU memory keeps increasing.
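For context, this is a minimal sketch of what such a long-running, gradient-free inference loop might look like with the standard chatglm2-6b usage; the model id, prompts, and memory logging are illustrative placeholders, not the original code:

```python
# Minimal sketch (assumed setup, not the original code): long-running,
# gradient-free inference with chatglm2 via transformers.
import torch
from transformers import AutoModel, AutoTokenizer

MODEL_PATH = "THUDM/chatglm2-6b"  # placeholder model id/path
tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH, trust_remote_code=True)
model = AutoModel.from_pretrained(MODEL_PATH, trust_remote_code=True).half().cuda()
model.eval()

prompts = ["你好", "Explain the attention mechanism."]  # placeholder inputs

with torch.no_grad():  # gradient-free calculation around the inference module
    while True:  # run for a long time with default generation parameters
        for prompt in prompts:
            response, _ = model.chat(tokenizer, prompt, history=[])
            # GPU memory allocation keeps growing over time in this setup
            print(f"{torch.cuda.memory_allocated() / 1024 ** 2:.1f} MiB allocated")
```

The memory readout at the end of each iteration is only there to make the growth described above visible in the log.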
Environment
- OS: Ubuntu 20.04
- Python: 3.8
- Transformers: 4.26.1
- PyTorch: 1.13 or later
- CUDA Support (`python -c "import torch; print(torch.cuda.is_available())"`): True
Anything else?
Nothing else.