lolikonloli opened this issue 10 months ago

Describe the issue

Memory usage continuously increases during DINO training on two RTX 2080 Ti GPUs until the process is killed by the system.
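One way to confirm that the allocator footprint really grows over time, rather than spiking once, is to log CUDA memory statistics every few hundred iterations. Below is a minimal sketch using standard `torch.cuda` APIs; the function name and the idea of calling it from the training loop are illustrative additions, not part of the original report.

```python
import torch

def log_gpu_memory(step: int) -> None:
    """Print allocator stats for every visible GPU; call periodically from the training loop."""
    for device_id in range(torch.cuda.device_count()):
        allocated_mib = torch.cuda.memory_allocated(device_id) / 1024**2
        reserved_mib = torch.cuda.memory_reserved(device_id) / 1024**2
        print(f"step {step} | cuda:{device_id} | "
              f"allocated={allocated_mib:.0f} MiB | reserved={reserved_mib:.0f} MiB")
```

If the `allocated` value keeps climbing across epochs, that usually points to a genuine leak; if only `reserved` fluctuates, it is more likely the batch-to-batch variation described in the reply below.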
Device info
sys.platform             linux
Python                   3.10.0 (default, Mar 3 2022, 09:58:08) [GCC 7.5.0]
numpy                    1.22.4
detectron2               0.6 @/home/lolikonloli/code/detection/package/detrex/detectron2/detectron2
Compiler                 GCC 11.4
CUDA compiler            CUDA 11.8
detectron2 arch flags    7.5
DETECTRON2_ENV_MODULE
PyTorch 2.0.1+cu118 @/home/lolikonloli/anaconda3/envs/pl_det/lib/python3.10/site-packages/torch
PyTorch debug build False
GPU available Yes
GPU 0,1 NVIDIA GeForce RTX 2080 Ti (arch=7.5)
Driver version 535.104.05
CUDA_HOME /usr/local/cuda-11.8
Pillow 9.3.0
torchvision 0.15.2+cu118 @/home/lolikonloli/anaconda3/envs/pl_det/lib/python3.10/site-packages/torchvision
torchvision arch flags 3.5, 5.0, 6.0, 7.0, 7.5, 8.0, 8.6
fvcore 0.1.5.post20221221
iopath 0.1.9
cv2 4.8.0
PyTorch built with:
Reply:

Hello, this is normal. Because of multi-scale training and the denoising queries, the model's memory usage is not stable, and it may take more than 12 GB, which exceeds what a 2080 Ti provides. You can try fp16 training or lowering the total_batch_size to work around this issue, as in the sketch below.
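A hypothetical override config for these two suggestions is sketched below. It assumes the stock detrex DINO LazyConfig, where `train.amp.enabled` toggles mixed precision and `dataloader.train.total_batch_size` sets the global batch size; the import path varies between detrex versions, so check it against your checkout.

```python
# low_memory_dino.py -- hypothetical override config; the import path below is
# an assumption and may differ between detrex versions.
from projects.dino.configs.dino_r50_4scale_12ep import (
    dataloader,
    lr_multiplier,
    model,
    optimizer,
    train,
)

# Mixed-precision (fp16) training: activations are stored in half precision,
# which roughly halves activation memory.
train.amp.enabled = True

# Lower the global batch size; with 2 GPUs this means 1 image per GPU.
dataloader.train.total_batch_size = 2
```

This would be launched the usual way, e.g. `python tools/train_net.py --config-file low_memory_dino.py --num-gpus 2`.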
Alternatively, you can add activation checkpointing to reduce the memory usage of the whole model, as in the second sketch below.
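One generic way to add activation checkpointing is to wrap the heaviest modules (typically the transformer encoder layers) with `torch.utils.checkpoint`, so their activations are recomputed during backward instead of being stored. This is a minimal sketch; the `CheckpointedLayer` wrapper is an illustrative name, not a detrex API.

```python
import torch.nn as nn
from torch.utils.checkpoint import checkpoint

class CheckpointedLayer(nn.Module):
    """Recomputes the wrapped layer's activations in backward instead of storing them."""

    def __init__(self, layer: nn.Module):
        super().__init__()
        self.layer = layer

    def forward(self, *args):
        # use_reentrant=False selects the non-reentrant implementation,
        # recommended on recent PyTorch releases (available in 2.0.1).
        return checkpoint(self.layer, *args, use_reentrant=False)
```

Wrapping each encoder layer in place, e.g. `layers = nn.ModuleList(CheckpointedLayer(l) for l in layers)`, trades roughly a third more compute time per step for substantially lower activation memory; the exact attribute path to the layers depends on the model's module tree.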