Memory allocation problem #94

Open YeSho-cpp opened 10 months ago

YeSho-cpp commented 10 months ago

Hello, sorry to bother you. I am training MaskDINO on a nuclei dataset, but I am running out of memory. I reduced the batch size to 2 and set num_workers to 0, and training now starts, but it is very slow. Setting num_workers to even 1 causes a memory allocation failure. I have two A6000 graphics cards, but I cannot use them together for distributed training, because memory allocation then fails as well. Which parameters should I modify to reduce memory usage?

YeSho-cpp commented 10 months ago

This is my dataset information:

[10/18 11:09:21] INFO: Loading /share/home/ncu10/Code/AI/Point_label/PointWSSIS/cell_data_root/coco/annotations/instances_train2017.json takes 2.70 seconds.
[10/18 11:09:21] INFO: Loaded 432 images in COCO format from /share/home/ncu10/Code/AI/Point_label/PointWSSIS/cell_data_root/coco/annotations/instances_train2017.json
[10/18 11:09:21] INFO: Removed 0 images with no usable annotations. 432 images left.
[10/18 11:09:21] INFO: Distribution of instances among all 80 categories:
category #instances category #instances category #instances
total 17073

[10/18 11:09:21] INFO: Using training sampler TrainingSampler
[10/18 11:09:21] INFO: Serializing the dataset using: <class ''>
[10/18 11:09:21] INFO: Serializing 432 elements to byte tensors and concatenating them all ...
[10/18 11:09:22] INFO: Serialized dataset takes 28.01 MiB

YeSho-cpp commented 10 months ago

Using the ResNet-50 model:

[10/18 11:09:13] detectron2 INFO: Rank of current process: 0. World size: 1
[10/18 11:09:14] detectron2 INFO: Environment info:

sys.platform                     linux
Python                           3.8.15 (default, Nov 24 2022, 15:19:38) [GCC 11.2.0]
numpy                            1.24.4
detectron2                       0.6 @/share/home/ncu10/Code/AI/Point_label/MaskDINO/detectron2/detectron2
Compiler                         GCC 9.4
CUDA compiler                    CUDA 11.4
detectron2 arch flags            8.6
DETECTRON2_ENV_MODULE
PyTorch                          1.13.1 @/share/home/ncu10/miniconda3/envs/py38/lib/python3.8/site-packages/torch
PyTorch debug build              False
torch._C._GLIBCXX_USE_CXX11_ABI  False
GPU available                    Yes
GPU 0                            NVIDIA RTX A6000 (arch=8.6)
Driver version                   470.86
CUDA_HOME                        /share/home/ncu10/CUDA/CUDA11.4
Pillow                           9.5.0
torchvision                      0.14.1 @/share/home/ncu10/miniconda3/envs/py38/lib/python3.8/site-packages/torchvision
torchvision arch flags           3.5, 5.0, 6.0, 7.0, 7.5, 8.0, 8.6
fvcore                           0.1.5.post20221221
iopath                           0.1.9
cv2                              4.8.0

CUDA_VISIBLE_DEVICES=1 python --num-gpus 1 --config-file /share/home/ncu10/Code/AI/Point_label/MaskDINO/configs/coco/instance-segmentation/maskdino_R50_bs16_50ep_3s.yaml MODEL.WEIGHTS /share/home/ncu10/Code/AI/Point_label/MaskDINO/model_file/maskdino_r50_50ep_300q_hid1024_3sd1_instance_maskenhanced_mask46.1ap_box51.5ap.pth
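The command above already passes one trailing KEY VALUE override (MODEL.WEIGHTS), so the memory-related settings mentioned in this thread could be appended the same way. The sketch below only builds such an override string; the helper name `build_overrides` is hypothetical. `SOLVER.IMS_PER_BATCH`, `DATALOADER.NUM_WORKERS`, and `SOLVER.AMP.ENABLED` are standard detectron2 config keys, while `INPUT.IMAGE_SIZE` is assumed to be the key MaskDINO's LSJ data mapper reads, so check it against your config file. The values shown are illustrative, not recommendations from the maintainers.

```python
def build_overrides(batch_size=2, num_workers=0, image_size=None, amp=None):
    """Return detectron2-style KEY VALUE override tokens (hypothetical helper).

    image_size: assumed to map to INPUT.IMAGE_SIZE (verify in your config).
    amp: True/False toggles mixed precision via SOLVER.AMP.ENABLED.
    """
    opts = [
        "SOLVER.IMS_PER_BATCH", str(batch_size),    # total batch size across GPUs
        "DATALOADER.NUM_WORKERS", str(num_workers),  # each worker uses extra host RAM
    ]
    if image_size is not None:
        opts += ["INPUT.IMAGE_SIZE", str(image_size)]  # smaller crops, smaller activations
    if amp is not None:
        opts += ["SOLVER.AMP.ENABLED", str(amp)]       # mixed precision, if supported
    return opts


if __name__ == "__main__":
    # Append the printed tokens to the end of the training command.
    print(" ".join(build_overrides(batch_size=2, num_workers=0, image_size=512, amp=True)))
```

Lowering the batch size and image size mainly affects GPU memory, whereas the number of dataloader workers mainly affects host RAM, so it helps to know which of the two is actually running out.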

sym330 commented 9 months ago

same error

FengLi-ust commented 2 months ago

Sorry for the late reply. How much memory do you need in your case? We use about 30 GB for ResNet-50 with batch size 4.
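One way to answer the question of how much memory a run needs is to log the peak usage from inside the training process. The following is a minimal sketch, not part of MaskDINO: the function names are hypothetical, the host-RAM reading relies on the Unix-only `resource` module, and the GPU reading is skipped when PyTorch or CUDA is unavailable.

```python
import resource
import sys


def peak_host_rss_mib():
    """Peak resident host memory of this process in MiB (Unix only)."""
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    # ru_maxrss is reported in KiB on Linux and in bytes on macOS.
    divisor = 2**20 if sys.platform == "darwin" else 2**10
    return peak / divisor


def peak_gpu_mib():
    """Peak GPU memory allocated by PyTorch in MiB, or None if unavailable."""
    try:
        import torch
    except ImportError:
        return None
    if not torch.cuda.is_available():
        return None
    return torch.cuda.max_memory_allocated() / 2**20


if __name__ == "__main__":
    print(f"peak host RSS: {peak_host_rss_mib():.1f} MiB")
    gpu = peak_gpu_mib()
    print("peak GPU memory: n/a" if gpu is None else f"peak GPU memory: {gpu:.1f} MiB")
```

Calling these after a few training iterations shows whether the failure comes from host RAM (for example, from dataloader workers) or from GPU memory. Note that `ru_maxrss` covers only the calling process, so memory used by worker subprocesses is not included.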