The problem of CUDA Out Of Memory occurred while running the demo

HYH-M87 commented 2 months ago

Problems: When I followed the instructions in the README document under "Inference Demo for UnSAM with Pre-trained Models (whole image segmentation)" and tried to run demo_whole_image.py, the following error occurred: [07/09 16:09:19 d2.utils.memory]: Attempting to copy inputs of <func to CPU due to CUDA OOM Killed Then, I keep checking the GPU and memory usage, and the results are as follows: 20240709-163727 It can be observed that the GPU memory is not fully utilized, but the CPU usage and memory usage are very high. After this, the program crashed.

I tried reducing the input image size by half, and the program ran normally, but the GPU memory was still fully utilized, and the CUDA OOM problem still occurred.

My questions: Is this normal? Does it need to use such a large amount of GPU memory?

The complete log output is as follows:

(UnSAM) ag01@iZ2zegvtmojinj2fz3u4biZ:~/UnSAM/whole_image_segmentation$ python demo_whole_image.py --input ../docs/demos/sa_234337.jpg --output ../docs/demos/pred_sa_234337.jpg  --opts MODEL.WEIGHTS ../unsam_plus_sa1b_1perc_ckpt_50k.pth 
[07/09 16:09:14 detectron2]: Rank of current process: 0. World size: 1
[07/09 16:09:15 detectron2]: Environment info:
----------------------  -----------------------------------------------------------------------------------------
sys.platform            linux
Python                  3.9.19 (main, May  6 2024, 19:43:03) [GCC 11.2.0]
numpy                   1.23.1
detectron2              0.6 @/mnt/local_disk/ag01/anaconda3/envs/UnSAM/lib/python3.9/site-packages/detectron2
Compiler                GCC 9.4
CUDA compiler           CUDA 12.1
detectron2 arch flags   6.0
DETECTRON2_ENV_MODULE   <not set>
PyTorch                 2.0.1 @/mnt/local_disk/ag01/anaconda3/envs/UnSAM/lib/python3.9/site-packages/torch
PyTorch debug build     False
GPU available           Yes
GPU 0                   Tesla P100-PCIE-16GB (arch=6.0)
Driver version          530.30.02
CUDA_HOME               /usr/local/cuda-12.1
Pillow                  9.4.0
torchvision             0.15.2 @/mnt/local_disk/ag01/anaconda3/envs/UnSAM/lib/python3.9/site-packages/torchvision
torchvision arch flags  3.5, 5.0, 6.0, 7.0, 7.5, 8.0, 8.6
fvcore                  0.1.5.post20221221
iopath                  0.1.9
cv2                     4.8.1
----------------------  -----------------------------------------------------------------------------------------
PyTorch built with:
  - GCC 9.3
  - C++ Version: 201703
  - Intel(R) oneAPI Math Kernel Library Version 2023.1-Product Build 20230303 for Intel(R) 64 architecture applications
  - Intel(R) MKL-DNN v2.7.3 (Git Hash 6dbeffbae1f23cbbeae17adb7b5b13f1f37c080e)
  - OpenMP 201511 (a.k.a. OpenMP 4.5)
  - LAPACK is enabled (usually provided by MKL)
  - NNPACK is enabled
  - CPU capability usage: AVX2
  - CUDA Runtime 11.8
  - NVCC architecture flags: -gencode;arch=compute_37,code=sm_37;-gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_61,code=sm_61;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=sm_75;-gencode;arch=compute_80,code=sm_80;-gencode;arch=compute_86,code=sm_86;-gencode;arch=compute_90,code=sm_90;-gencode;arch=compute_37,code=compute_37
  - CuDNN 8.7
  - Magma 2.6.1
  - Build settings: BLAS_INFO=mkl, BUILD_TYPE=Release, CUDA_VERSION=11.8, CUDNN_VERSION=8.7.0, CXX_COMPILER=/opt/rh/devtoolset-9/root/usr/bin/c++, CXX_FLAGS= -D_GLIBCXX_USE_CXX11_ABI=0 -fabi-version=11 -Wno-deprecated -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -DNDEBUG -DUSE_KINETO -DLIBKINETO_NOROCTRACER -DUSE_FBGEMM -DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -DSYMBOLICATE_MOBILE_DEBUG_HANDLE -O2 -fPIC -Wall -Wextra -Werror=return-type -Werror=non-virtual-dtor -Werror=bool-operation -Wnarrowing -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wunused-local-typedefs -Wno-unused-parameter -Wno-unused-function -Wno-unused-result -Wno-strict-overflow -Wno-strict-aliasing -Wno-error=deprecated-declarations -Wno-stringop-overflow -Wno-psabi -Wno-error=pedantic -Wno-error=redundant-decls -Wno-error=old-style-cast -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Werror=cast-function-type -Wno-stringop-overflow, LAPACK_INFO=mkl, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, TORCH_DISABLE_GPU_ASSERTS=ON, TORCH_VERSION=2.0.1, USE_CUDA=ON, USE_CUDNN=ON, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=ON, USE_MKLDNN=ON, USE_MPI=OFF, USE_NCCL=ON, USE_NNPACK=ON, USE_OPENMP=ON, USE_ROCM=OFF, 

[07/09 16:09:15 detectron2]: Command line arguments: Namespace(config_file='configs/maskformer2_R50_bs16_50ep.yaml', input='../docs/demos/sa_234337.jpg', output='../docs/demos/pred_sa_234337.jpg', confidence_thresh=0.5, opts=['MODEL.WEIGHTS', '../unsam_plus_sa1b_1perc_ckpt_50k.pth'])
[07/09 16:09:15 detectron2]: Contents of args.config_file=configs/maskformer2_R50_bs16_50ep.yaml:
_BASE_: Base-COCO-InstanceSegmentation.yaml
MODEL:
  META_ARCHITECTURE: "MaskFormer"
  SEM_SEG_HEAD:
    NAME: "MaskFormerHead"
    IGNORE_VALUE: 255
    NUM_CLASSES: 1 #80
    LOSS_WEIGHT: 1.0
    CONVS_DIM: 256
    MASK_DIM: 256
    NORM: "GN"
    # pixel decoder
    PIXEL_DECODER_NAME: "MSDeformAttnPixelDecoder"
    IN_FEATURES: ["res2", "res3", "res4", "res5"]
    DEFORMABLE_TRANSFORMER_ENCODER_IN_FEATURES: ["res3", "res4", "res5"]
    COMMON_STRIDE: 4
    TRANSFORMER_ENC_LAYERS: 6
  MASK_FORMER:
    TRANSFORMER_DECODER_NAME: "MultiScaleMaskedTransformerDecoder"
    TRANSFORMER_IN_FEATURE: "multi_scale_pixel_decoder"
    DEEP_SUPERVISION: True
    NO_OBJECT_WEIGHT: 0.1
    CLASS_WEIGHT: 2.0
    MASK_WEIGHT: 5.0
    DICE_WEIGHT: 5.0
    HIDDEN_DIM: 256
    NUM_OBJECT_QUERIES: 2000 #100
    NHEADS: 8
    DROPOUT: 0.0
    DIM_FEEDFORWARD: 2048
    ENC_LAYERS: 0
    PRE_NORM: False
    ENFORCE_INPUT_PROJ: False
    SIZE_DIVISIBILITY: 32
    DEC_LAYERS: 10  # 9 decoder layers, add one for the loss on learnable query
    TRAIN_NUM_POINTS: 12544
    OVERSAMPLE_RATIO: 3.0
    IMPORTANCE_SAMPLE_RATIO: 0.75
    TEST:
      SEMANTIC_ON: False
      INSTANCE_ON: True
      PANOPTIC_ON: False
      OVERLAP_THRESHOLD: 0.8
      OBJECT_MASK_THRESHOLD: 0.8

[07/09 16:09:15 detectron2]: Running with full config:
CUDNN_BENCHMARK: false
DATALOADER:
  ASPECT_RATIO_GROUPING: true
  FILTER_EMPTY_ANNOTATIONS: true
  NUM_WORKERS: 1
  REPEAT_THRESHOLD: 0.0
  SAMPLER_TRAIN: TrainingSampler
DATASETS:
  PRECOMPUTED_PROPOSAL_TOPK_TEST: 1000
  PRECOMPUTED_PROPOSAL_TOPK_TRAIN: 2000
  PROPOSAL_FILES_TEST: []
  PROPOSAL_FILES_TRAIN: []
  SELF_TRAIN: false
  SELF_TRAIN_NUMBER: 1
  SELF_TRAIN_THRESH_HIGH: 1.0
  SELF_TRAIN_THRESH_LOW: 0.0
  TEST:
  - unsam_sa1b_val
  TRAIN:
  - unsam_sa1b_train
GLOBAL:
  HACK: 1.0
INPUT:
  COLOR_AUG_SSD: false
  CROP:
    ENABLED: false
    SINGLE_CATEGORY_MAX_AREA: 1.0
    SIZE:
    - 0.9
    - 0.9
    TYPE: relative_range
  DATASET_MAPPER_NAME: sam_instance_tsv
  FORMAT: RGB
  IMAGE_SIZE: 512
  MASK_FORMAT: polygon
  MAX_SCALE: 2.0
  MAX_SIZE_TEST: 2048
  MAX_SIZE_TRAIN: 1333
  MIN_SCALE: 0.1
  MIN_SIZE_TEST: 512
  MIN_SIZE_TRAIN:
  - 800
  MIN_SIZE_TRAIN_SAMPLING: choice
  RANDOM_FLIP: horizontal
  SIZE_DIVISIBILITY: -1
MODEL:
  ANCHOR_GENERATOR:
    ANGLES:
    - - -90
      - 0
      - 90
    ASPECT_RATIOS:
    - - 0.5
      - 1.0
      - 2.0
    NAME: DefaultAnchorGenerator
    OFFSET: 0.0
    SIZES:
    - - 32
      - 64
      - 128
      - 256
      - 512
  BACKBONE:
    FREEZE_AT: 0
    NAME: build_resnet_backbone
  DEVICE: cuda
  FPN:
    FUSE_TYPE: sum
    IN_FEATURES: []
    NORM: ''
    OUT_CHANNELS: 256
  KEYPOINT_ON: false
  LOAD_PROPOSALS: false
  MASK_FORMER:
    CLASS_WEIGHT: 2.0
    DEC_LAYERS: 10
    DEEP_SUPERVISION: true
    DICE_WEIGHT: 5.0
    DIM_FEEDFORWARD: 2048
    DROPOUT: 0.0
    ENC_LAYERS: 0
    ENFORCE_INPUT_PROJ: false
    HIDDEN_DIM: 256
    IMPORTANCE_SAMPLE_RATIO: 0.75
    MASK_WEIGHT: 5.0
    NHEADS: 8
    NO_OBJECT_WEIGHT: 0.1
    NUM_OBJECT_QUERIES: 2000
    OVERSAMPLE_RATIO: 3.0
    PRE_NORM: false
    SIZE_DIVISIBILITY: 32
    TEST:
      INSTANCE_ON: true
      OBJECT_MASK_THRESHOLD: 0.8
      OVERLAP_THRESHOLD: 0.8
      PANOPTIC_ON: false
      SEMANTIC_ON: false
      SEM_SEG_POSTPROCESSING_BEFORE_INFERENCE: false
    TRAIN_NUM_POINTS: 12544
    TRANSFORMER_DECODER_NAME: MultiScaleMaskedTransformerDecoder
    TRANSFORMER_IN_FEATURE: multi_scale_pixel_decoder
  MASK_ON: false
  META_ARCHITECTURE: MaskFormer
  PANOPTIC_FPN:
    COMBINE:
      ENABLED: true
      INSTANCES_CONFIDENCE_THRESH: 0.5
      OVERLAP_THRESH: 0.5
      STUFF_AREA_LIMIT: 4096
    INSTANCE_LOSS_WEIGHT: 1.0
  PIXEL_MEAN:
  - 123.675
  - 116.28
  - 103.53
  PIXEL_STD:
  - 58.395
  - 57.12
  - 57.375
  PROPOSAL_GENERATOR:
    MIN_SIZE: 0
    NAME: RPN
  RESNETS:
    DEFORM_MODULATED: false
    DEFORM_NUM_GROUPS: 1
    DEFORM_ON_PER_STAGE:
    - false
    - false
    - false
    - false
    DEPTH: 50
    NORM: FrozenBN
    NUM_GROUPS: 1
    OUT_FEATURES:
    - res2
    - res3
    - res4
    - res5
    RES2_OUT_CHANNELS: 256
    RES4_DILATION: 1
    RES5_DILATION: 1
    RES5_MULTI_GRID:
    - 1
    - 1
    - 1
    STEM_OUT_CHANNELS: 64
    STEM_TYPE: basic
    STRIDE_IN_1X1: false
    WIDTH_PER_GROUP: 64
  RETINANET:
    BBOX_REG_LOSS_TYPE: smooth_l1
    BBOX_REG_WEIGHTS: &id002
    - 1.0
    - 1.0
    - 1.0
    - 1.0
    FOCAL_LOSS_ALPHA: 0.25
    FOCAL_LOSS_GAMMA: 2.0
    IN_FEATURES:
    - p3
    - p4
    - p5
    - p6
    - p7
    IOU_LABELS:
    - 0
    - -1
    - 1
    IOU_THRESHOLDS:
    - 0.4
    - 0.5
    NMS_THRESH_TEST: 0.5
    NORM: ''
    NUM_CLASSES: 80
    NUM_CONVS: 4
    PRIOR_PROB: 0.01
    SCORE_THRESH_TEST: 0.05
    SMOOTH_L1_LOSS_BETA: 0.1
    TOPK_CANDIDATES_TEST: 1000
  ROI_BOX_CASCADE_HEAD:
    BBOX_REG_WEIGHTS:
    - &id001
      - 10.0
      - 10.0
      - 5.0
      - 5.0
    - - 20.0
      - 20.0
      - 10.0
      - 10.0
    - - 30.0
      - 30.0
      - 15.0
      - 15.0
    IOUS:
    - 0.5
    - 0.6
    - 0.7
  ROI_BOX_HEAD:
    BBOX_REG_LOSS_TYPE: smooth_l1
    BBOX_REG_LOSS_WEIGHT: 1.0
    BBOX_REG_WEIGHTS: *id001
    CLS_AGNOSTIC_BBOX_REG: false
    CONV_DIM: 256
    FC_DIM: 1024
    NAME: ''
    NORM: ''
    NUM_CONV: 0
    NUM_FC: 0
    POOLER_RESOLUTION: 14
    POOLER_SAMPLING_RATIO: 0
    POOLER_TYPE: ROIAlignV2
    SMOOTH_L1_BETA: 0.0
    TRAIN_ON_PRED_BOXES: false
  ROI_HEADS:
    BATCH_SIZE_PER_IMAGE: 512
    IN_FEATURES:
    - res4
    IOU_LABELS:
    - 0
    - 1
    IOU_THRESHOLDS:
    - 0.5
    NAME: Res5ROIHeads
    NMS_THRESH_TEST: 0.5
    NUM_CLASSES: 80
    POSITIVE_FRACTION: 0.25
    PROPOSAL_APPEND_GT: true
    SCORE_THRESH_TEST: 0.05
  ROI_KEYPOINT_HEAD:
    CONV_DIMS:
    - 512
    - 512
    - 512
    - 512
    - 512
    - 512
    - 512
    - 512
    LOSS_WEIGHT: 1.0
    MIN_KEYPOINTS_PER_IMAGE: 1
    NAME: KRCNNConvDeconvUpsampleHead
    NORMALIZE_LOSS_BY_VISIBLE_KEYPOINTS: true
    NUM_KEYPOINTS: 17
    POOLER_RESOLUTION: 14
    POOLER_SAMPLING_RATIO: 0
    POOLER_TYPE: ROIAlignV2
  ROI_MASK_HEAD:
    CLS_AGNOSTIC_MASK: false
    CONV_DIM: 256
    NAME: MaskRCNNConvUpsampleHead
    NORM: ''
    NUM_CONV: 0
    POOLER_RESOLUTION: 14
    POOLER_SAMPLING_RATIO: 0
    POOLER_TYPE: ROIAlignV2
  RPN:
    BATCH_SIZE_PER_IMAGE: 256
    BBOX_REG_LOSS_TYPE: smooth_l1
    BBOX_REG_LOSS_WEIGHT: 1.0
    BBOX_REG_WEIGHTS: *id002
    BOUNDARY_THRESH: -1
    CONV_DIMS:
    - -1
    HEAD_NAME: StandardRPNHead
    IN_FEATURES:
    - res4
    IOU_LABELS:
    - 0
    - -1
    - 1
    IOU_THRESHOLDS:
    - 0.3
    - 0.7
    LOSS_WEIGHT: 1.0
    NMS_THRESH: 0.7
    POSITIVE_FRACTION: 0.5
    POST_NMS_TOPK_TEST: 1000
    POST_NMS_TOPK_TRAIN: 2000
    PRE_NMS_TOPK_TEST: 6000
    PRE_NMS_TOPK_TRAIN: 12000
    SMOOTH_L1_BETA: 0.0
  SEM_SEG_HEAD:
    ASPP_CHANNELS: 256
    ASPP_DILATIONS:
    - 6
    - 12
    - 18
    ASPP_DROPOUT: 0.1
    COMMON_STRIDE: 4
    CONVS_DIM: 256
    DEFORMABLE_TRANSFORMER_ENCODER_IN_FEATURES:
    - res3
    - res4
    - res5
    DEFORMABLE_TRANSFORMER_ENCODER_N_HEADS: 8
    DEFORMABLE_TRANSFORMER_ENCODER_N_POINTS: 4
    IGNORE_VALUE: 255
    IN_FEATURES:
    - res2
    - res3
    - res4
    - res5
    LOSS_TYPE: hard_pixel_mining
    LOSS_WEIGHT: 1.0
    MASK_DIM: 256
    NAME: MaskFormerHead
    NORM: GN
    NUM_CLASSES: 1
    PIXEL_DECODER_NAME: MSDeformAttnPixelDecoder
    PROJECT_CHANNELS:
    - 48
    PROJECT_FEATURES:
    - res2
    TRANSFORMER_ENC_LAYERS: 6
    USE_DEPTHWISE_SEPARABLE_CONV: false
  SWIN:
    APE: false
    ATTN_DROP_RATE: 0.0
    DEPTHS:
    - 2
    - 2
    - 6
    - 2
    DROP_PATH_RATE: 0.3
    DROP_RATE: 0.0
    EMBED_DIM: 96
    MLP_RATIO: 4.0
    NUM_HEADS:
    - 3
    - 6
    - 12
    - 24
    OUT_FEATURES:
    - res2
    - res3
    - res4
    - res5
    PATCH_NORM: true
    PATCH_SIZE: 4
    PRETRAIN_IMG_SIZE: 224
    QKV_BIAS: true
    QK_SCALE: null
    USE_CHECKPOINT: false
    WINDOW_SIZE: 7
  WEIGHTS: ../unsam_plus_sa1b_1perc_ckpt_50k.pth
OUTPUT_DIR: ./output
SEED: -1
SOLVER:
  AMP:
    ENABLED: true
  BACKBONE_MULTIPLIER: 0.1
  BASE_LR: 1.0e-06
  BASE_LR_END: 0.0
  BIAS_LR_FACTOR: 1.0
  CHECKPOINT_PERIOD: 5000
  CLIP_GRADIENTS:
    CLIP_TYPE: full_model
    CLIP_VALUE: 0.01
    ENABLED: true
    NORM_TYPE: 2.0
  DROPLOSS_IOU_THRESH: 0.01
  GAMMA: 0.1
  IMS_PER_BATCH: 4
  LR_SCHEDULER_NAME: WarmupMultiStepLR
  MAX_ITER: 200000
  MOMENTUM: 0.9
  NESTEROV: false
  OPTIMIZER: ADAMW
  POLY_LR_CONSTANT_ENDING: 0.0
  POLY_LR_POWER: 0.9
  REFERENCE_WORLD_SIZE: 0
  STEPS:
  - 177776
  - 192592
  USE_DROP_LOSSES: false
  WARMUP_FACTOR: 1.0
  WARMUP_ITERS: 10
  WARMUP_METHOD: linear
  WEIGHT_DECAY: 0.05
  WEIGHT_DECAY_BIAS: null
  WEIGHT_DECAY_EMBED: 0.0
  WEIGHT_DECAY_NORM: 0.0
TEST:
  AUG:
    ENABLED: false
    FLIP: true
    MAX_SIZE: 4000
    MIN_SIZES:
    - 400
    - 500
    - 600
    - 700
    - 800
    - 900
    - 1000
    - 1100
    - 1200
  DETECTIONS_PER_IMAGE: 1000
  EVAL_PERIOD: 0
  EXPECTED_RESULTS: []
  KEYPOINT_OKS_SIGMAS: []
  PRECISE_BN:
    ENABLED: false
    NUM_ITER: 200
VERSION: 2
VIS_PERIOD: 0

[07/09 16:09:15 detectron2]: Full config saved to ./output/config.yaml
[07/09 16:09:15 d2.utils.env]: Using a generated random seed 15351907
[07/09 16:09:17 fvcore.common.checkpoint]: [Checkpointer] Loading from ../unsam_plus_sa1b_1perc_ckpt_50k.pth ...
/mnt/local_disk/ag01/anaconda3/envs/UnSAM/lib/python3.9/site-packages/torch/functional.py:504: UserWarning: torch.meshgrid: in an upcoming release, it will be required to pass the indexing argument. (Triggered internally at /opt/conda/conda-bld/pytorch_1682343997789/work/aten/src/ATen/native/TensorShape.cpp:3483.)
  return _VF.meshgrid(tensors, **kwargs)  # type: ignore[attr-defined]
[07/09 16:09:19 d2.utils.memory]: Attempting to copy inputs of <func to CPU due to CUDA OOM
Killed

systeminfo

(UnSAM) ag01@iZ2zegvtmojinj2fz3u4biZ:~/UnSAM/whole_image_segmentation$ nvidia-smi
Tue Jul  9 16:30:47 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 530.30.02              Driver Version: 530.30.02    CUDA Version: 12.1     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                  Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf            Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  Tesla P100-PCIE-16GB            On | 00000000:00:08.0 Off |                    0 |
| N/A   29C    P0               25W / 250W|      0MiB / 16384MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|  No running processes found                                                           |
+---------------------------------------------------------------------------------------+

(UnSAM) ag01@iZ2zegvtmojinj2fz3u4biZ:~/UnSAM/whole_image_segmentation$ lscpu
Architecture:        x86_64
CPU op-mode(s):      32-bit, 64-bit
Byte Order:          Little Endian
CPU(s):              4
On-line CPU(s) list: 0-3
Thread(s) per core:  2
Core(s) per socket:  2
Socket(s):           1
NUMA node(s):        1
Vendor ID:           GenuineIntel
CPU family:          6
Model:               79
Model name:          Intel(R) Xeon(R) CPU E5-2682 v4 @ 2.50GHz
Stepping:            1
CPU MHz:             2499.996
BogoMIPS:            4999.99
Hypervisor vendor:   KVM
Virtualization type: full
L1d cache:           32K
L1i cache:           32K
L2 cache:            256K
L3 cache:            40960K
NUMA node0 CPU(s):   0-3
Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid tsc_known_freq pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch invpcid_single pti ibrs ibpb stibp fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm rdseed adx smap xsaveopt arat

frank-xwang commented 2 months ago

Hello, the "out of memory" issue you encountered is primarily due to the upsampling function in Mask2Former. Initially, the function attempts to run the upsampling on the GPU, but if it encounters an out-of-memory (OOM) error, it automatically switches to CPU processing by calling retry_if_cuda_oom. For more specific details, please refer to the forward function in whole_image_segmentation/mask2former/maskformer_model.py.

To fully address your question and clarify the source of this error, let's discuss how Mask2Former processes mask predictions. The model first generates low-resolution masks and then uses the F.interpolate() function to upsample these masks to match the size of the original image. Given that our model outputs several hundred masks per image and the SA1B images are of very high resolution, the F.interpolate() function requires significant memory capacity on either the GPU or CPU.

Here are a few strategies to fix this memory issue:

Utilize a machine with more VRAM ( :-( sorrryyy ). The exact requirement will vary depending on the image size and the number of predicted masks. For context, the demo was successfully run on a machine with 504G of CPU memory.
Reduce the number of predicted masks by adjusting the confidence score threshold (TEST.OBJECT_MASK_THRESHOLD) or limiting the maximum number of predictions per image using parameters like TEST.DETECTIONS_PER_IMAGE.
Modify the forward function to upsample masks in batches rather than all at once. This can be done by upsampling a portion of the masks separately and then concatenating the results.

I'm here to help if you have more questions or need further clarification!

summelon commented 2 months ago

The prompt demo has some bug relative to device which I haven't checked. But I could successfully run the whole image demo by change the batch size to 1 and use a 3090 (24G) + ~70G or more CPU RAM

frank-xwang / UnSAM

The problem of CUDA Out Of Memory occurred while running the demo #7