CUDA out of memory

saswat0 commented 1 year ago

I have a dataset with 228454 images. Every time I try to train OS2D on this dataset, CUDA runs out of memory.

python main.py --config-file experiments/config_training.yml model.use_inverse_geom_model True model.use_simplified_affine_model False train.objective.loc_weight 0.0 train.model.freeze_bn_transform True model.backbone_arch ResNet50 init.model models/imagenet-caffe-resnet50-features-ac468af-converted.pth init.transform models/weakalign_resnet101_affine_tps.pth.tar train.mining.do_mining False output.path output/os2d_v2-train

I'm already using a small batch size (4) and have set cfg.eval.scales_of_image_pyramid to [0.5, 0.625, 1, 1.6], but the error persists.

Full trace of the log:

2022-12-17 11:52:14,366 OS2D INFO: Loaded configuration file experiments/config_training.yml
2022-12-17 11:52:14,366 OS2D INFO:
output:
  path: "" # Substitute ""
  save_iter: 0
  best_model:
    do_get_best_model: True
    dataset: "" # use the first validation dataset
    metric: "mAP@0.50"
    mode: "max"
is_cuda: True
random_seed: 0
init:
  model: "" # Substitute "models/resnet50-19c8e357.pth"
model:
  backbone_arch: "" # Substitute "ResNet50" or "ResNet101"
  use_inverse_geom_model: False # Substitute v1: False v2 : True
  use_simplified_affine_model: True # Substitute v1: True v2 : False
train:
  dataset_name: "grozi-train"
  dataset_scale: 789.0
  objective:
    class_objective: "RLL"
    loc_weight: 0.0 # Substitute v1: 0.2, v2: 0.0
    positive_iou_threshold: 0.5
    negative_iou_threshold: 0.1
    remap_classification_targets: True
    remap_classification_targets_iou_pos: 0.8
    remap_classification_targets_iou_neg: 0.4
  optim:
    anneal_lr:
      type: "MultiStepLR"
      milestones: [100000, 150000]
      gamma: 0.1

  model:
    freeze_bn: True
    freeze_bn_transform: True # Substitute v1: False, v2: True
    train_transform_on_negs: False
eval:
  iter: 1000
  dataset_names: ("grozi-val-new-cl",)
  dataset_scales: (789.0,)
  mAP_iou_thresholds: (0.5,)

2022-12-17 11:52:14,366 OS2D INFO: Running with config:
eval:
  batch_size: 1
  cache_images: False
  class_image_augmentation:
  dataset_names: ['grozi-val-new-cl']
  dataset_scales: [789.0]
  iter: 1000
  mAP_iou_thresholds: [0.5]
  nms_across_classes: False
  nms_iou_threshold: 0.3
  nms_score_threshold: -inf
  scales_of_image_pyramid: [0.5, 0.625, 1, 1.6]
  train_subset_for_eval_size: 0
init:
  model: models/imagenet-caffe-resnet50-features-ac468af-converted.pth
  transform: models/weakalign_resnet101_affine_tps.pth.tar
is_cuda: True
model:
  backbone_arch: ResNet50
  class_image_size: 240
  merge_branch_parameters: True
  normalization_mean: [0.485, 0.456, 0.406]
  normalization_std: [0.229, 0.224, 0.225]
  use_group_norm: False
  use_inverse_geom_model: True
  use_simplified_affine_model: False
output:
  best_model:
    dataset:
    do_get_best_model: True
    metric: mAP@0.50
    mode: max
  path: output/os2d_v2-train
  print_iter: 1
  save_iter: 0
  save_log_to_file: False
random_seed: 0
train:
  augment:
    jitter_aspect_ratio: 0.9
    min_box_coverage: 0.7
    mine_extra_class_images: False
    random_color_distortion: False
    random_crop_class_images: False
    random_flip_batches: False
    scale_jitter: 0.7
    train_patch_height: 600
    train_patch_width: 600
  batch_size: 4
  cache_images: False
  class_batch_size: 15
  dataset_name: grozi-train
  dataset_scale: 789.0
  do_training: True
  mining:
    do_mining: False
    mine_hard_patches_iter: 5000
    nms_iou_threshold_in_mining: 0.5
    num_hard_patches_per_image: 10
    num_random_negative_classes: 200
    num_random_pyramid_scales: 2
  model:
    freeze_bn: True
    freeze_bn_transform: True
    freeze_transform: False
    num_frozen_extractor_blocks: 0
    train_features: True
    train_transform_on_negs: False
  objective:
    class_neg_weight: 1.0
    class_objective: RLL
    loc_weight: 0.0
    neg_margin: 0.5
    neg_to_pos_ratio: 3
    negative_iou_threshold: 0.1
    pos_margin: 0.6
    positive_iou_threshold: 0.5
    remap_classification_targets: True
    remap_classification_targets_iou_neg: 0.4
    remap_classification_targets_iou_pos: 0.8
    rll_neg_weight_ratio: 0.001
  optim:
    anneal_lr:
      cooldown: 10000
      gamma: 0.1
      initial_patience: 0
      milestones: [100000, 150000]
      min_value: 1e-05
      patience: 1000
      quantity_epsilon: 0.01
      quantity_mode: max
      quantity_smoothness: 2000
      quantity_to_monitor: mAP@0.50_grozi-val-new-cl
      reduce_factor: 0.5
      reload_best_model_after_anneal_lr: True
      type: MultiStepLR
    lr: 0.0001
    max_grad_norm: 100.0
    max_iter: 200000
    optim_method: sgd
    sgd_momentum: 0.9
    weight_decay: 0.0001
visualization:
  eval:
    images_for_heatmaps: []
    labels_for_heatmaps: []
    max_detections: 10
    path_to_save_detections:
    score_threshold: -inf
    show_class_heatmaps: False
    show_detections: False
    show_gt_boxes: False
  mining:
    images_for_heatmaps: []
    labels_for_heatmaps: []
    max_detections: 10
    score_threshold: -inf
    show_class_heatmaps: False
    show_gt_boxes: False
    show_mined_patches: False
  train:
    max_detections: 5
    score_threshold: -inf
    show_detections: False
    show_gt_boxes_dataloader: False
    show_target_remapping: False
2022-12-17 11:52:14,366 OS2D INFO: Saving config into: output/os2d_v2-train/config.yml
2022-12-17 11:52:14,435 OS2D INFO: Building the OS2D model
2022-12-17 11:52:17,056 OS2D INFO: Creating model on one GPU
2022-12-17 11:52:17,067 OS2D INFO: Reading model file models/imagenet-caffe-resnet50-features-ac468af-converted.pth
2022-12-17 11:52:17,101 OS2D INFO: Cannot find 'net' in the checkpoint file
2022-12-17 11:52:17,101 OS2D INFO: Failed to load the full model, trying to init feature extractors
2022-12-17 11:52:17,101 OS2D INFO: Trying to init from models/imagenet-caffe-resnet50-features-ac468af-converted.pth
2022-12-17 11:52:17,145 OS2D INFO: FAILED to load as network
2022-12-17 11:52:17,145 OS2D INFO: Trying to init from models/imagenet-caffe-resnet50-features-ac468af-converted.pth as checkpoint
2022-12-17 11:52:17,145 OS2D INFO: FAILED to load as checkpoint
2022-12-17 11:52:17,145 OS2D INFO: Could not init the full feature extractor. Trying to init form a weakalign model
2022-12-17 11:52:17,145 OS2D INFO: Could not init from the weakalign network. Trying to init backbone from models/imagenet-caffe-resnet50-features-ac468af-converted.pth.
2022-12-17 11:52:17,155 OS2D INFO: Successfully initialized backbone.
2022-12-17 11:52:17,160 OS2D INFO: Trying to init affine transform from models/weakalign_resnet101_affine_tps.pth.tar
2022-12-17 11:52:17,351 OS2D INFO: Successfully initialized the affine transform from the provided weakalign model.
2022-12-17 11:52:17,353 OS2D INFO: OS2D has 139 blocks of 10169478 parameters (before freezing)
2022-12-17 11:52:17,353 OS2D INFO: OS2D has 139 blocks of 10169478 trainable parameters
2022-12-17 11:52:17,354 OS2D.dataset INFO: Preparing the GroZi-3.2k dataset: version grozi-train, eval scale 789.0, image caching False
2022-12-17 11:52:17,982 OS2D.dataset INFO: Reading query images
2022-12-17 11:53:01,132 OS2D.dataset INFO: Read 14136 GT images
2022-12-17 11:53:01,149 OS2D.dataset INFO: Reading target images
100%|██████████| 194526/194526 [6:02:58<00:00,  8.93it/s]
2022-12-17 17:56:00,012 OS2D.dataset INFO: Found 194526 data images
2022-12-17 17:56:00,030 OS2D.dataset INFO: Loaded dataset grozi-train with 194526 images, 291076 boxes, 14136 classes
2022-12-17 17:56:00,044 OS2D.eval.dataset INFO: Preparing the GroZi-3.2k dataset: version grozi-val-new-cl, eval scale 789.0, image caching False
2022-12-17 17:56:00,561 OS2D.eval.dataset INFO: Reading query images
2022-12-17 17:56:14,931 OS2D.eval.dataset INFO: Read 12213 GT images
2022-12-17 17:56:14,936 OS2D.eval.dataset INFO: Reading target images
100%|██████████| 63354/63354 [2:05:02<00:00,  8.44it/s]
2022-12-17 20:01:17,803 OS2D.eval.dataset INFO: Found 63354 data images
2022-12-17 20:01:17,812 OS2D.eval.dataset INFO: Loaded dataset grozi-val-new-cl with 63354 images, 72769 boxes, 12213 classes
2022-12-17 20:01:17,955 OS2D.train INFO: Start training
2022-12-17 20:01:17,955 OS2D.evaluate INFO: Starting to eval on grozi-val-new-cl, scale 789.0
2022-12-17 20:01:17,956 OS2D.evaluate INFO: Extracting scores from all images
2022-12-17 20:01:32,900 OS2D.evaluate INFO: Extracting weights from 12213 classes
2022-12-17 20:05:07,670 OS2D.eval.dataloader INFO: Image batch 0 out of 63354
/data1/saswats/baseline/os2d/os2d/engine/evaluate.py:292: UserWarning: __floordiv__ is deprecated, and its behavior will change in a future version of pytorch. It currently rounds toward 0 (like the 'trunc' function NOT 'floor'). This results in incorrect rounding for negative values. To keep the current behavior, use torch.div(a, b, rounding_mode='trunc'), or for actual floor division, use torch.div(a, b, rounding_mode='floor').
  batch_class_ids = [class_ids[l // num_class_views] for l in batch_labels_local]
/data1/saswats/baseline/os2d/os2d/engine/evaluate.py:293: UserWarning: __floordiv__ is deprecated, and its behavior will change in a future version of pytorch. It currently rounds toward 0 (like the 'trunc' function NOT 'floor'). This results in incorrect rounding for negative values. To keep the current behavior, use torch.div(a, b, rounding_mode='trunc'), or for actual floor division, use torch.div(a, b, rounding_mode='floor').
  batch_query_img_sizes = [query_img_sizes[l // num_class_views] for l in batch_labels_local]
/data1/saswats/miniconda3/envs/os2d/lib/python3.6/site-packages/torch/functional.py:445: UserWarning: torch.meshgrid: in an upcoming release, it will be required to pass the indexing argument. (Triggered internally at  /opt/conda/conda-bld/pytorch_1639180593867/work/aten/src/ATen/native/TensorShape.cpp:2157.)
  return _VF.meshgrid(tensors, **kwargs)  # type: ignore[attr-defined]
2022-12-17 20:09:05,610 OS2D.evaluate INFO: Feature time: 4.81s, Label time: 232.43s, Net time: 0h 3m 57s
2022-12-17 20:09:10,840 OS2D.evaluate INFO: loss 0.1083, class_loss_per_element_detached_cpu 0.0000, loc_smoothL1 4.5910, cls_RLL 0.1083, cls_RLL_pos 0.0930, cls_RLL_neg 0.0152,
2022-12-17 20:10:26,108 OS2D.eval.dataloader INFO: Image batch 1 out of 63354
2022-12-17 20:14:18,661 OS2D.evaluate INFO: Feature time: 0.07s, Label time: 231.11s, Net time: 0h 3m 52s
2022-12-17 20:14:23,169 OS2D.evaluate INFO: loss 0.1185, class_loss_per_element_detached_cpu 0.0000, loc_smoothL1 4.6364, cls_RLL 0.1185, cls_RLL_pos 0.1032, cls_RLL_neg 0.0154,
2022-12-17 20:15:39,063 OS2D.eval.dataloader INFO: Image batch 2 out of 63354
2022-12-17 20:19:31,042 OS2D.evaluate INFO: Feature time: 0.06s, Label time: 230.85s, Net time: 0h 3m 51s
Traceback (most recent call last):
  File "main.py", line 98, in <module>
    main()
  File "main.py", line 94, in main
    trainval_loop(dataloader_train, net, cfg, criterion, optimizer, dataloaders_eval=dataloaders_eval)
  File "/data1/saswats/baseline/os2d/os2d/engine/train.py", line 426, in trainval_loop
    meters_eval = evaluate_model(dataloaders_eval, net, cfg, criterion)
  File "/data1/saswats/baseline/os2d/os2d/engine/train.py", line 392, in evaluate_model
    meters_val = evaluate(dataloader, net, cfg, criterion=criterion, print_per_class_results=print_per_class_results)
  File "/data1/saswats/miniconda3/envs/os2d/lib/python3.6/site-packages/torch/autograd/grad_mode.py", line 28, in decorate_context
    return func(*args, **kwargs)
  File "/data1/saswats/baseline/os2d/os2d/engine/evaluate.py", line 97, in evaluate
    add_batch_dim(class_targets_pyramid)
  File "/data1/saswats/miniconda3/envs/os2d/lib/python3.6/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/data1/saswats/baseline/os2d/os2d/engine/objective.py", line 257, in forward
    neg_ranking = self._hard_negative_mining(cls_loss.unsqueeze(0), mask_all_negs.unsqueeze(0)).squeeze(0)  # [batch_size, num_labels, num_anchors]
  File "/data1/saswats/baseline/os2d/os2d/engine/objective.py", line 68, in _hard_negative_mining
    _, rank_mined = idx.sort(1)      # [batch_size, *]
RuntimeError: CUDA out of memory. Tried to allocate 1.93 GiB (GPU 0; 47.54 GiB total capacity; 41.29 GiB already allocated; 1.43 GiB free; 44.01 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

aosokin commented 1 year ago

We implemented evaluation by first extracting features from all class templates (class images) and then using these features to detect everywhere. My hypothesis is that your dataset has too many classes to detect at once. I think you can do one of two things: 1) split the classes into several "class batches" such that each batch fits in GPU memory; 2) disable caching of the class features and recompute everything on the fly. However, the second approach might slow down detection.
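
A minimal sketch of option 1, assuming detection for one group of classes does not depend on the other groups. The names iter_class_batches, detect_in_class_batches and detect_fn are hypothetical and not part of the OS2D code; detect_fn stands in for whatever routine scores one image against a given list of class ids.

```python
from typing import Callable, Dict, Iterator, List, Sequence


def iter_class_batches(class_ids: Sequence[int], batch_size: int) -> Iterator[List[int]]:
    """Yield consecutive slices of class_ids with at most batch_size entries."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(class_ids), batch_size):
        yield list(class_ids[start:start + batch_size])


def detect_in_class_batches(
    image,
    class_ids: Sequence[int],
    batch_size: int,
    detect_fn: Callable[[object, List[int]], Dict[int, list]],
) -> Dict[int, list]:
    """Call detect_fn once per class batch and merge the per-class results.

    detect_fn is a hypothetical callable (image, class_batch) -> {class_id: detections}.
    """
    merged: Dict[int, list] = {}
    for batch in iter_class_batches(class_ids, batch_size):
        # Only the class features of this batch have to be resident at once.
        merged.update(detect_fn(image, batch))
    return merged
```

The batch size would have to be tuned by hand: start small and increase it until the evaluation step no longer fits in GPU memory.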

saswat0 commented 1 year ago

Yes, you're correct. I have a lot of classes (16k with 0.7M images). For the first approach that you suggested, is there any provision in the code to do so? For the second approach, is setting cache_images to False all that's needed?

aosokin commented 1 year ago

I'm afraid neither of these is supported in the code. cache_images seems to do something different. Probably the easiest thing to do for the first approach is to split the data manually. For the second approach, you'll need to change the iterators over the data.
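
One way the manual split could look, as a rough sketch only. It assumes the annotations can be read as rows of (image_id, class_id, box); that row layout and the function split_annotations_by_class are illustrative assumptions, not the OS2D dataset format or API. Each resulting subset would then be saved and evaluated as its own smaller dataset.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# Hypothetical annotation row: (image_id, class_id, (x1, y1, x2, y2)).
Annotation = Tuple[str, int, Tuple[float, float, float, float]]


def split_annotations_by_class(
    annotations: List[Annotation], classes_per_split: int
) -> List[List[Annotation]]:
    """Partition annotation rows so each subset covers at most classes_per_split classes."""
    if classes_per_split <= 0:
        raise ValueError("classes_per_split must be positive")
    class_ids = sorted({class_id for _, class_id, _ in annotations})
    # Assign every class to a split index based on its rank among the sorted ids.
    split_of_class: Dict[int, int] = {
        class_id: rank // classes_per_split for rank, class_id in enumerate(class_ids)
    }
    splits: Dict[int, List[Annotation]] = defaultdict(list)
    for row in annotations:
        splits[split_of_class[row[1]]].append(row)
    return [splits[index] for index in sorted(splits)]
```

Since mAP is a mean of per-class AP values, the per-subset results should be combinable by a class-count-weighted average, provided nothing in the pipeline couples classes (the config above has nms_across_classes set to False).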

saswat0 commented 1 year ago

Okay. I'll give it a shot. Thanks

aosokin / os2d

CUDA out of memory #41