Closed: muito93 closed this issue 4 years ago.
The error is expected given the small total memory size.
This also happens on another machine with an Nvidia GTX 1080 Ti (11 GB GPU memory). So how should we handle this typical case? We really need the custom dataloader's image transformation (without resizing the original image to 800×800 as in the default dataloader) because of the information loss. @ppwwyyxx Any suggestion would be appreciated. Thanks.
Suggestions are to use smaller images, smaller models, or larger GPUs.

Your random rotation augmentation might also be a problem, as it increases memory allocation.
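A concrete way to apply the "smaller images" suggestion without the fixed 800×800 square resize (and so with less information loss) is to cap the training resolution while preserving the aspect ratio. A minimal sketch using detectron2's standard config keys; note these only take effect with the default `DatasetMapper`, so a custom mapper has to apply an equivalent transform (e.g. `T.ResizeShortestEdge`) itself:

```python
# Sketch: bound training resolution via the standard config keys.
# 1280x720 keeps the 16:9 aspect ratio of the 1920x1080 source frames.
cfg.INPUT.MIN_SIZE_TRAIN = (720,)  # shortest edge is resized to 720 px
cfg.INPUT.MAX_SIZE_TRAIN = 1280    # longest edge is capped at 1280 px
```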
I met this error:

```
Exception has occurred: RuntimeError
CUDA out of memory. Tried to allocate 190.00 MiB (GPU 0; 3.94 GiB total capacity; 2.12 GiB already allocated; 171.06 MiB free; 2.22 GiB reserved in total by PyTorch)
  File "/media/thanhvt/HDD_Data/UbuntuWork/LenseProject/Lenses_Detectron2/train.py", line 121, in <module>
    trainer.train()
```
I used a custom dataloader. If I use `T.RandomApply(transform=T.Resize(shape=(800, 800)), prob=1)`, it runs fine. But if I comment out this line and use the original image size (1920×1080), the CUDA out-of-memory error appears. Note that the batch size is already 1 (`cfg.SOLVER.IMS_PER_BATCH = 1`).
Instructions To Reproduce the Issue:
```python
import json
import os

from detectron2 import model_zoo
from detectron2.config import get_cfg
from detectron2.data import DatasetCatalog, MetadataCatalog, DatasetMapper
from detectron2.structures import BoxMode
from detectron2.engine import DefaultTrainer

def get_scratch_lense_dicts(img_dir):
    json_file = os.path.join(img_dir, "via_project_9Aug2020_9h36m_json.json")
    with open(json_file) as f:
        imgs_anns = json.load(f)
    ...  # rest of the dataset parsing was not included in the post

for d in ["train", "val"]:
    DatasetCatalog.register("scratch_lense_" + d, lambda d=d: get_scratch_lense_dicts("scratch_lense/" + d))
    MetadataCatalog.get("scratch_lense_" + d).set(thing_classes=["scratch_lense"])
scratch_lense_metadata = MetadataCatalog.get("scratch_lense_train")

cfg = get_cfg()
cfg.merge_from_file(model_zoo.get_config_file("COCO-Detection/faster_rcnn_R_50_FPN_3x.yaml"))
cfg.DATASETS.TRAIN = ("scratch_lense_train",)
cfg.DATASETS.TEST = ()
cfg.DATALOADER.NUM_WORKERS = 8
cfg.MODEL.WEIGHTS = model_zoo.get_checkpoint_url("COCO-Detection/faster_rcnn_R_50_FPN_3x.yaml")  # initialize from model zoo
cfg.SOLVER.IMS_PER_BATCH = 1
cfg.SOLVER.BASE_LR = 0.00025  # pick a good LR
cfg.SOLVER.MAX_ITER = 300  # enough for this toy dataset; train longer for a practical dataset
cfg.MODEL.ROI_HEADS.BATCH_SIZE_PER_IMAGE = 56  # faster, and good enough for this toy dataset (default: 512)
cfg.MODEL.ROI_HEADS.NUM_CLASSES = 1  # only one class (scratch_lense)
cfg.SOLVER.CHECKPOINT_PERIOD = 50
cfg.INPUT.MAX_SIZE_TRAIN = 2000  # cap the longest training edge

os.makedirs(cfg.OUTPUT_DIR, exist_ok=True)
```
```python
import detectron2.data.transforms as T
from detectron2.data import detection_utils as utils
import copy

def custom_mapper(dataset_dict):
    # Implement a mapper, similar to the default DatasetMapper, but with your own customizations
    ...
```
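The body of `custom_mapper` did not survive in the post. A minimal sketch of what it likely contained, following the custom-mapper pattern from the detectron2 tutorial and the two augmentations mentioned in this thread (the rotation angles here are an assumption, not from the post):

```python
import copy
import torch
import detectron2.data.transforms as T
from detectron2.data import detection_utils as utils

def custom_mapper(dataset_dict):
    dataset_dict = copy.deepcopy(dataset_dict)  # it will be modified below
    image = utils.read_image(dataset_dict["file_name"], format="BGR")
    transform_list = [
        # Commenting out this Resize and feeding the raw 1920x1080 frames
        # is what triggers the OOM reported above.
        T.RandomApply(transform=T.Resize(shape=(800, 800)), prob=1),
        T.RandomRotation(angle=[-30, 30]),  # assumed angle range
    ]
    image, transforms = T.apply_transform_gens(transform_list, image)
    dataset_dict["image"] = torch.as_tensor(image.transpose(2, 0, 1).astype("float32"))

    annos = [
        utils.transform_instance_annotations(obj, transforms, image.shape[:2])
        for obj in dataset_dict.pop("annotations")
    ]
    instances = utils.annotations_to_instances(annos, image.shape[:2])
    dataset_dict["instances"] = utils.filter_empty_instances(instances)
    return dataset_dict
```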
```python
class CS_Trainer(DefaultTrainer):
    @classmethod
    def build_test_loader(cls, cfg, dataset_name):
        return build_detection_test_loader(cfg, dataset_name, mapper=DatasetMapper(cfg, False))
```
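As posted, `CS_Trainer` only overrides `build_test_loader`, so training would still use the default mapper; for `custom_mapper` to actually run during training (as the report says it does), the class presumably also overrode `build_train_loader`, roughly like this:

```python
from detectron2.data import build_detection_train_loader, build_detection_test_loader

class CS_Trainer(DefaultTrainer):
    @classmethod
    def build_train_loader(cls, cfg):
        # Route training batches through the custom mapper defined above.
        return build_detection_train_loader(cfg, mapper=custom_mapper)

    @classmethod
    def build_test_loader(cls, cfg, dataset_name):
        # Evaluation keeps the default mapper (is_train=False).
        return build_detection_test_loader(cfg, dataset_name, mapper=DatasetMapper(cfg, False))
```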
```python
trainer = CS_Trainer(cfg)
trainer.resume_or_load(resume=False)
trainer.train()
```
```python
from detectron2.evaluation import COCOEvaluator, inference_on_dataset
from detectron2.data import build_detection_test_loader

evaluator = COCOEvaluator("scratch_lense_val", cfg, False, output_dir="./output/")
val_loader = build_detection_test_loader(cfg, "scratch_lense_val")
print(inference_on_dataset(trainer.model, val_loader, evaluator))
```
```
[08/14 07:32:20 d2.data.common]: Serializing 17 elements to byte tensors and concatenating them all ...
[08/14 07:32:20 d2.data.common]: Serialized dataset takes 0.00 MiB
[08/14 07:32:20 d2.data.build]: Using training sampler TrainingSampler
2020-08-14 07:32:20.970958: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.10.1
Skip loading parameter 'roi_heads.box_predictor.cls_score.weight' to the model due to incompatible shapes: (81, 1024) in the checkpoint but (2, 1024) in the model! You might want to double check if this is expected.
Skip loading parameter 'roi_heads.box_predictor.cls_score.bias' to the model due to incompatible shapes: (81,) in the checkpoint but (2,) in the model! You might want to double check if this is expected.
Skip loading parameter 'roi_heads.box_predictor.bbox_pred.weight' to the model due to incompatible shapes: (320, 1024) in the checkpoint but (4, 1024) in the model! You might want to double check if this is expected.
Skip loading parameter 'roi_heads.box_predictor.bbox_pred.bias' to the model due to incompatible shapes: (320,) in the checkpoint but (4,) in the model! You might want to double check if this is expected.
[08/14 07:32:23 d2.engine.train_loop]: Starting training from iteration 0
/home/thanhvt/anaconda3/envs/keras/lib/python3.8/site-packages/detectron2/layers/wrappers.py:226: UserWarning: This overload of nonzero is deprecated:
    nonzero()
Consider using one of the following signatures instead:
    nonzero(*, bool as_tuple) (Triggered internally at /pytorch/torch/csrc/utils/python_arg_parser.cpp:766.)
  return x.nonzero().unbind(1)
ERROR [08/14 07:32:25 d2.engine.train_loop]: Exception during training:
Traceback (most recent call last):
  File "/home/thanhvt/anaconda3/envs/keras/lib/python3.8/site-packages/detectron2/engine/train_loop.py", line 140, in train
    self.run_step()
  File "/home/thanhvt/anaconda3/envs/keras/lib/python3.8/site-packages/detectron2/engine/train_loop.py", line 226, in run_step
    loss_dict = self.model(data)
  File "/home/thanhvt/anaconda3/envs/keras/lib/python3.8/site-packages/torch/nn/modules/module.py", line 722, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/thanhvt/anaconda3/envs/keras/lib/python3.8/site-packages/detectron2/modeling/meta_arch/rcnn.py", line 157, in forward
    features = self.backbone(images.tensor)
  File "/home/thanhvt/anaconda3/envs/keras/lib/python3.8/site-packages/torch/nn/modules/module.py", line 722, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/thanhvt/anaconda3/envs/keras/lib/python3.8/site-packages/detectron2/modeling/backbone/fpn.py", line 132, in forward
    lateral_features = lateral_conv(features)
  File "/home/thanhvt/anaconda3/envs/keras/lib/python3.8/site-packages/torch/nn/modules/module.py", line 722, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/thanhvt/anaconda3/envs/keras/lib/python3.8/site-packages/detectron2/layers/wrappers.py", line 94, in forward
    x = super().forward(x)
  File "/home/thanhvt/anaconda3/envs/keras/lib/python3.8/site-packages/torch/nn/modules/conv.py", line 419, in forward
    return self._conv_forward(input, self.weight)
  File "/home/thanhvt/anaconda3/envs/keras/lib/python3.8/site-packages/torch/nn/modules/conv.py", line 415, in _conv_forward
    return F.conv2d(input, weight, self.bias, self.stride,
RuntimeError: CUDA out of memory. Tried to allocate 190.00 MiB (GPU 0; 3.94 GiB total capacity; 2.12 GiB already allocated; 171.06 MiB free; 2.22 GiB reserved in total by PyTorch)
[08/14 07:32:25 d2.engine.hooks]: Total training time: 0:00:01 (0:00:00 on hooks)
```
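For debugging OOMs like this, it can help to log how close training gets to the 3.94 GiB limit before the failing 190 MiB allocation. A small helper using PyTorch's built-in counters (the function name is mine, not from the report):

```python
import torch

def log_gpu_memory(tag=""):
    # Report current and peak GPU memory in MiB for the default device.
    alloc = torch.cuda.memory_allocated() / 2**20
    peak = torch.cuda.max_memory_allocated() / 2**20
    print(f"[{tag}] allocated: {alloc:.0f} MiB, peak: {peak:.0f} MiB")

# e.g. call log_gpu_memory("after step") from a training hook
```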
Environment:

```
sys.platform             linux
Python                   3.8.3 (default, Jul 2 2020, 16:21:59) [GCC 7.3.0]
numpy                    1.18.5
detectron2               0.2.1 @/home/thanhvt/anaconda3/envs/keras/lib/python3.8/site-packages/detectron2
Compiler                 GCC 7.3
CUDA compiler            CUDA 10.1
detectron2 arch flags    sm_35, sm_37, sm_50, sm_52, sm_60, sm_61, sm_70, sm_75
DETECTRON2_ENV_MODULE    <not set>
PyTorch                  1.6.0+cu101 @/home/thanhvt/anaconda3/envs/keras/lib/python3.8/site-packages/torch
PyTorch debug build      False
GPU available            True
GPU 0                    GeForce GTX 1050 Ti
CUDA_HOME                /usr/local/cuda-10.1
Pillow                   7.2.0
torchvision              0.7.0+cu101 @/home/thanhvt/anaconda3/envs/keras/lib/python3.8/site-packages/torchvision
torchvision arch flags   sm_35, sm_50, sm_60, sm_70, sm_75
fvcore                   0.1.1.post20200716
cv2                      4.3.0
PyTorch built with:
```