facebookresearch / detectron2

Detectron2 is a platform for object detection, segmentation and other visual recognition tasks.
https://detectron2.readthedocs.io/en/latest/
Apache License 2.0

I can't train the model with batch size 28 in a Linux environment, but I can get training results on Windows with batch size 28 #5311

Open eklahari opened 1 week ago

eklahari commented 1 week ago

from register_dataset import *  # registers the custom "football_train" dataset
from detectron2 import model_zoo
from detectron2.engine import DefaultPredictor, DefaultTrainer
from detectron2.config import get_cfg
from detectron2.utils.visualizer import Visualizer
from detectron2.data import MetadataCatalog, DatasetCatalog
import os

os.environ["CUDA_LAUNCH_BLOCKING"] = "1"  # run CUDA calls synchronously for clearer error traces

cfg = get_cfg()
cfg.merge_from_file(model_zoo.get_config_file("COCO-InstanceSegmentation/mask_rcnn_R_50_FPN_3x.yaml"))
cfg.MODEL.MASK_ON = False
cfg.DATASETS.TRAIN = ("football_train",)
cfg.DATASETS.TEST = ()
cfg.DATALOADER.NUM_WORKERS = 2
cfg.MODEL.WEIGHTS = model_zoo.get_checkpoint_url("COCO-InstanceSegmentation/mask_rcnn_R_50_FPN_3x.yaml")
cfg.SOLVER.IMS_PER_BATCH = 28
cfg.SOLVER.BASE_LR = 0.00025
cfg.SOLVER.MAX_ITER = 1000
cfg.MODEL.ROI_HEADS.BATCH_SIZE_PER_IMAGE = 128
cfg.MODEL.ROI_HEADS.NUM_CLASSES = 5  # number of classes in the dataset

cfg.OUTPUT_DIR = "/output1"
os.makedirs(cfg.OUTPUT_DIR, exist_ok=True)
with open(os.path.join(cfg.OUTPUT_DIR, "config.yaml"), "w") as f:
    f.write(cfg.dump())

trainer = DefaultTrainer(cfg)
trainer.resume_or_load(resume=False)
trainer.train()

When I run this code with batch size 28, I get a CUDA error:

[Screenshot, 2024-06-19 9:00 PM: CUDA error traceback]

However, I am able to run this file on Windows on a machine with the same configuration as the Linux one. What is the issue, and how can I overcome it? Could you please provide some code that trains well with an increased batch size in a Linux environment?

github-actions[bot] commented 1 week ago

You've chosen to report an unexpected problem or bug. Unless you already know the root cause of it, please include details about it by filling in the issue template. The following information is missing: "Instructions To Reproduce the Issue and Full Logs".

Programmer-RD-AI commented 1 week ago

Hi, this is usually because of differences in how CUDA memory is managed across environments.
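As a first step, it can help to confirm how much GPU memory is actually free on the Linux machine before training starts; another process (for example a display server) may already be holding part of it. A minimal sketch, assuming a single CUDA device at index 0 and a PyTorch version that provides torch.cuda.mem_get_info:

```python
import torch

# Free and total memory on GPU 0, as reported by the CUDA driver.
free_bytes, total_bytes = torch.cuda.mem_get_info(0)
print(f"GPU 0: {free_bytes / 1024**3:.2f} GiB free of {total_bytes / 1024**3:.2f} GiB total")

# Memory currently held by this process via PyTorch's caching allocator.
print(f"allocated: {torch.cuda.memory_allocated(0) / 1024**3:.2f} GiB")
print(f"reserved:  {torch.cuda.memory_reserved(0) / 1024**3:.2f} GiB")
```

If the free memory reported on Linux is noticeably lower than on Windows, that alone can explain why the same batch size fits on one machine and not the other.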

There isn't a single fix for this, but on the Linux machine where a batch size of 28 does not fit, you could try the following:

  1. Reduce the batch size (cfg.SOLVER.IMS_PER_BATCH)
  2. Switch to a smaller model/backbone
  3. Monitor GPU memory with torch.cuda.memory_allocated() and torch.cuda.memory_reserved() (memory_cached() is deprecated), as in the sketch below
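To make point 3 concrete, here is a minimal sketch of a training hook that logs GPU memory every few hundred iterations, so you can see how close training gets to the limit before the CUDA error appears. It assumes the cfg object from the original post; the hook name GPUMemoryLogger and the 200-iteration period are illustrative choices, not part of detectron2.

```python
import torch
from detectron2.engine import DefaultTrainer, HookBase

class GPUMemoryLogger(HookBase):
    """Illustrative hook: print allocated/reserved GPU memory every `period` iterations."""

    def __init__(self, period=200):
        self._period = period

    def after_step(self):
        if self.trainer.iter % self._period == 0:
            allocated = torch.cuda.memory_allocated() / 1024**3
            reserved = torch.cuda.memory_reserved() / 1024**3  # memory_cached() is the deprecated name for this
            print(f"iter {self.trainer.iter}: allocated {allocated:.2f} GiB, reserved {reserved:.2f} GiB")

# Example: lower the batch size until training fits, then raise it again while watching the log.
cfg.SOLVER.IMS_PER_BATCH = 16  # illustrative value; 28 did not fit on this GPU

trainer = DefaultTrainer(cfg)
trainer.register_hooks([GPUMemoryLogger(period=200)])
trainer.resume_or_load(resume=False)
trainer.train()
```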

These aren't guaranteed solutions, but they are ways you may still be able to train your model in a Linux environment. I hope that explains the issue; if there are any more questions, please let me know.

Thank you