Open Ayadx opened 1 week ago
You've chosen to report an unexpected problem or bug. Unless you already know the root cause of it, please include details about it by filling the issue template. The following information is missing: "Instructions To Reproduce the Issue and Full Logs"; "Your Environment";
Hi, check issues #2442 & #2473 for examples of how to implement multi-GPU training. Hope this helps; if there are further questions please feel free to comment :) Best regards, Ranuga
Hi, I am trying to run multi-GPU training on Kaggle with two Tesla T4s. My code only runs on 1 GPU; the other is not utilized. I am able to train on a custom dataset and get acceptable results, but I wish to use 2 GPUs for faster training.
I am using this, but it is not working: `python -m torch.distributed.launch --nproc_per_node=2 train_yolo.py` (note that `torch.distributed.launch` is deprecated in recent PyTorch releases in favor of `torchrun`).
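For context, launchers like `torchrun` / `torch.distributed.launch` spawn one process per GPU and communicate each process's identity through environment variables, which the training script (or detectron2's own launcher) reads to pick its device. A minimal sketch of that contract, setting the variables by hand for illustration (the values below are assumptions, not what a real launcher on this machine would set):

```python
import os

# torchrun starts one process per GPU and sets these variables in each process;
# here we set them manually just to show what the training script sees.
os.environ["WORLD_SIZE"] = "2"   # total number of processes (one per GPU)
os.environ["RANK"] = "0"         # global index of this process
os.environ["LOCAL_RANK"] = "0"   # index on this machine -> cuda:0

world_size = int(os.environ["WORLD_SIZE"])
local_rank = int(os.environ["LOCAL_RANK"])
print(f"process {os.environ['RANK']} of {world_size} uses cuda:{local_rank}")
```

If only one process ever starts (e.g. the script has no `if __name__ == "__main__":` guard, or the launcher command fails silently), only `cuda:0` will be used, which matches the symptom described above.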
Full runnable code or full changes you made:
```python
import os

# Define the training script content
script_content = """
import json
import multiprocessing as mp
from detectron2.engine import DefaultTrainer, HookBase
from detectron2.config import get_cfg
from detectron2 import model_zoo
from detectron2.evaluation import COCOEvaluator, inference_on_dataset
from detectron2.data import build_detection_test_loader, DatasetCatalog, MetadataCatalog
from detectron2.structures import BoxMode

# Unregister the datasets if they are already registered
for d in ["pv_anomaly_train", "pv_anomaly_val", "pv_anomaly_test"]:
    if d in DatasetCatalog.list():
        DatasetCatalog.remove(d)
    if d in MetadataCatalog.list():
        MetadataCatalog.remove(d)

def load_coco_json(json_file, image_root, dataset_name):
    with open(json_file) as f:
        imgs_anns = json.load(f)

def register_datasets():
    DatasetCatalog.register(
        "pv_anomaly_train",
        lambda: load_coco_json(
            "/kaggle/working/0PVProjects/Univpm_DataSet/labels/train_annotations.json",
            "/kaggle/working/0PVProjects/Univpm_DataSet/images/train_combined_data",
            "pv_anomaly_train"
        )
    )
    MetadataCatalog.get("pv_anomaly_train").set(thing_classes=["anomaly"])

def set_multiprocessing_start_method():
    try:
        mp.set_start_method('spawn', force=True)
    except RuntimeError as e:
        if "context has already been set" in str(e):
            print("Multiprocessing context already set, continuing without changing start method.")
        else:
            raise

class PrintMetricsHook(HookBase):
    def __init__(self, cfg):
        self.cfg = cfg

class MyTrainer(DefaultTrainer):
    @classmethod
    def build_evaluator(cls, cfg, dataset_name):
        return COCOEvaluator(dataset_name, cfg, False, output_dir=cfg.OUTPUT_DIR)

def main():
    register_datasets()
    set_multiprocessing_start_method()

if __name__ == "__main__":
    main()
"""

# Write the script to a file
script_path = '/kaggle/working/train_yolo.py'
with open(script_path, 'w') as f:
    f.write(script_content)

# Define the command to run the training script using torch.distributed.run
train_command = f"python -m torch.distributed.run --nproc_per_node=2 {script_path}"

# Execute the training command
os.system(train_command)
```

Best regards!
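One detail worth isolating from the script above: `mp.set_start_method` raises `RuntimeError` if a start method (or context) was already set, which is why the helper wraps it. With `force=True` the call already overrides any previous setting, so repeated calls are safe and the `RuntimeError` branch is effectively a belt-and-braces fallback. A self-contained sketch of that behavior (standard library only, no detectron2 needed):

```python
import multiprocessing as mp

def set_multiprocessing_start_method():
    # force=True overrides any previously set context, so calling this
    # twice does not raise; the except branch only matters without force.
    try:
        mp.set_start_method("spawn", force=True)
    except RuntimeError as e:
        if "context has already been set" not in str(e):
            raise

set_multiprocessing_start_method()
set_multiprocessing_start_method()  # second call is safe
print(mp.get_start_method())
```

The `spawn` start method matters for CUDA: forked child processes cannot safely reinitialize CUDA, so multi-GPU worker processes should be spawned rather than forked.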