facebookresearch / detectron2

Detectron2 is a platform for object detection, segmentation and other visual recognition tasks.
https://detectron2.readthedocs.io/en/latest/
Apache License 2.0

Specify device_ids in barrier() to force use of a particular device #3319

Closed · StevanCakic closed this issue 3 years ago

StevanCakic commented 3 years ago

I have a problem running a modified train_net.py script on multiple GPUs.

Instructions To Reproduce the Issue:

I'm using this dataset as an experiment to test how to run detectron2 training on multiple GPUs with Slurm.

  1. Full runnable code or full changes you made (modified tools/train_net.py):
#!/usr/bin/env python
# Copyright (c) Facebook, Inc. and its affiliates.
"""
A main training script.

This script reads a given config file and runs the training or evaluation.
It is an entry point that is made to train standard models in detectron2.

In order to let one script support training of many models,
this script contains logic that is specific to these built-in models and therefore
may not be suitable for your own project.
For example, your research project perhaps only needs a single "evaluator".

Therefore, we recommend you use detectron2 as a library and take
this file as an example of how to use the library.
You may want to write your own script with your datasets and other customizations.
"""

import logging
import os
from collections import OrderedDict
import torch

import detectron2.utils.comm as comm
from detectron2.checkpoint import DetectionCheckpointer
from detectron2.config import get_cfg
from detectron2.data import MetadataCatalog
from detectron2.engine import DefaultPredictor, DefaultTrainer, default_argument_parser, default_setup, hooks, launch
from detectron2.evaluation import (
    CityscapesInstanceEvaluator,
    CityscapesSemSegEvaluator,
    COCOEvaluator,
    COCOPanopticEvaluator,
    DatasetEvaluators,
    LVISEvaluator,
    PascalVOCDetectionEvaluator,
    SemSegEvaluator,
    verify_results,
)
from detectron2.modeling import GeneralizedRCNNWithTTA

from detectron2 import model_zoo
from detectron2.data.datasets import register_coco_instances
from detectron2.utils.visualizer import Visualizer
import glob
import cv2

# Added for SLURM
import ssl
ssl._create_default_https_context = ssl._create_unverified_context  # skip SSL certificate verification when downloading model zoo weights
cfg = None      # populated in setup()
trainer = None  # populated in main()

class Trainer(DefaultTrainer):
    """
    We use the "DefaultTrainer" which contains pre-defined default logic for
    standard training workflow. They may not work for you, especially if you
    are working on a new research project. In that case you can write your
    own training loop. You can use "tools/plain_train_net.py" as an example.
    """
    @classmethod
    def build_evaluator(cls, cfg, dataset_name, output_folder=None):
        if output_folder is None:
            os.makedirs("coco_eval", exist_ok=True)
            output_folder = "coco_eval"
        return COCOEvaluator(dataset_name, cfg, False, output_folder)

    @classmethod
    def test_with_TTA(cls, cfg, model):
        logger = logging.getLogger("detectron2.trainer")
        # In the end of training, run an evaluation with TTA
        # Only support some R-CNN models.
        logger.info("Running inference with test-time augmentation ...")
        model = GeneralizedRCNNWithTTA(cfg, model)
        evaluators = [
            cls.build_evaluator(
                cfg, name, output_folder=os.path.join(cfg.OUTPUT_DIR, "inference_TTA")
            )
            for name in cfg.DATASETS.TEST
        ]
        res = cls.test(cfg, model, evaluators)
        res = OrderedDict({k + "_TTA": v for k, v in res.items()})
        return res

def setup():
    """
    Create configs and perform basic setups.
    """
    global cfg
    print("START SETUP")
    cfg = get_cfg()
    cfg.merge_from_file(model_zoo.get_config_file("COCO-Detection/faster_rcnn_X_101_32x8d_FPN_3x.yaml"))
    cfg.DATASETS.TRAIN = ("my_dataset_train",)
    cfg.DATASETS.TEST = ("my_dataset_val",)
    cfg.DATALOADER.NUM_WORKERS = 4
    cfg.MODEL.WEIGHTS = model_zoo.get_checkpoint_url("COCO-Detection/faster_rcnn_X_101_32x8d_FPN_3x.yaml")  # Let training initialize from model zoo
    cfg.SOLVER.IMS_PER_BATCH = 2
    cfg.SOLVER.BASE_LR = 0.001
    cfg.SOLVER.WARMUP_ITERS = 50
    cfg.SOLVER.MAX_ITER = 500  # adjust up if val mAP is still rising, adjust down if overfitting
    cfg.SOLVER.STEPS = (50, 450)
    cfg.SOLVER.GAMMA = 0.05
    cfg.MODEL.ROI_HEADS.BATCH_SIZE_PER_IMAGE = 32
    cfg.MODEL.ROI_HEADS.NUM_CLASSES = 4  # number of foreground classes (detectron2 does not need the +1 for background)
    cfg.TEST.EVAL_PERIOD = 500
    # cfg.merge_from_list(args.opts)
    # cfg.freeze()
    default_setup(cfg, args)
    print("END SETUP")
    return cfg

def main():
    global cfg, trainer
    cfg = setup()

    if args.eval_only:
        print("EVAL_ONLY")
        model = Trainer.build_model(cfg)
        DetectionCheckpointer(model, save_dir=cfg.OUTPUT_DIR).resume_or_load(cfg.MODEL.WEIGHTS, resume=args.resume)
        res = Trainer.test(cfg, model)
        if cfg.TEST.AUG.ENABLED:
            res.update(Trainer.test_with_TTA(cfg, model))
        if comm.is_main_process():
            verify_results(cfg, res)
        return res
    print("BEFORE TRAINER")
    trainer = Trainer(cfg)
    trainer.resume_or_load(resume=args.resume)
    if cfg.TEST.AUG.ENABLED:
        print("TEST AUG ENABLED")
        trainer.register_hooks([hooks.EvalHook(0, lambda: trainer.test_with_TTA(cfg, trainer.model))])
    print("BEFORE MAIN END")
    return trainer.train()

if __name__ == "__main__":
    from datetime import datetime

    args = default_argument_parser().parse_args()
    print("Command Line Args:", args)
    register_coco_instances("my_dataset_train", {}, "./data/train/_annotations.coco.json", "./data/train")
    register_coco_instances("my_dataset_val", {}, "./data/valid/_annotations.coco.json", "./data/valid")
    register_coco_instances("my_dataset_test", {}, "./data/test/_annotations.coco.json", "./data/test")

    print("Is CUDA available:", torch.cuda.is_available())
    now = datetime.now()
    print("before launch =", now)

    launch(main, num_gpus_per_machine = args.num_gpus, dist_url = "auto")

    now = datetime.now()
    print("after launch =", now)

    # Evaluate

    from detectron2.data import DatasetCatalog, build_detection_test_loader
    from detectron2.evaluation import inference_on_dataset

    cfg.MODEL.WEIGHTS = os.path.join(cfg.OUTPUT_DIR, "model_final.pth")
    cfg.MODEL.ROI_HEADS.SCORE_THRESH_TEST = 0.85
    predictor = DefaultPredictor(cfg)
    evaluator = COCOEvaluator("my_dataset_test", cfg, False, output_dir="./output/")
    val_loader = build_detection_test_loader(cfg, "my_dataset_test")
    inference_on_dataset(trainer.model, val_loader, evaluator)

    cfg.MODEL.WEIGHTS = os.path.join(cfg.OUTPUT_DIR, "model_final.pth")
    cfg.DATASETS.TEST = ("my_dataset_test", )
    cfg.MODEL.ROI_HEADS.SCORE_THRESH_TEST = 0.7   # set the testing threshold for this model
    predictor = DefaultPredictor(cfg)
    test_metadata = MetadataCatalog.get("my_dataset_test")

    with open("results.txt", mode="a") as f:
        for imageName in glob.glob('data/test/*jpg'):
            im = cv2.imread(imageName)
            outputs = predictor(im)
            f.write(f"Instances:{outputs['instances']}\n")
            v = Visualizer(im[:, :, ::-1], metadata=test_metadata)
            out = v.draw_instance_predictions(outputs["instances"].to("cpu"))
            cv2.imwrite(os.path.join("images", os.path.basename(imageName)), out.get_image()[:, :, ::-1])  # keep only the file name so results land in images/

On the other hand, I have this slurm script to run an experiment on 2 GPUs:

#!/bin/bash -l

#SBATCH --account=Account
#SBATCH --partition=gpu # gpu partition
#SBATCH --nodes=1 # 1 node, 4 GPUs per node
#SBATCH --time=24:00:00 
#SBATCH --job-name=detectron2_demo4 # job name

module load Python/3.9.5-GCCcore-10.3.0
module load CUDA/11.1.1-GCC-10.2.0

cd /experiment_path

srun python main.py --num-gpus 2

When I ran this script I got the following output (cat slurm-xxx.out); there was no error file:

The following have been reloaded with a version change:
  1) GCCcore/10.3.0 => GCCcore/10.2.0
  2) binutils/2.36.1-GCCcore-10.3.0 => binutils/2.35-GCCcore-10.2.0
  3) zlib/1.2.11-GCCcore-10.3.0 => zlib/1.2.11-GCCcore-10.2.0

Command Line Args: Namespace(config_file='', resume=False, eval_only=False, num_gpus=2, num_machines=1, machine_rank=0, dist_url='tcp://127.0.0.1:54190', opts=[])
print("Is CUDA available:", torch.cuda.is_available())
before launch = 2021-08-03 21:26:48.061817

[W ProcessGroupNCCL.cpp:1569] Rank 0 using best-guess GPU 0 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect.Specify device_ids in barrier() to force use of a particular device.

[W ProcessGroupNCCL.cpp:1569] Rank 1 using best-guess GPU 1 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect.Specify device_ids in barrier() to force use of a particular device.

Expected behavior:

To run training on 2 GPUs

Environment:

Output of detectron2's environment collection script:

No CUDA runtime is found, using CUDA_HOME='/usr/local/software/CUDAcore/11.1.1'
---------------------  --------------------------------------------------------------------------------
sys.platform           linux
Python                 3.9.5 (default, Jul  9 2021, 09:35:24) [GCC 10.3.0]
numpy                  1.21.1
detectron2             0.5 @/home/users/aimhigh/detectron2/detectron2
Compiler               GCC 10.2
CUDA compiler          CUDA 11.1
DETECTRON2_ENV_MODULE  <not set>
PyTorch                1.9.0+cu102 @/home/users/aimhigh/.local/lib/python3.9/site-packages/torch
PyTorch debug build    False
GPU available          No: torch.cuda.is_available() == False
Pillow                 8.3.1
torchvision            0.10.0+cu102 @/home/users/aimhigh/.local/lib/python3.9/site-packages/torchvision
fvcore                 0.1.5.post20210727
iopath                 0.1.9
cv2                    4.5.3
---------------------  --------------------------------------------------------------------------------
PyTorch built with:
  - GCC 7.3
  - C++ Version: 201402
  - Intel(R) Math Kernel Library Version 2020.0.0 Product Build 20191122 for Intel(R) 64 architecture applications
  - Intel(R) MKL-DNN v2.1.2 
  - OpenMP 201511 (a.k.a. OpenMP 4.5)
  - NNPACK is enabled
  - CPU capability usage: AVX2

ppwwyyxx commented 3 years ago

[W ProcessGroupNCCL.cpp:1569]

This is not an error but a warning, and it is benign.

We should, however, try not to print this warning.
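
For reference, the warning goes away once each rank pins its GPU before the first collective and passes device_ids to barrier(). A minimal sketch in plain PyTorch (not the detectron2 code), assuming 2 GPUs on one machine:

import torch
import torch.distributed as dist
import torch.multiprocessing as mp

def _pin_and_barrier(local_rank, world_size, dist_url):
    # Pin this process to a single GPU before the first NCCL collective.
    torch.cuda.set_device(local_rank)
    dist.init_process_group(backend="nccl", init_method=dist_url,
                            rank=local_rank, world_size=world_size)
    # Passing device_ids tells NCCL which GPU to use for the barrier,
    # which is exactly what the warning asks for.
    dist.barrier(device_ids=[local_rank])
    dist.destroy_process_group()

if __name__ == "__main__":
    world_size = 2  # assumes 2 GPUs
    mp.spawn(_pin_and_barrier, nprocs=world_size,
             args=(world_size, "file:///tmp/pin_gpu_test"))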

ppwwyyxx commented 3 years ago

Expected behavior: To run training on 2 GPUs

What issue is unexpected? Does it not run training? It's not clear from the issue whether training already runs, since the full log is not given.

The environment info also reports some errors.

StevanCakic commented 3 years ago

@ppwwyyxx Thank you for your answer. After this warning I can't see any further output in my log (tail -f slurm-**.out), so I presume training hasn't started, which suggests something is wrong with the launch call (probably my code in the main function).

Everything works fine when I run the experiment on one GPU (without the launch function). Maybe something is wrong with the Slurm configuration, I can't tell for sure. I sent the full log; there is no output after what I posted, so training hasn't started. If you need more details, I'm here.

ppwwyyxx commented 3 years ago

Please uncomment default_setup(cfg, args) and then provide all logs of the run (.out and .err if any)

StevanCakic commented 3 years ago

@ppwwyyxx Hm, strange. I updated the main script (inserted some print statements to check whether the code executes) and, as you can see, only one print is reached: before launch = 2021-08-03 21:26:48.061817. So the main function hasn't started at all. Could dist_url = "auto" be the problem? When I don't pass this parameter I get a different error, and I found your suggestion to set dist_url to "auto".

StevanCakic commented 3 years ago

@ppwwyyxx I think we need something like this. Setting dist_url='auto' probably won't work when I run the script under Slurm. What do you think about this chunk of code to set up dist_url:

# slurm available
import os
if args.world_size == -1 and "SLURM_NPROCS" in os.environ:
    args.world_size = int(os.environ["SLURM_NPROCS"])
    args.rank = int(os.environ["SLURM_PROCID"])
    jobid = os.environ["SLURM_JOBID"]
    hostfile = "dist_url." + jobid + ".txt"
    if args.dist_file is not None:
        args.dist_url = "file://{}.{}".format(os.path.realpath(args.dist_file), jobid)
    elif args.rank == 0:
        import socket
        ip = socket.gethostbyname(socket.gethostname())
        port = find_free_port()
        args.dist_url = "tcp://{}:{}".format(ip, port)
        with open(hostfile, "w") as f:
            f.write(args.dist_url)
    else:
        import time
        while not os.path.exists(hostfile):
            time.sleep(1)
        with open(hostfile, "r") as f:
            args.dist_url = f.read()
    print("dist-url:{} at PROCID {} / {}".format(args.dist_url, args.rank, args.world_size))

Here is the reference link
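
(find_free_port is not defined in that snippet; a small helper roughly like the one detectron2's launch uses internally would do:)

import socket

def find_free_port():
    # Bind to port 0 so the OS picks an unused port, then release it for reuse.
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.bind(("", 0))
    port = sock.getsockname()[1]
    sock.close()
    return port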

ppwwyyxx commented 3 years ago

The above code is not useful for single-node training.

We still can't reproduce the reported issue: if launch is not working, the issue is likely specific to the environment, e.g. how the GPUs are configured. Could you try whether #3322 helps?

StevanCakic commented 3 years ago

@ppwwyyxx #3322 solved the problem of the warning message being printed, but it doesn't solve the problem of the launch call never starting training. So it is definitely something with our Slurm HPC configuration for the GPUs, or with dist_url. Now the output looks like this:

Command Line Args: Namespace(config_file='', resume=False, eval_only=False, num_gpus=2, num_machines=1, machine_rank=0, dist_url='tcp://127.0.0.1:54190', opts=[])
Is CUDA available: True
before launch = 2021-08-04 11:32:24.247479

Note: I updated the script to check CUDA availability

ppwwyyxx commented 3 years ago

import torch, os

def test_nccl_ops():
    num_gpu = 2

    import torch.multiprocessing as mp
    dist_url = "file:///tmp/nccl_tmp_file"
    mp.spawn(_test_nccl_worker, nprocs=num_gpu, args=(num_gpu, dist_url), daemon=False)
    print("NCCL init succeeded.")

def _test_nccl_worker(rank, num_gpu, dist_url):
    import torch.distributed as dist

    dist.init_process_group(backend="NCCL", init_method=dist_url, rank=rank, world_size=num_gpu)
    dist.barrier()
    print("Worker after barrier")

if __name__ == "__main__":
    test_nccl_ops()

I believe the above is almost equivalent to what launch() does, but without any detectron2 code. You can check if this still fails.
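
If it hangs in the same way, run it with NCCL_DEBUG=INFO so NCCL logs its bootstrap and transport choices from every rank; that usually shows which network interface it is trying to use. One way to set it (equivalently, export the variable in the Slurm script):

import os

# mp.spawn children inherit the parent's environment, so setting this
# before calling test_nccl_ops() is enough to get NCCL logs from every rank.
os.environ["NCCL_DEBUG"] = "INFO"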

StevanCakic commented 3 years ago

@ppwwyyxx I checked the launch function directly and, with some print statements added, I conclude that the program gets stuck in mp.spawn:

def launch(
    main_func,
    num_gpus_per_machine,
    num_machines=1,
    machine_rank=0,
    dist_url=None,
    args=(),
    timeout=DEFAULT_TIMEOUT,
):
    """
    Launch multi-gpu or distributed training.
    This function must be called on all machines involved in the training.
    It will spawn child processes (defined by ``num_gpus_per_machine``) on each machine.

    Args:
        main_func: a function that will be called by `main_func(*args)`
        num_gpus_per_machine (int): number of GPUs per machine
        num_machines (int): the total number of machines
        machine_rank (int): the rank of this machine
        dist_url (str): url to connect to for distributed jobs, including protocol
                       e.g. "tcp://127.0.0.1:8686".
                       Can be set to "auto" to automatically select a free port on localhost
        timeout (timedelta): timeout of the distributed workers
        args (tuple): arguments passed to main_func
    """
    world_size = num_machines * num_gpus_per_machine
    print("START LAUNCH")
    if world_size > 1:
        # https://github.com/pytorch/pytorch/pull/14391
        # TODO prctl in spawned processes
        print("WORLD SIZE > 1")
        if dist_url == "auto":
            assert num_machines == 1, "dist_url=auto not supported in multi-machine jobs."
            port = _find_free_port()
            dist_url = f"tcp://127.0.0.1:{port}"
            print("DIST URL", dist_url)
        if num_machines > 1 and dist_url.startswith("file://"):
            print("DIST URL FILE", dist_url)
            logger = logging.getLogger(__name__)
            logger.warning(
                "file:// is not a reliable init_method in multi-machine jobs. Prefer tcp://"
            )
        print("PRINT BEFORE SPAWN")
        mp.spawn(
            _distributed_worker,
            nprocs=num_gpus_per_machine,
            args=(
                main_func,
                world_size,
                num_gpus_per_machine,
                machine_rank,
                dist_url,
                args,
                timeout,
            ),
            daemon=False,
        )
        print("PRINT AFTER SPAWN")
    else:
        print("WORLD SIZE == 1")
        main_func(*args)

When the program enters launch, the output looks like this:

START LAUNCH
WORLD SIZE > 1
DIST URL tcp://127.0.0.1:50598
PRINT BEFORE SPAWN
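
So the hang happens inside the spawned workers, before any of my code runs. If I read launch correctly, each child process does roughly the following (a simplified sketch, not the exact detectron2 code):

import torch
import torch.distributed as dist

def _distributed_worker_sketch(local_rank, main_func, world_size,
                               num_gpus_per_machine, machine_rank, dist_url, args):
    global_rank = machine_rank * num_gpus_per_machine + local_rank
    # 1) join the process group; a wrong dist_url or a blocked port blocks here forever
    dist.init_process_group(backend="NCCL", init_method=dist_url,
                            world_size=world_size, rank=global_rank)
    # 2) pin the GPU for this rank and synchronize all ranks (an NCCL barrier);
    #    a broken NCCL transport makes every rank hang here
    torch.cuda.set_device(local_rank)
    dist.barrier()
    # 3) only after that does main_func (my main()) get called
    main_func(*args)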

Now I will test the test_nccl_ops function and let you know about the output. Thanks for the help so far.

StevanCakic commented 3 years ago

@ppwwyyxx It is definitely something with mp.spawn; same output as before:

Command Line Args: Namespace(config_file='', resume=False, eval_only=False, num_gpus=2, num_machines=1, machine_rank=0, dist_url='tcp://127.0.0.1:54190', opts=[])
Is CUDA available: True
before launch = 2021-08-04 12:11:15.309527
BEFORE TEST NCCL OPS
[W ProcessGroupNCCL.cpp:1569] Rank 0 using best-guess GPU 0 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect.Specify device_ids in barrier() to force use of a particular device.
[W ProcessGroupNCCL.cpp:1569] Rank 1 using best-guess GPU 1 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect.Specify device_ids in barrier() to force use of a particular device.

Maybe dist_url is not generated correctly, or maybe it is correct but the HPC cluster doesn't allow communication on that specific port.

Also, I tried to check whether _test_nccl_worker starts at all, and it does not (I can't see any of its print output in the log):

def _test_nccl_worker(rank, num_gpu, dist_url):
    import torch.distributed as dist
    print("Before init")
    dist.init_process_group(backend="NCCL", init_method=dist_url, rank=rank, world_size=num_gpu)
    print("Before barrier")
    dist.barrier()
    print("Worker after barrier")

StevanCakic commented 3 years ago

@ppwwyyxx Look at this example:

def example_spawn_worker(gpu, queue, event):
    print(f'gpu {gpu} putting into queue')
    queue.put({'gpu': gpu})

    print(f'gpu {gpu} waiting')
    event.wait()

def example_spawn():
    num_gpus = 4

    ################################################################################
    mp.set_start_method('spawn')  # set start method to 'spawn' BEFORE instantiating the queue and the event
    ################################################################################
    queue = mp.Queue()
    event = mp.Event()
    context = mp.spawn(example_spawn_worker, nprocs=num_gpus, args=(queue, event), join=False)
    print('started processes')

    for i in range(num_gpus):
        print(f'getting {i}th queue value')
        d = queue.get()
        print('popped', d)

    event.set()
    context.join()
if __name__ == "__main__":
    print("BEFORE EXAMPLE SPAWN")
    example_spawn()
    print("AFTER EXAMPLE SPAWN")

Output looks like this:

BEFORE EXAMPLE SPAWN
gpu 2 putting into queue
gpu 2 waiting
gpu 1 putting into queue
gpu 1 waiting
gpu 0 putting into queue
gpu 0 waiting
gpu 3 putting into queue
gpu 3 waiting
started processes
getting 0th queue value
popped {'gpu': 2}
getting 1th queue value
popped {'gpu': 3}
getting 2th queue value
popped {'gpu': 0}
getting 3th queue value
popped {'gpu': 1}
AFTER EXAMPLE SPAWN

StevanCakic commented 3 years ago

Also, nvidia-smi doesn't work on the login node. One workaround: while the job is active (while the computation is running), I can use the squeue command to find out which node it is running on and then SSH into that node, e.g. ssh gpu01. I started my Slurm script (the one in the first post of this issue) from the login node.

While the job is active, SSH to that node is possible; once it finishes, it no longer is. Then I can run nvidia-smi on the GPU node, but to do that I need SSH agent forwarding set up in the SSH configuration on my machine. Maybe I'm missing the point with this remark, but I wanted to mention it.
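
As an alternative sanity check, instead of nvidia-smi I can run a tiny script inside the allocation to see what the job actually exposes to PyTorch, for example:

import torch

# Quick check of what the Slurm allocation exposes to this process.
print("CUDA available:", torch.cuda.is_available())
print("Visible GPUs:", torch.cuda.device_count())
for i in range(torch.cuda.device_count()):
    print(f"  cuda:{i} ->", torch.cuda.get_device_name(i))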

ppwwyyxx commented 3 years ago

Since the hang is reproduced without detectron2, the issue is unrelated to detectron2, so we're closing it. It's more likely an issue with NCCL. You can use this repro (https://github.com/facebookresearch/detectron2/issues/3319#issuecomment-892526273) to report it to NCCL or PyTorch for help. Running with NCCL_DEBUG=INFO will allow them to help you better.

StevanCakic commented 3 years ago

It was a problem with the Slurm job configuration. I solved it by adding these lines to the Slurm script:

export NCCL_SOCKET_IFNAME=bond0   # tell NCCL which network interface to use for its sockets
export NCCL_IB_DISABLE=1          # disable the InfiniBand transport, fall back to TCP sockets
export NCCL_P2P_DISABLE=1         # disable direct GPU peer-to-peer transfers