StevanCakic closed this issue 3 years ago.
[W ProcessGroupNCCL.cpp:1569]
This is not an error but a warning, and it is benign.
We should, however, try not to print this warning.
Expected behavior: To run training on 2 GPUs
What behavior is unexpected? Does it not run training? It's not clear from the issue whether training already runs or not, since the full log is not given.
The environment info also reports some errors.
@ppwwyyxx Thank you for your answer. After this warning, I can't see any further output in my log (tail -f slurm-**.out), so I presume training hasn't started, which suggests something is wrong around the launch function (probably my code in the main function).
Everything works fine when I run the experiment on one GPU (without the launch function). Maybe something is wrong with the Slurm configuration; I can't tell for sure. I sent the full log, and there is no output after what I posted, so training hasn't started. If you need more details, I'm here.
Please uncomment default_setup(cfg, args)
and then provide all logs of the run (.out and .err if any)
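For context, default_setup is normally called from the setup() helper in detectron2's standard train_net.py template. Since the modified script isn't shown in the thread, the following is only a sketch, based on that template, of what re-enabling the call looks like:

from detectron2.config import get_cfg
from detectron2.engine import default_setup


def setup(args):
    # Build the config as in the standard detectron2 train_net.py template.
    cfg = get_cfg()
    if args.config_file:
        cfg.merge_from_file(args.config_file)
    cfg.merge_from_list(args.opts)
    cfg.freeze()
    # Re-enable this call: it configures logging, so the .out/.err logs
    # requested above actually get written.
    default_setup(cfg, args)
    return cfg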
@ppwwyyxx Hm, strange. I updated the main script (inserted some print statements to check whether the code is executing), and as you can see only one print is executed -> before launch = 2021-08-03 21:26:48.061817. So the main function passed to launch hasn't started at all. Could dist_url = "auto" be the problem? When I didn't pass this param, I got some other error and found your suggestion to set dist_url to "auto".
@ppwwyyxx I think we need something like this. When I run this script on Slurm with dist_url='auto', it probably won't work. What do you think about this chunk of code to set up dist_url:
# slurm available
import os

if args.world_size == -1 and "SLURM_NPROCS" in os.environ:
    args.world_size = int(os.environ["SLURM_NPROCS"])
    args.rank = int(os.environ["SLURM_PROCID"])
    jobid = os.environ["SLURM_JOBID"]
    hostfile = "dist_url." + jobid + ".txt"
    if args.dist_file is not None:
        # use a shared file as the rendezvous point
        args.dist_url = "file://{}.{}".format(os.path.realpath(args.dist_file), jobid)
    elif args.rank == 0:
        # rank 0 picks a free port and publishes the URL via the hostfile
        import socket
        ip = socket.gethostbyname(socket.gethostname())
        port = find_free_port()
        args.dist_url = "tcp://{}:{}".format(ip, port)
        with open(hostfile, "w") as f:
            f.write(args.dist_url)
    else:
        # other ranks wait until rank 0 has written the hostfile
        import time
        while not os.path.exists(hostfile):
            time.sleep(1)
        with open(hostfile, "r") as f:
            args.dist_url = f.read()
    print("dist-url:{} at PROCID {} / {}".format(args.dist_url, args.rank, args.world_size))
Here is the reference link
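Note: find_free_port is not defined in the snippet above; it comes from the referenced code. A minimal sketch of such a helper, essentially the same trick as detectron2's _find_free_port, could look like this:

import socket


def find_free_port():
    # Bind to port 0 so the OS picks an unused port, then release the socket.
    # There is an inherent race: the port could be taken again before it is used.
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.bind(("", 0))
    port = sock.getsockname()[1]
    sock.close()
    return port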
The above code is not useful for single-node training.
We still can't reproduce the reported issue: if launch
is not working, the issue is likely specific to the environment, e.g. how the GPUs are configured. Could you try whether #3322 helps?
@ppwwyyxx #3322 fixed the warning message, but it doesn't solve the problem of the launch script not starting, so it's most likely something in our Slurm HPC GPU configuration or in dist_url. Now the output looks like this:
Command Line Args: Namespace(config_file='', resume=False, eval_only=False, num_gpus=2, num_machines=1, machine_rank=0, dist_url='tcp://127.0.0.1:54190', opts=[])
Is CUDA available: True
before launch = 2021-08-04 11:32:24.247479
Note: I updated the script to check CUDA availability
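The availability check itself isn't shown above; presumably it is something along these lines (a sketch of my assumption, which would produce the "Is CUDA available" line in the log):

import torch

# Sketch: print whether CUDA is usable inside the Slurm allocation.
print("Is CUDA available:", torch.cuda.is_available())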
import torch, os


def test_nccl_ops():
    num_gpu = 2
    import torch.multiprocessing as mp

    dist_url = "file:///tmp/nccl_tmp_file"
    mp.spawn(_test_nccl_worker, nprocs=num_gpu, args=(num_gpu, dist_url), daemon=False)
    print("NCCL init succeeded.")


def _test_nccl_worker(rank, num_gpu, dist_url):
    import torch.distributed as dist

    dist.init_process_group(backend="NCCL", init_method=dist_url, rank=rank, world_size=num_gpu)
    dist.barrier()
    print("Worker after barrier")


if __name__ == "__main__":
    test_nccl_ops()
I believe the above is almost equivalent to what launch()
does, but without any detectron2 code. You can check if this still fails.
@ppwwyyxx I checked the launch script directly and, with some added print statements, I conclude that the program gets stuck at mp.spawn
def launch(
    main_func,
    num_gpus_per_machine,
    num_machines=1,
    machine_rank=0,
    dist_url=None,
    args=(),
    timeout=DEFAULT_TIMEOUT,
):
    """
    Launch multi-gpu or distributed training.
    This function must be called on all machines involved in the training.
    It will spawn child processes (defined by ``num_gpus_per_machine``) on each machine.

    Args:
        main_func: a function that will be called by `main_func(*args)`
        num_gpus_per_machine (int): number of GPUs per machine
        num_machines (int): the total number of machines
        machine_rank (int): the rank of this machine
        dist_url (str): url to connect to for distributed jobs, including protocol
            e.g. "tcp://127.0.0.1:8686".
            Can be set to "auto" to automatically select a free port on localhost
        timeout (timedelta): timeout of the distributed workers
        args (tuple): arguments passed to main_func
    """
    world_size = num_machines * num_gpus_per_machine
    print("START LAUNCH")
    if world_size > 1:
        # https://github.com/pytorch/pytorch/pull/14391
        # TODO prctl in spawned processes
        print("WORLD SIZE > 1")
        if dist_url == "auto":
            assert num_machines == 1, "dist_url=auto not supported in multi-machine jobs."
            port = _find_free_port()
            dist_url = f"tcp://127.0.0.1:{port}"
        print("DIST URL", dist_url)
        if num_machines > 1 and dist_url.startswith("file://"):
            print("DIST URL FILE", dist_url)
            logger = logging.getLogger(__name__)
            logger.warning(
                "file:// is not a reliable init_method in multi-machine jobs. Prefer tcp://"
            )
        print("PRINT BEFORE SPAWN")
        mp.spawn(
            _distributed_worker,
            nprocs=num_gpus_per_machine,
            args=(
                main_func,
                world_size,
                num_gpus_per_machine,
                machine_rank,
                dist_url,
                args,
                timeout,
            ),
            daemon=False,
        )
        print("PRINT AFTER SPAWN")
    else:
        print("WORLD SIZE == 1")
        main_func(*args)
When the program enters the launch function, the output looks like this:
START LAUNCH
WORLD SIZE > 1
DIST URL tcp://127.0.0.1:50598
PRINT BEFORE SPAWN
Now I will test the test_nccl_ops function and let you know about the output. Thanks for the help so far.
@ppwwyyxx It's definitely something with mp.spawn; same output as before:
Command Line Args: Namespace(config_file='', resume=False, eval_only=False, num_gpus=2, num_machines=1, machine_rank=0, dist_url='tcp://127.0.0.1:54190', opts=[])
Is CUDA available: True
before launch = 2021-08-04 12:11:15.309527
BEFORE TEST NCCL OPS
[W ProcessGroupNCCL.cpp:1569] Rank 0 using best-guess GPU 0 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect.Specify device_ids in barrier() to force use of a particular device.
[W ProcessGroupNCCL.cpp:1569] Rank 1 using best-guess GPU 1 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect.Specify device_ids in barrier() to force use of a particular device.
Maybe dist_url is not generated correctly, or maybe it is correct but the HPC doesn't allow the script to use that specific port.
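As an aside, the ProcessGroupNCCL warning above suggests pinning each rank to its own GPU before calling barrier. A hedged sketch of what that could look like in the test worker (the function name is mine, not from the thread):

import torch
import torch.distributed as dist


def _test_nccl_worker_pinned(rank, num_gpu, dist_url):
    # Bind this process to one GPU so NCCL does not have to guess the mapping.
    torch.cuda.set_device(rank)
    dist.init_process_group(backend="NCCL", init_method=dist_url, rank=rank, world_size=num_gpu)
    # Passing device_ids avoids the "best-guess GPU" warning from ProcessGroupNCCL.
    dist.barrier(device_ids=[rank])
    print("Worker after barrier")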
Also, I tried to check whether _test_nccl_worker starts at all, and it doesn't (I can't see any of its print output in the log):
def _test_nccl_worker(rank, num_gpu, dist_url):
    import torch.distributed as dist

    print("Before init")
    dist.init_process_group(backend="NCCL", init_method=dist_url, rank=rank, world_size=num_gpu)
    print("Before barrier")
    dist.barrier()
    print("Worker after barrier")
@ppwwyyxx Look at this, for example:
import torch.multiprocessing as mp  # missing in the original snippet; mp.spawn below is torch.multiprocessing.spawn


def example_spawn_worker(gpu, queue, event):
    print(f'gpu {gpu} putting into queue')
    queue.put({'gpu': gpu})
    print(f'gpu {gpu} waiting')
    event.wait()


def example_spawn():
    num_gpus = 4
    ################################################################################
    mp.set_start_method('spawn')  # set start method to 'spawn' BEFORE instantiating the queue and the event
    ################################################################################
    queue = mp.Queue()
    event = mp.Event()
    context = mp.spawn(example_spawn_worker, nprocs=num_gpus, args=(queue, event), join=False)
    print('started processes')
    for i in range(num_gpus):
        print(f'getting {i}th queue value')
        d = queue.get()
        print('popped', d)
    event.set()
    context.join()


if __name__ == "__main__":
    print("BEFORE EXAMPLE SPAWN")
    example_spawn()
    print("AFTER EXAMPLE SPAWN")
Output looks like this:
BEFORE EXAMPLE SPAWN
gpu 2 putting into queue
gpu 2 waiting
gpu 1 putting into queue
gpu 1 waiting
gpu 0 putting into queue
gpu 0 waiting
gpu 3 putting into queue
gpu 3 waiting
started processes
getting 0th queue value
popped {'gpu': 2}
getting 1th queue value
popped {'gpu': 3}
getting 2th queue value
popped {'gpu': 0}
getting 3th queue value
popped {'gpu': 1}
AFTER EXAMPLE SPAWN
Also, nvidia-smi doesn't work on the login node. One option is that, while the job is active (while the computation is running), I can determine which node it is running on with the squeue command and then log in to that node with SSH, e.g. ssh gpu01. I started my Slurm script (the one in the first post of this issue) from the login node.
While the job is active, SSH to that node is possible; once it completes, SSH to the node is no longer possible.
That way I can use nvidia-smi on the GPU node, but to do so I need to set up SSH agent forwarding in the SSH configuration on my machine. Maybe I'm missing the point with this remark, but I wanted to mention it.
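If SSH to the compute node is inconvenient, an alternative that works on many Slurm clusters (this is an assumption about the cluster setup, not something verified here) is to run nvidia-smi inside the running job's allocation:

# <jobid> comes from squeue; on newer Slurm versions --overlap may be needed
# so the step can share resources with the already-running job step.
srun --jobid=<jobid> nvidia-smi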
Since the hang is reproduced without detectron2, the issue is unrelated to detectron2, so we're closing it.
It's more likely an NCCL issue. You can use this repro (https://github.com/facebookresearch/detectron2/issues/3319#issuecomment-892526273) to report to NCCL or PyTorch for help. Running with NCCL_DEBUG=INFO
will allow them to help you better.
It was a problem with the Slurm file. I solved it by adding these lines to the Slurm file:
export NCCL_SOCKET_IFNAME=bond0
export NCCL_IB_DISABLE=1
export NCCL_P2P_DISABLE=1
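For context, a hedged sketch of where these lines sit in a Slurm batch file (the actual script from the first post is not reproduced here, so the #SBATCH values and the final command are placeholders):

#!/bin/bash
#SBATCH --nodes=1
#SBATCH --gres=gpu:2             # placeholder; match the real job's GPU request

# The workarounds that resolved the hang on this cluster:
export NCCL_SOCKET_IFNAME=bond0  # route NCCL traffic over the bonded Ethernet interface
export NCCL_IB_DISABLE=1         # do not use the InfiniBand transport
export NCCL_P2P_DISABLE=1        # disable GPU peer-to-peer transfers
export NCCL_DEBUG=INFO           # optional: verbose NCCL logging, as suggested above

python train_net.py --num-gpus 2  # placeholder for the actual training command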
I have a problem running a modified train_net.py script on multiple GPUs.
Instructions To Reproduce the Issue:
I'm using this dataset as an experiment to test how to run detectron2 training on multiple GPUs with Slurm.
I also have this Slurm script to run an experiment on 2 GPUs:
When I ran this script I faced the following error (cat slurm-xxx.out), and there was no error file:
Expected behavior:
To run training on 2 GPUs
Environment: