Closed ProHuper closed 2 years ago
Thanks for filing the issue. Could you provide the output of python -c "import bagua_core; bagua_core.show_version()"
to check the actual NCCL version used?
(base) [root@ts-fadc083f9f7d443e933cc3b7e98478a7-launcher ~]# python -c "import bagua_core; bagua_core.show_version()"
WARNING:root:Bagua cannot detect bundled NCCL library, Bagua will try to use system NCCL instead. If you encounter any error, please run `import bagua_core; bagua_core.install_deps()` or the `bagua_install_deps.py` script to install bundled libraries.
project_name: bagua-core-internal
is_debug: false
version:
pkg_version:0.1.2
branch:master
commit_hash:5228e756
build_time:2021-11-02 16:44:13 +00:00
build_env:rustc 1.56.1 (59eed8a2a 2021-11-01),stable-x86_64-unknown-linux-gnu (default)
tag:
commit_hash: 5228e756b5fac9ed242f05bb1c6ce3edfa201a2f
commit_date: 2021-11-02 16:27:10 +00:00
build_os: linux-x86_64
rust_version: rustc 1.56.1 (59eed8a2a 2021-11-01)
build_time: 2021-11-02 16:44:13 +00:00
NCCL version: 21003
You need to add torch.cuda.set_device(bagua.get_local_rank())
before bagua.init_process_group().
In bagua.init_process_group()
we initialize the NCCL communicator, so the CUDA device must be set before we call it.
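To make the ordering constraint concrete, here is a small self-contained sketch (no GPUs, torch, or Bagua required; DeviceState and init_communicator are hypothetical stand-ins, not Bagua's actual code) of why the communicator setup must see the correct device:

```python
class DeviceState:
    """Toy stand-in for torch.cuda's per-process current device."""

    def __init__(self):
        self.current_device = 0  # CUDA defaults to device 0

    def set_device(self, rank):
        self.current_device = rank


def init_communicator(state, local_rank):
    """Hypothetical stand-in for the NCCL communicator setup inside
    bagua.init_process_group(): the communicator is bound to whatever
    device is current at call time, so the device must already be
    the one this process owns."""
    if state.current_device != local_rank:
        raise RuntimeError(
            f"communicator would be bound to device {state.current_device}, "
            f"but this process owns device {local_rank}; "
            "call set_device(local_rank) first"
        )
    return f"comm(device={state.current_device})"


# Correct order: set the device, then initialize the communicator.
state = DeviceState()
state.set_device(3)  # analogous to torch.cuda.set_device(bagua.get_local_rank())
print(init_communicator(state, 3))  # comm(device=3)
```

With the set_device call omitted, every rank would stay on the default device 0 and the communicator init would fail (or, worse, all ranks would silently share one GPU).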
BTW, we are also working on a DDP-compatible API. After https://github.com/BaguaSys/bagua/pull/312 gets merged, migrating from DDP to Bagua should be a matter of from bagua.torch_api.data_parallel import DistributedDataParallel as DDP.
Got it, but I still don't understand why setting the CUDA device is necessary to initialize the NCCL communicator; at least in Horovod and torch DDP there is no such constraint. Is there some consideration behind this?
Also, I noticed that Bagua invokes torch's init_process_group inside its own init_process_group. What is that for?
# TODO remove the dependency on torch process group
if not dist.is_initialized():
    torch.distributed.init_process_group(
        backend="nccl",
        store=_default_store,
        rank=get_rank(),
        world_size=get_world_size(),
    )  # fmt: off
_default_pg = new_group(stream=torch.cuda.Stream(priority=-1))
That's the requirement of ncclCommInitRank.
We will eventually remove this dependency in a future release.
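The dist.is_initialized() check in the snippet above is an idempotent-init guard: Bagua can call it safely whether or not the user's script already set up a torch process group. A minimal stand-alone sketch of the pattern (a toy module-level flag, not torch's actual internals):

```python
_initialized = False


def init_process_group():
    """Idempotent init: safe to call whether or not the group was
    already initialized by the caller or an outer framework."""
    global _initialized
    if _initialized:
        return False  # already set up; do nothing
    # ... real code would create the NCCL communicator / store here ...
    _initialized = True
    return True


assert init_process_group() is True   # first call performs the init
assert init_process_group() is False  # second call is a no-op
```

This is why Bagua's init does not conflict with a script that has already called torch.distributed.init_process_group itself.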
I ran a very simple example and got an error.
I used nccl-2.10.3 and cuda-10.2 with the local NCCL, but the same error occurs when I install NCCL using bagua_core.install_deps, and everything works fine if I use DDP.
Here's my code: