Closed: xiaohulihutu closed this issue 2 years ago
Hi, thanks for your interest!
I tested the code on Ubuntu 20.04 with the NCCL backend. Windows does not support the NCCL backend, so you have to use gloo. However, it does not seem to be a Windows-related problem, since I got the same error when I ran with gloo on Ubuntu 😂
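A minimal sketch of picking the backend by platform, assuming the only rule is the one stated above (gloo on Windows, NCCL elsewhere). The helper name pick_backend is hypothetical and is not part of this repository.

```python
import sys


def pick_backend(platform: str = sys.platform) -> str:
    """Return the distributed backend name for the given platform string.

    NCCL is not available on Windows, so fall back to gloo there.
    `platform` defaults to sys.platform ("win32" on Windows, "linux" on Linux).
    """
    return "gloo" if platform.startswith("win") else "nccl"


if __name__ == "__main__":
    print(pick_backend())
```

The returned string is the kind of value that torch.distributed.init_process_group accepts for its backend argument. Note that this only selects a backend; it does not avoid the gloo error described above.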
Before I figure this out, a temporary solution would be to use DP instead of DDP, since DP does not require a communication backend. To do this, change the trainer's strategy parameter in launch.py from strategy='ddp_find_unused_parameters_false' to strategy='dp', and modify the code that aggregates multi-GPU outputs in systems/*.py, namely the validation_epoch_end and test_epoch_end functions. I'll open a new branch for this DP support very soon, and I'll let you know when I figure out this gloo error.
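As a rough illustration of the strategy switch, the sketch below builds the keyword arguments for the trainer. The two strategy strings come from this thread; the helper name trainer_kwargs and the accelerator/devices values are assumptions, since the actual arguments used in launch.py are not shown here.

```python
def trainer_kwargs(use_dp: bool, n_gpus: int) -> dict:
    """Build keyword arguments for a PyTorch Lightning 1.x Trainer.

    use_dp=True selects DataParallel ('dp'), which needs no communication
    backend; otherwise the DDP strategy string from launch.py is kept.
    """
    strategy = "dp" if use_dp else "ddp_find_unused_parameters_false"
    return {"accelerator": "gpu", "devices": n_gpus, "strategy": strategy}


# Hypothetical usage (requires pytorch_lightning 1.x, where 'dp' still exists):
# trainer = pytorch_lightning.Trainer(**trainer_kwargs(use_dp=True, n_gpus=2))
```

Changing the strategy alone is not enough: the output aggregation in systems/*.py still has to be adapted as described above.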
Windows single-GPU training is now supported in my latest commit. Please give it a try using the same training command as in the README.
It is possible to support multi-GPU training on Windows using DP, but it requires more code changes: validation_step_end and test_step_end are needed to aggregate results from all GPUs, and validation_epoch_end and test_epoch_end need changes, as the return value structure could be different. See https://pytorch-lightning.readthedocs.io/en/latest/accelerators/gpu_intermediate.html#dp-caveats for more details.
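The aggregation logic can be sketched without the framework. The example below is only a stand-in: it uses plain floats in lists where Lightning would pass tensors, and the function names and the "psnr" key are hypothetical, not taken from systems/*.py.

```python
from statistics import fmean


def merge_step_outputs(gathered: dict) -> dict:
    """Stand-in for the body of validation_step_end under DP.

    `gathered` maps each output name to a list with one value per GPU.
    Each metric is reduced to a single value by averaging over GPUs.
    """
    return {name: fmean(values) for name, values in gathered.items()}


def merge_epoch_outputs(step_outputs: list) -> dict:
    """Stand-in for the body of validation_epoch_end.

    `step_outputs` holds one already-merged dict per validation step.
    Averaging step averages is exact only if all steps have equal size.
    """
    names = step_outputs[0].keys()
    return {name: fmean(out[name] for out in step_outputs) for name in names}


if __name__ == "__main__":
    step = merge_step_outputs({"psnr": [30.0, 32.0]})  # two GPUs, one step
    print(merge_epoch_outputs([step, {"psnr": 29.0}]))
```

In a real LightningModule, the same reductions would be applied to tensors, and the exact structure received by each hook should be checked against the DP caveats page linked above.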
The "Invalid scalar type" error encountered when using the gloo backend is related to the bool-type parameters used in nerfacc. I'll try to fix this issue when I have time.
Thank you very much, that was quick.
Hi there, thank you for sharing, good work.
I want to run the code on Windows, and it reports an NCCL error. So I changed the backend from NCCL to gloo, and an "Invalid scalar type" error popped up.
Do you have any idea why? What environment are you running the code in? Mine is Python 3.10 and cudatoolkit 11.3, with torch 1.12.1+cu113.
Much appreciated!