bennyguo / instant-nsr-pl

Neural Surface reconstruction based on Instant-NGP. Efficient and customizable boilerplate for your research projects. Train NeuS in 10min!

MIT License

Run on windows #4

Status: Closed (xiaohulihutu closed this issue 2 years ago)

xiaohulihutu commented 2 years ago

Hi there, thank you for sharing, good work.

I want to run the code on Windows, but it fails with an NCCL error. So I changed the backend from NCCL to gloo, and then an "Invalid scalar type" error popped up.

Do you have any idea why? What environment do you run the code in? Mine is Python 3.10 and cudatoolkit 11.3, with torch 1.12.1+cu113.

Much appreciated!

bennyguo commented 2 years ago

Hi, thanks for your interest!

I tested the code on Ubuntu 20.04 with the NCCL backend. Windows does not support the NCCL backend, so you have to use gloo. However, this does not seem to be a Windows-specific problem, since I got the same error when I ran with gloo on Ubuntu 😂

Until I figure this out, a temporary workaround is to use DP instead of DDP, since DP does not require a communication backend. To do this, change the trainer argument in launch.py from strategy='ddp_find_unused_parameters_false' to strategy='dp', and modify the code in systems/*.py that aggregates multi-GPU outputs, namely the validation_epoch_end and test_epoch_end functions. I'll open a new branch with DP support very soon, and I'll let you know when I figure out this gloo error.
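For readers following along, the strategy switch described above could be sketched roughly as follows. This is only an illustration: choose_strategy is a hypothetical helper, not code from the repository, and it assumes a Lightning 1.x-era Trainer that accepts these strategy strings and treats None as "use the default".

```python
import sys


def choose_strategy(n_gpus, platform=sys.platform):
    """Pick a Lightning-style `strategy` value (hypothetical helper).

    Assumption: None means "leave the Trainer's strategy at its default",
    which is sufficient for single-device runs.
    """
    if n_gpus <= 1:
        # A single device needs no multi-GPU strategy or communication backend.
        return None
    if platform.startswith("win"):
        # NCCL is not available on Windows; DP needs no communication backend.
        return "dp"
    # The DDP setting named in this thread.
    return "ddp_find_unused_parameters_false"


strategy = choose_strategy(n_gpus=2, platform="win32")
```

The returned value would then be passed as the strategy argument when the trainer is built in launch.py. Switching to 'dp' this way still requires the aggregation changes in systems/*.py mentioned above.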

bennyguo commented 2 years ago

Windows single-GPU training is now supported in my latest commit. Please give it a try using the same training command as in the README.

It is possible to support multi-GPU training on Windows using DP, but that requires more code changes:

- implement validation_step_end and test_step_end to aggregate results from all GPUs
- possibly modify validation_epoch_end and test_epoch_end, as the structure of the return value could be different

See https://pytorch-lightning.readthedocs.io/en/latest/accelerators/gpu_intermediate.html#dp-caveats for more details.
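To make the point about differing return structures concrete, here is a minimal, framework-free sketch of the kind of aggregation the epoch-end hooks would need. Everything in it is an assumption for illustration: the 'index' and 'psnr' keys are hypothetical, plain floats and lists stand in for tensors, and it assumes that DP hands back one entry per GPU where a single device hands back a scalar.

```python
from statistics import mean


def merge_step_outputs(step_outputs):
    """Collapse per-step outputs into one metric value per image index.

    Each element is a dict with hypothetical 'index' and 'psnr' keys.
    Single-device steps carry scalars; DP-style steps are assumed to carry
    lists with one entry per GPU.
    """
    per_image = {}
    for out in step_outputs:
        indices, psnrs = out["index"], out["psnr"]
        if not isinstance(indices, (list, tuple)):
            # Normalize the single-device case to the multi-GPU shape.
            indices, psnrs = [indices], [psnrs]
        for idx, value in zip(indices, psnrs):
            # Keying by index removes duplicates (the last value wins).
            per_image[idx] = value
    return per_image


outputs = [
    {"index": 0, "psnr": 30.0},                # single-device style
    {"index": [1, 2], "psnr": [28.0, 32.0]},   # DP style, two GPUs
    {"index": [2, 3], "psnr": [32.0, 26.0]},   # index 2 appears twice
]
merged = merge_step_outputs(outputs)
mean_psnr = mean(list(merged.values()))
```

In a real system the same normalization would operate on tensors inside validation_epoch_end and test_epoch_end, and the per-GPU results would first pass through validation_step_end and test_step_end.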

The "Invalid scalar type" error encountered when using the gloo backend is related to the bool-type parameters used in nerfacc. I'll try to fix this issue when I have time.

xiaohulihutu commented 2 years ago

Thank you very much, that was quick.