ali-vilab / VGen

Official repo for VGen: a holistic video generation ecosystem for video generation building on diffusion models
https://i2vgen-xl.github.io

Distributed package doesn't have NCCL built in #101

Open 23Rj20 opened 6 months ago

23Rj20 commented 6 months ago

I am using Windows 11 with a 16 GB A4000 GPU.

Error after running the command "python train_net.py --cfg configs/t2v_train.yaml":

WARNING[XFORMERS]: xFormers can't load C++/CUDA extensions. xFormers was built for:
    PyTorch 2.2.0+cu121 with CUDA 1201 (you have 2.2.1)
    Python  3.8.10 (you have 3.8.18)
  Please reinstall xformers (see https://github.com/facebookresearch/xformers#installing-xformers)
  Memory-efficient attention, SwiGLU, sparse and more won't be available.
  Set XFORMERS_MORE_DETAILS=1 for more details
A matching Triton is not available, some optimizations will not be enabled
Traceback (most recent call last):
  File "C:\Users\INP_Rohit\.conda\envs\vgen\lib\site-packages\xformers\__init__.py", line 55, in _is_triton_available
    from xformers.triton.softmax import softmax as triton_softmax  # noqa
  File "C:\Users\INP_Rohit\.conda\envs\vgen\lib\site-packages\xformers\triton\softmax.py", line 11, in <module>
    import triton
ModuleNotFoundError: No module named 'triton'
Traceback (most recent call last):
  File "C:\Users\INP_Rohit\Documents\ImageGeneration\i2vgen-xl\utils\registry.py", line 67, in build_from_config
    return req_type_entry(*cfg)
  File "C:\Users\INP_Rohit\Documents\ImageGeneration\i2vgen-xl\tools\train\train_t2v_enterance.py", line 59, in train_t2v_entrance
    worker(0, cfg)
  File "C:\Users\INP_Rohit\Documents\ImageGeneration\i2vgen-xl\tools\train\train_t2v_enterance.py", line 75, in worker
    dist.init_process_group(backend='nccl', world_size=cfg.world_size, rank=cfg.rank)
  File "C:\Users\INP_Rohit\.conda\envs\vgen\lib\site-packages\torch\distributed\c10d_logger.py", line 86, in wrapper
    func_return = func(*args, **kwargs)
  File "C:\Users\INP_Rohit\.conda\envs\vgen\lib\site-packages\torch\distributed\distributed_c10d.py", line 1184, in init_process_group
    default_pg, _ = _new_process_group_helper(
  File "C:\Users\INP_Rohit\.conda\envs\vgen\lib\site-packages\torch\distributed\distributed_c10d.py", line 1302, in _new_process_group_helper
    raise RuntimeError("Distributed package doesn't have NCCL built in")
RuntimeError: Distributed package doesn't have NCCL built in

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "train_net.py", line 18, in <module>
    ENGINE.build(dict(type=cfg_update.TASK_TYPE), cfg_update=cfg_update.cfg_dict)
  File "C:\Users\INP_Rohit\Documents\ImageGeneration\i2vgen-xl\utils\registry.py", line 107, in build
    return self.build_func(*args, kwargs, registry=self)
  File "C:\Users\INP_Rohit\Documents\ImageGeneration\i2vgen-xl\utils\registry_class.py", line 7, in build_func
    return build_from_config(cfg, registry, kwargs)
  File "C:\Users\INP_Rohit\Documents\ImageGeneration\i2vgen-xl\utils\registry.py", line 69, in build_from_config
    raise Exception(f"Failed to invoke function {req_type_entry}, with {e}")
Exception: Failed to invoke function <function train_t2v_entrance at 0x00000285FCD03AF0>, with Distributed package doesn't have NCCL built in
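
For context on the traceback above: the failing call is dist.init_process_group(backend='nccl', ...), and PyTorch builds for Windows do not ship the NCCL backend. One possible workaround, sketched below, is to choose the backend at runtime and fall back to gloo. The helper name pick_backend and its parameters are hypothetical, not part of VGen or PyTorch, and this has not been checked against the rest of the VGen training code, which may assume NCCL elsewhere.

```python
import sys


def pick_backend(platform=None, nccl_available=None):
    """Return 'nccl' when it is usable, otherwise 'gloo'.

    Hypothetical helper (not part of VGen or PyTorch). Both arguments
    are optional overrides, mainly so the logic can be exercised
    without a GPU or a PyTorch install.
    """
    if platform is None:
        platform = sys.platform
    if nccl_available is None:
        try:
            import torch.distributed as dist
            # is_nccl_available() is only queried when the distributed
            # package itself is available in this build.
            nccl_available = dist.is_available() and dist.is_nccl_available()
        except ImportError:
            nccl_available = False
    # Windows builds of PyTorch do not include NCCL, so use gloo there.
    if platform.startswith("win") or not nccl_available:
        return "gloo"
    return "nccl"


if __name__ == "__main__":
    print(pick_backend())
```

Under these assumptions, the call in tools/train/train_t2v_enterance.py would become dist.init_process_group(backend=pick_backend(), world_size=cfg.world_size, rank=cfg.rank). Gloo is generally slower than NCCL for GPU collectives, so this is a way to get a single-machine Windows run started rather than a performance-equivalent replacement.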