lucasjinreal opened 9 months ago
May I ask what your DeepSpeed version is?
I am using the latest one. Also, why does it have to use DeepSpeed? Is that necessary for a single GPU?
I ran into a lot of trouble when running this DeepSpeed inference:
return torch.distributed.all_to_all_single(output=output,
File "/data/miniconda3/envs/env-3.8.8/lib/python3.8/site-packages/torch/distributed/c10d_logger.py", line 47, in wrapper
return func(*args, **kwargs)
File "/data/miniconda3/envs/env-3.8.8/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 3528, in all_to_all_single
work = default_pg.alltoall_base(
torch.distributed.DistBackendError: NCCL error in: ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1333, internal error - please report this issue to the NCCL developers, NCCL version 2.18.6
ncclInternalError: Internal check failed.
Last error:
Attribute busid of node nic not found
Could you try it again with our recommended setting? The DeepSpeed version is 0.9.5.
Currently a single GPU also needs DeepSpeed initialization, as this facilitates model scaling for those who want to make MoE-LLaVA bigger. However, we are working on non-DDP single-GPU inference for convenience.
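To make that concrete, here is a minimal, hypothetical sketch of what "DeepSpeed initialization on one GPU" can amount to: bringing up a one-process distributed group so DeepSpeed's collectives have a backend. The environment-variable values and the standalone layout are assumptions on my part, not MoE-LLaVA's actual code.

```python
import os
import torch
import deepspeed

# Hypothetical single-process setup: describe a "world" of one rank so that
# torch.distributed / DeepSpeed collectives (e.g. all_to_all_single used by
# MoE layers) can initialize without an MPI or deepspeed launcher.
os.environ.setdefault("RANK", "0")
os.environ.setdefault("LOCAL_RANK", "0")
os.environ.setdefault("WORLD_SIZE", "1")
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")

deepspeed.init_distributed(dist_backend="nccl")
print(torch.distributed.get_world_size())  # expected: 1
```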
Please avoid using DeepSpeed at the moment; there is a recently reported bug involving DeepSpeed and NCCL:
https://github.com/NVIDIA/nccl/issues/1051
Unfortunately, it may be related to torch 2.1 as well.
So, if the project keeps requiring DeepSpeed, users with recent torch and DeepSpeed versions won't be able to run it at all.
Why not use transformers' built-in MoE implementation? transformers already supports MoE.
While the project was underway, we found bugs in HF's MoE, so we chose the DeepSpeed implementation. I'm not sure whether it has been fixed by now.
https://github.com/huggingface/transformers/issues/28093
https://github.com/huggingface/transformers/issues/28255
https://github.com/huggingface/transformers/pull/28115
It has been merged into the latest transformers now. Using HF's MoE could avoid many weird problems and make the code cleaner.
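For reference, a minimal sketch of what relying on transformers' built-in MoE (the Mixtral-style implementation those issues and the PR refer to) might look like; the checkpoint name is only an illustrative example and none of this is MoE-LLaVA code.

```python
# Hypothetical sketch: transformers' built-in MoE via the Mixtral architecture.
# "mistralai/Mixtral-8x7B-v0.1" is just an example checkpoint, not MoE-LLaVA.
import torch
from transformers import AutoConfig, AutoModelForCausalLM

config = AutoConfig.from_pretrained("mistralai/Mixtral-8x7B-v0.1")
# Number of experts per layer and top-k routing, as exposed by the HF config.
print(config.num_local_experts, config.num_experts_per_tok)

model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mixtral-8x7B-v0.1",
    torch_dtype=torch.float16,
    device_map="auto",
)
```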
Try installing torch==2.0.1 and deepspeed==0.9.5 as shown in https://github.com/PKU-YuanGroup/MoE-LLaVA?tab=readme-ov-file#%EF%B8%8F-requirements-and-installation
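If it helps to verify the environment, here is a small, illustrative check; the version strings are just the ones recommended above.

```python
# Quick sanity check that the installed versions match the recommended setup.
import torch
import deepspeed

print("torch:", torch.__version__)          # recommended: 2.0.1
print("deepspeed:", deepspeed.__version__)  # recommended: 0.9.5

assert torch.__version__.startswith("2.0.1"), "torch 2.0.1 is the recommended version"
assert deepspeed.__version__ == "0.9.5", "deepspeed 0.9.5 is the recommended version"
```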
@LinB203 I think deepspeed < 0.10.1 cannot be installed on Python 3.8; it has a bug. I tried versions from 0.9.5 to 0.10.1 and none of them could be installed with Python 3.8.
Reinstalling torch would be a huge effort since our servers are pinned to torch 2.1.
We recommend Python >= 3.10, as with other LLaVA-based methods.
I can run any LLaVA variant except this one. I don't think Python >= 3.10 is essential for LLaVA.
I ran into this problem too, but I cannot change my torch version to 2.0.1.
I had this error because I accidentally removed --deepspeed ./scripts/zero2.json \ from the provided finetune script. Alternatively, adding the following lines at the start of train() in train.py may help (I'm not sure whether it is the intended way):
import deepspeed
# Initialize the default torch.distributed process group (NCCL backend) so
# DeepSpeed collectives have a backend even without the deepspeed launcher.
deepspeed.init_distributed(dist_backend='nccl')
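(My understanding, so take it as an assumption: deepspeed.init_distributed sets up the process group that collectives such as all_to_all_single rely on, which is likely why skipping it produces the NCCL error above. If the script is launched directly with python rather than through a launcher, environment variables like RANK, WORLD_SIZE, MASTER_ADDR and MASTER_PORT may also need to be set, as in the single-GPU sketch earlier in this thread.)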
I found this MoE runs on DeepSpeed, but DeepSpeed has issues when running on a server without MPI. Any solution?