PKU-YuanGroup / MoE-LLaVA

Mixture-of-Experts for Large Vision-Language Models
https://arxiv.org/abs/2401.15947
Apache License 2.0

/deepspeed/comm/comm.py", line 341, in all_to_all_single return cdb.all_to_all_single(output=output, AttributeError: 'NoneType' object has no attribute 'all_to_all_single' #12

Open lucasjinreal opened 9 months ago

lucasjinreal commented 9 months ago

I found that this MoE runs on DeepSpeed, but DeepSpeed has issues when running on a server without MPI. Any solution?

LinB203 commented 9 months ago

May I ask what your deepspeed version is?

lucasjinreal commented 9 months ago

I'm using the latest one. Also, why does it have to use deepspeed? Is that necessary for a single GPU?

I ran into a lot of trouble running this deepspeed inference:

    return torch.distributed.all_to_all_single(output=output,
  File "/data/miniconda3/envs/env-3.8.8/lib/python3.8/site-packages/torch/distributed/c10d_logger.py", line 47, in wrapper
    return func(*args, **kwargs)
  File "/data/miniconda3/envs/env-3.8.8/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 3528, in all_to_all_single
    work = default_pg.alltoall_base(
torch.distributed.DistBackendError: NCCL error in: ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1333, internal error - please report this issue to the NCCL developers, NCCL version 2.18.6
ncclInternalError: Internal check failed.
Last error:
Attribute busid of node nic not found

LinB203 commented 9 months ago

Can you try it again with our recommended setup? The deepspeed version is 0.9.5:

https://github.com/PKU-YuanGroup/MoE-LLaVA?tab=readme-ov-file#%EF%B8%8F-requirements-and-installation

Currently even a single GPU needs DeepSpeed initialization, because this facilitates model scaling for those who are making MoE-LLaVA bigger. However, we are working on non-DDP single-GPU inference for convenience.
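For anyone who just wants to poke at single-GPU inference, a minimal sketch of that initialization is below. It assumes deepspeed==0.9.5 and a local NCCL-capable GPU; the environment variables and port are placeholder values, not the project's official entry point.

import os
import deepspeed

# Single-process "distributed" setup so deepspeed.comm gets a concrete backend
# (placeholder values; change MASTER_PORT if it is already taken).
os.environ.setdefault("RANK", "0")
os.environ.setdefault("LOCAL_RANK", "0")
os.environ.setdefault("WORLD_SIZE", "1")
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")

# After this call deepspeed's comm module is no longer None, so the
# all_to_all_single used by the MoE layers can dispatch to NCCL.
deepspeed.init_distributed(dist_backend="nccl")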

lucasjinreal commented 9 months ago

Please avoid using deepspeed for the moment; there is a recently reported bug related to deepspeed and NCCL:

https://github.com/NVIDIA/nccl/issues/1051

And unfortunately, it might be related to torch 2.1 as well.

So if the project keeps requiring deepspeed, users on recent torch and deepspeed versions won't be able to run it at all.
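A quick way to check which combination you are on (just a diagnostic snippet; torch.cuda.nccl.version() reports the NCCL build that torch was compiled against):

import torch

# Compare these against the versions in the traceback above
# (e.g. torch 2.1.x bundled with NCCL 2.18.6).
print(torch.__version__)
print(torch.cuda.nccl.version())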

lucasjinreal commented 9 months ago

Why not use transformers' built-in MoE implementation? transformers already supports MoE.

LinB203 commented 9 months ago

While the project was underway, we found bugs in HF's MoE, so we chose the deepspeed implementation. I'm not sure whether they are fixed now.

https://github.com/huggingface/transformers/issues/28093
https://github.com/huggingface/transformers/issues/28255
https://github.com/huggingface/transformers/pull/28115

lucasjinreal commented 9 months ago

It has been merged into the latest transformers now. Using HF's MoE could avoid many weird problems and make the code cleaner.
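For context, the HF route looks roughly like the sketch below. It assumes transformers >= 4.36, where the Mixtral sparse-MoE layers landed, and the config values are purely illustrative rather than MoE-LLaVA's actual settings.

from transformers import MixtralConfig, MixtralForCausalLM

# Tiny, illustrative MoE config: 4 experts per layer with top-2 routing.
config = MixtralConfig(
    hidden_size=512,
    intermediate_size=1024,
    num_hidden_layers=4,
    num_attention_heads=8,
    num_key_value_heads=8,
    num_local_experts=4,
    num_experts_per_tok=2,
)
model = MixtralForCausalLM(config)
print(sum(p.numel() for p in model.parameters()))  # sanity check on model size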

LinB203 commented 9 months ago

> Please avoid using deepspeed for the moment; there is a recently reported bug related to deepspeed and NCCL:
>
> NVIDIA/nccl#1051
>
> And unfortunately, it might be related to torch 2.1 as well.
>
> So if the project keeps requiring deepspeed, users on recent torch and deepspeed versions won't be able to run it at all.

Try installing torch==2.0.1 and deepspeed==0.9.5 as shown in https://github.com/PKU-YuanGroup/MoE-LLaVA?tab=readme-ov-file#%EF%B8%8F-requirements-and-installation

lucasjinreal commented 9 months ago

@LinB203 I think deepspeed < 0.10.1 cannot be installed on Python 3.8; it has a bug. I tried versions from 0.9.5 to 0.10.1 and none of them can be installed with Python 3.8.

Reinstalling torch would be a huge effort, since the servers are pinned to torch 2.1.

LinB203 commented 9 months ago

We recommend Python >= 3.10, as with other LLaVA-based methods.

lucasjinreal commented 9 months ago

I can run every LLaVA variant except this one. I don't think Python >= 3.10 is essential for LLaVA.

gaoyuanzizizi commented 7 months ago

I ran into this problem too, but I can't change the torch version to 2.0.1.

CharlieFRuan commented 6 months ago

I had this error because I accidentally removed --deepspeed ./scripts/zero2.json \ from the provided finetune script. Alternatively, adding the following lines at the start of train() in train.py may help (I'm not sure it is the intended way):

import deepspeed

# Initialize torch.distributed via deepspeed so that deepspeed.comm has a
# real backend before any MoE all_to_all calls run.
deepspeed.init_distributed(dist_backend='nccl')