TabbyML / tabby

Self-hosted AI coding assistant
https://tabby.tabbyml.com/

The script `huggingface_gptneox_convert.py` runs into a problem when using tensor parallelism. #107

Closed zhang-ge-hao closed 1 year ago

zhang-ge-hao commented 1 year ago

I tried to use the script in this repo to convert a GPT-NeoX model from the Hugging Face format to the FasterTransformer format.

It worked when I converted the files for single-GPU inference. However, when I converted a 2-GPU version of the FasterTransformer model files, which should work with tensor parallelism, the model generated nonsensical results.

Model: https://huggingface.co/TabbyML/NeoX-70M

Convert command:

# 1-gpu
python huggingface_gptneox_convert.py \
    -i /input/huggingface/model/path -o /output/fastertransformer/model/path -i_g 1 -m_n gptneox
# 2-gpu
python huggingface_gptneox_convert.py \
    -i /input/huggingface/model/path -o /output/fastertransformer/model/path -i_g 2 -m_n gptneox
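
For reference, FasterTransformer conversion scripts generally shard each weight matrix across the -i_g ranks: column-wise (along the output dimension) for projections whose outputs are parallelized, such as the fused QKV and FFN up-projection, and row-wise (along the input dimension) for projections whose inputs are parallelized, such as the attention output and FFN down-projection. A minimal sketch of that splitting step, assuming weights already transposed to FT's [in, out] layout and using hypothetical name suffixes (not the exact keys in huggingface_gptneox_convert.py):

import numpy as np

def split_for_tensor_parallel(name, weight, infer_gpu_num):
    """Illustrative sketch: shard one weight matrix across tensor-parallel ranks."""
    # Assumption: `weight` is already in FT's [in_features, out_features] layout.
    if name.endswith(("query_key_value.weight", "dense_h_to_4h.weight")):
        # Column-parallel layers: split along the output axis.
        return np.split(weight, infer_gpu_num, axis=-1)
    if name.endswith(("attention.dense.weight", "dense_4h_to_h.weight")):
        # Row-parallel layers: split along the input axis.
        return np.split(weight, infer_gpu_num, axis=0)
    # Layer norms, embeddings, and row-parallel biases are replicated on every rank.
    return [weight] * infer_gpu_num

One possible pitfall (a guess, not a confirmed diagnosis of this issue): GPT-NeoX stores the fused QKV weight interleaved per attention head, so a naive split along the output axis can hand each rank a mixture of Q, K, and V slices, which could explain output that only degenerates under tensor parallelism.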

1-gpu FT model result:

[WARNING] gemm_config.in is not found; using default GEMM algo
[FT][WARNING] Skip NCCL initialization since requested tensor/pipeline parallel sizes are equals to 1.
[WARNING] gemm_config.in is not found; using default GEMM algo
[FT][WARNING] Skip NCCL initialization since requested tensor/pipeline parallel sizes are equals to 1.
====================
latency: 0.011725187301635742
--------------------
prompt: 
--------------------
Game start, 
--------------------
output: 
--------------------

The first thing you notice is that the first thing you notice is that

2-gpu FT model result:

My script did not check the process rank before printing, so the result was printed twice (a rank-guard sketch follows the log below).

[WARNING] gemm_config.in is not found; using default GEMM algo
[WARNING] gemm_config.in is not found; using default GEMM algo
[FT][INFO] NCCL initialized rank=0 world_size=2 tensor_para=NcclParam[rank=0, world_size=2, nccl_comm=0x55b3a743e150] pipeline_para=NcclParam[rank=0, world_size=1, nccl_comm=0x55b3a6d71000]
[FT][INFO] NCCL initialized rank=1 world_size=2 tensor_para=NcclParam[rank=1, world_size=2, nccl_comm=0x557fa70e5f20] pipeline_para=NcclParam[rank=0, world_size=1, nccl_comm=0x557fa6aca340]
[WARNING] gemm_config.in is not found; using default GEMM algo
[WARNING] gemm_config.in is not found; using default GEMM algo
[FT][INFO] NCCL initialized rank=1 world_size=2 tensor_para=NcclParam[rank=1, world_size=2, nccl_comm=0x557fa70e5f20] pipeline_para=NcclParam[rank=0, world_size=1, nccl_comm=0x557fa6aca340]
[FT][INFO] NCCL initialized rank=0 world_size=2 tensor_para=NcclParam[rank=0, world_size=2, nccl_comm=0x55b3a743e150] pipeline_para=NcclParam[rank=0, world_size=1, nccl_comm=0x55b3a6d71000]
====================
latency: 0.011738300323486328
--------------------
prompt: 
--------------------
Game start, 
--------------------
output: 
--------------------
,,,,,,,,,,,,,,,,
====================
latency: 0.011530399322509766
--------------------
prompt: 
--------------------
Game start, 
--------------------
output: 
--------------------
,,,,,,,,,,,,,,,,
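
For reference, the duplicate output above can be avoided by reporting from a single rank; every rank must still run inference, since all ranks participate in the NCCL collectives. A minimal sketch, assuming the benchmark is launched with mpirun and mpi4py is available (the actual rank API depends on how the script spawns its processes):

from mpi4py import MPI  # assumption: processes are launched via mpirun

def report(latency, prompt, output):
    """Print results from rank 0 only; every rank still runs the model."""
    if MPI.COMM_WORLD.Get_rank() != 0:
        return
    print("=" * 20)
    print(f"latency: {latency}")
    print("-" * 20)
    print(f"prompt: {prompt}")
    print("-" * 20)
    print(f"output: {output}")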
zhang-ge-hao commented 1 year ago

@wsxiaoys

Hi, thank you for providing such a great script.

I really hope you could try using the script to convert a model for tensor-parallel inference.

wsxiaoys commented 1 year ago

Hi @AkiyamaYummy,

Unfortunately, the conversion script is maintained solely for Tabby's use case (a single GPU), so fixing the tensor parallelism case is unlikely to be a priority for us. Maybe you could compare it with the GPT-J conversion script (which works with tensor parallelism) and debug it yourself.
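
One way to start that debugging (a sketch under assumed file naming and dtype, not the scripts' actual layout) is to verify that the 2-GPU shards of each weight can be re-assembled into the corresponding 1-GPU file; tensors where that check fails are candidates for a wrong split axis:

import glob
import os
import numpy as np

def check_shards(dir_1gpu, dir_2gpu, dtype=np.float16):
    """Compare single-GPU weight files against re-assembled 2-GPU shards.

    Assumes one raw .bin file per tensor per rank, named like
    model.layers.0.attention.query_key_value.weight.<rank>.bin, and an fp16
    weight data type; adjust both if the converter writes something else.
    """
    for path in sorted(glob.glob(os.path.join(dir_1gpu, "*.0.bin"))):
        name = os.path.basename(path)[: -len(".0.bin")]
        full = np.fromfile(path, dtype=dtype)
        shards = [
            np.fromfile(os.path.join(dir_2gpu, f"{name}.{rank}.bin"), dtype=dtype)
            for rank in range(2)
        ]
        merged = np.concatenate(shards)
        # The raw files carry no shape, so only the element count and the
        # multiset of values can be compared; a mismatch still narrows down
        # which tensors to inspect with their real shapes.
        ok = merged.size == full.size and np.allclose(np.sort(merged), np.sort(full))
        print(f"{name}: {'OK' if ok else 'MISMATCH'}")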