This repository provides high-performance model inference, mainly targeting the CodeFuse model from Ant Group.
Compared to the original FasterTransformer (FT), it adds CodeFuse-specific features and optimizations.
Model: CodeFuse 13B. Measurement: latency in milliseconds, batch size 1.

| Input / Output Length | Single A100, fp16 | Single A100, int8 | 2 × A100 (Tensor Parallelism), fp16 | 2 × A100 (Tensor Parallelism), int8 |
|---|---|---|---|---|
| 16 / 8 | 160 | 195 | 238 | 84 |
| 64 / 32 | 608 | 369 | 373 | 295 |
| 256 / 128 | 2650 | 1530 | 1492 | 1130 |
| 1024 / 512 | 10776 | 7054 | 6786 | 5415 |
| Tokens per second | 48 | 75 | 77 | 98 |
We build and run inside the container image nvcr.io/nvidia/pytorch:22.09-py3.
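Once inside the container, a quick sanity check can confirm that PyTorch sees your GPUs before building. This is an optional sketch using standard PyTorch calls; the expected GPU count depends on your machine:

```python
# Optional environment check inside nvcr.io/nvidia/pytorch:22.09-py3.
import torch

print("torch version:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())
print("GPU count:", torch.cuda.device_count())
for i in range(torch.cuda.device_count()):
    print(f"GPU {i}:", torch.cuda.get_device_name(i))
```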
pip install --no-cache-dir pybind11==2.6.2 transformers accelerate sentencepiece
echo "export pybind11_DIR=/opt/conda/lib/python3.8/site-packages/pybind11/share/cmake/pybind11/" >> ~/.bashrc
export pybind11_DIR=/opt/conda/lib/python3.8/site-packages/pybind11/share/cmake/pybind11/
mkdir -p build && cd build
export TORCH_PYTHON_LIBRARIES=/opt/conda/lib/python3.8/site-packages/torch/lib/libtorch_python.so
cmake -DCMAKE_BUILD_TYPE=Release -DSM="80;75" -DBUILD_PYT=ON -DSPARSITY_SUPPORT=OFF -DMEASURE_BUILD_TIME=ON \
-DBUILD_CUTLASS_MIXED_GEMM=ON -DBUILD_MULTI_GPU=ON -DBUILD_TRT=OFF \
-DENABLE_FP8=OFF -DBUILD_PYBIND=ON -DTORCH_PYTHON_LIBRARIES=${TORCH_PYTHON_LIBRARIES} ..
make -j"$(grep -c ^processor /proc/cpuinfo)"
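After the build finishes, the TorchScript extension used by the example scripts should be under build/lib (it is passed as --lib_path below). The following is a minimal, optional sketch to confirm it loads; it assumes libth_common.so is the relevant library and that you run it from the build directory:

```python
# Verify the built TorchScript extension can be loaded (assumption: run from build/).
import os
import torch

lib_path = os.path.abspath("lib/libth_common.so")  # same library passed as --lib_path later
assert os.path.exists(lib_path), f"missing {lib_path}; did the build succeed?"
torch.classes.load_library(lib_path)  # raises if the extension cannot be loaded
print("loaded", lib_path)
```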
You can use the examples/pytorch/codefuse/huggingface_convert.py script to convert HuggingFace checkpoint files to the FasterTransformer format (a simplified sketch of the on-disk layout follows the command below).
export MODEL_NAME=codefuse
export TENSOR_PARA_SIZE=2
python ../examples/pytorch/codefuse/huggingface_convert.py \
-o ../models/${MODEL_NAME}/fastertransformer \
-i ../models/${MODEL_NAME}/transformers \
-infer_gpu_num ${TENSOR_PARA_SIZE} \
-processes 20 \
-weight_data_type fp16 \
-model_name gptneox
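For intuition, the converter shards each weight across TENSOR_PARA_SIZE ranks and writes one flat binary file per shard. The snippet below is a simplified illustration of that idea, not the actual converter; the tensor name and sizes are made up, and the real script also handles attention/MLP layouts, biases, and the model config:

```python
# Simplified illustration of tensor-parallel weight splitting (not the real converter).
import numpy as np

tensor_para_size = 2
hidden, ffn = 1024, 4096                               # made-up sizes for illustration
w = np.random.randn(hidden, ffn).astype(np.float32)    # e.g. a column-parallel MLP weight

# Split along the output dimension and write one fp16 .bin file per rank,
# roughly how FasterTransformer checkpoints are laid out on disk.
for rank, shard in enumerate(np.split(w, tensor_para_size, axis=-1)):
    shard.astype(np.float16).tofile(f"model.mlp.dense_h_to_4h.weight.{rank}.bin")
```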
You can use the examples/pytorch/codefuse/quant_and_save.py script to convert fp16 or fp32 FasterTransformer checkpoint files to int8 weights plus scales, which speeds up model loading and shrinks the checkpoint files (see the sketch after the command below).
export MODEL_NAME=codefuse
export TENSOR_PARA_SIZE=2
python ../examples/pytorch/codefuse/quant_and_save.py \
--in_dir ../models/${MODEL_NAME}/fastertransformer/${TENSOR_PARA_SIZE}-gpu \
--out_dir ../models/${MODEL_NAME}/fastertransformer/${TENSOR_PARA_SIZE}-gpu_int8 \
--lib_path ../build/lib/libth_common.so \
--tensor_para_size ${TENSOR_PARA_SIZE} \
--use_gptj_residual \
--data_type fp16
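Conceptually, the quantization step stores each weight matrix as int8 values plus per-channel floating-point scales, so the kernels can dequantize on the fly. The snippet below is a rough illustration of symmetric per-channel quantization, not the actual quant_and_save.py logic:

```python
# Rough sketch of symmetric per-channel int8 weight quantization (illustrative only).
import numpy as np

w = np.random.randn(1024, 1024).astype(np.float16)     # an fp16 weight matrix

scale = np.abs(w).max(axis=0) / 127.0                  # one scale per output channel
w_int8 = np.clip(np.round(w / scale), -127, 127).astype(np.int8)

# At load/inference time, an approximation of the original weight is recovered:
w_dequant = w_int8.astype(np.float16) * scale.astype(np.float16)
print("max abs error:", np.abs(w - w_dequant).max())
```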
You can use the examples/pytorch/codefuse/codefuse_example.py script to run model inference (a tokenizer sketch follows the commands below).
export MODEL_NAME=codefuse
# fp16 1gpu
python ../examples/pytorch/codefuse/codefuse_example.py \
--ckpt_path ../models/${MODEL_NAME}/fastertransformer/1-gpu \
--tokenizer_path ../models/${MODEL_NAME}/transformers
# int8 1gpu
python ../examples/pytorch/codefuse/codefuse_example.py \
--ckpt_path ../models/${MODEL_NAME}/fastertransformer/1-gpu_int8 \
--tokenizer_path ../models/${MODEL_NAME}/transformers \
--int8_mode 1 \
--enable_int8_weights 1
# fp16 2gpus
torchrun --nproc_per_node 2 ../examples/pytorch/codefuse/codefuse_example.py \
--world_size 2 \
--ckpt_path ../models/${MODEL_NAME}/fastertransformer/2-gpu \
--tokenizer_path ../models/${MODEL_NAME}/transformers
# int8 2gpus
torchrun --nproc_per_node 2 ../examples/pytorch/codefuse/codefuse_example.py \
--world_size 2 \
--ckpt_path ../models/${MODEL_NAME}/fastertransformer/2-gpu_int8 \
--tokenizer_path ../models/${MODEL_NAME}/transformers \
--int8_mode 1 \
--enable_int8_weights 1
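The --tokenizer_path argument points at the original HuggingFace checkpoint directory, which the example script uses to tokenize prompts and decode generated ids. Below is a minimal sketch of that part with the standard transformers API; the prompt and the trust_remote_code flag are assumptions, so adjust them to your tokenizer:

```python
# Minimal sketch of the tokenization side of inference (illustrative only).
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("../models/codefuse/transformers",
                                          trust_remote_code=True)  # flag may not be needed

prompt = "# write a quicksort function in python\n"
input_ids = tokenizer(prompt, return_tensors="pt").input_ids   # token ids fed to the FT model
# ... run FasterTransformer generation here to obtain output_ids ...
# text = tokenizer.decode(output_ids[0], skip_special_tokens=True)
```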