NVIDIA / FasterTransformer

Transformer related optimization, including BERT, GPT
Apache License 2.0

triton fastertransformer server t5 beam search not working? #398

Open gyin94 opened 1 year ago

gyin94 commented 1 year ago

Branch/Tag/Commit

v5.2

Docker Image Version

22.08-py3

GPU name

V100

CUDA Driver

none

Reproduced Steps

Use the FasterTransformer Triton backend and set the beam width to 4 or 2. The results are the same regardless of the beam width.
byshiue commented 1 year ago

Can you provide the reproduced steps and the results you observe?

byshiue commented 1 year ago

In the T5 Triton backend, the beam width is also set on the fly, per request. We don't use the value given to the constructor; that's why we set it to 0.
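
For reference, a minimal sketch of what "set on the fly" means from the client side, using Triton's v2 HTTP inference API. The model name (fastertransformer), the input names, and the dtypes are assumptions based on the T5 example configuration in fastertransformer_backend, and the token IDs are placeholders; check everything against your own config.pbtxt.

# hypothetical request: beam_width is sent as a per-request input tensor
curl -s -X POST localhost:8000/v2/models/fastertransformer/infer \
     -H "Content-Type: application/json" \
     -d '{
  "inputs": [
    {"name": "input_ids",       "shape": [1, 4], "datatype": "UINT32", "data": [37, 7293, 19, 1]},
    {"name": "sequence_length", "shape": [1, 1], "datatype": "UINT32", "data": [4]},
    {"name": "max_output_len",  "shape": [1, 1], "datatype": "UINT32", "data": [32]},
    {"name": "beam_width",      "shape": [1, 1], "datatype": "UINT32", "data": [4]}
  ]
}'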

gyin94 commented 1 year ago

Yes, you are right about that part. I can run T5 successfully with Triton 22.07 + fastertransformer_backend v1.3 + FasterTransformer v5.2, but it breaks with a T5 v1.1 model. Try that model with beam_width 4 and t5_end_to_end_test.py; it will break:

I1219 07:01:47.681304 93764 http_server.cc:178] Started Metrics Service at 0.0.0.0:8002
terminate called after throwing an instance of 'std::runtime_error'
  what():  [FT][ERROR] CUDA runtime error: an illegal memory access was encountered /workspace/fastertransformer_backend/build/_deps/repo-ft-src/src/fastertransformer/utils/memory_utils.cu:101

Signal (6) received.
 0# 0x0000562A89BA1C19 in /opt/tritonserver/bin/tritonserver
 1# 0x00007FBA5AAE9090 in /lib/x86_64-linux-gnu/libc.so.6
 2# gsignal in /lib/x86_64-linux-gnu/libc.so.6
 3# abort in /lib/x86_64-linux-gnu/libc.so.6
 4# 0x00007FBA5AEA2911 in /lib/x86_64-linux-gnu/libstdc++.so.6
 5# 0x00007FBA5AEAE38C in /lib/x86_64-linux-gnu/libstdc++.so.6
 6# 0x00007FBA5AEAE3F7 in /lib/x86_64-linux-gnu/libstdc++.so.6
 7# 0x00007FBA5AEAE6A9 in /lib/x86_64-linux-gnu/libstdc++.so.6
 8# 0x00007FB9942C91EC in /opt/tritonserver/backends/fastertransformer/libtransformer-shared.so
 9# fastertransformer::T5Decoding<__half>::forward(fastertransformer::TensorMap*, fastertransformer::TensorMap*, fastertransformer::T5DecodingWeight<__half> const*) in /opt/tritonserver/backends/fastertransformer/libtransformer-shared.so
10# T5TritonModelInstance<__half>::forward(std::shared_ptr<std::unordered_map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, triton::Tensor, std::hash<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::equal_to<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, triton::Tensor> > > >) in /opt/tritonserver/backends/fastertransformer/libtransformer-shared.so
11# 0x00007FBA504DB207 in /opt/tritonserver/backends/fastertransformer/libtriton_fastertransformer.so
12# 0x00007FBA5AEDADE4 in /lib/x86_64-linux-gnu/libstdc++.so.6
13# 0x00007FBA5C0D9609 in /lib/x86_64-linux-gnu/libpthread.so.0
14# clone in /lib/x86_64-linux-gnu/libc.so.6
byshiue commented 1 year ago

Can you provide the reproduction steps, step by step?

gyin94 commented 1 year ago

It is exactly the same as this tutorial:

https://github.com/triton-inference-server/fastertransformer_backend/blob/main/docs/t5_guide.md#prepare-triton-t5-model-store

Change it to use git lfs clone https://huggingface.co/Alred/t5-v1_1-small-finetuned-summarization-cnn-ver1,

and then run the conversion:

python3 FasterTransformer/examples/pytorch/t5/utils/huggingface_t5_ckpt_convert.py \
        -in_file t5-v1_1-small-finetuned-summarization-cnn-ver1/ \
        -saved_dir ${WORKSPACE}/all_models/t5/fastertransformer/1/ \
        -inference_tensor_para_size 1

Then start the Triton server, and modify and run t5_end_to_end_test.py or summarization.py with -beam_width 4. One example input that crashes Triton is "The tower is 324 metres."
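
Concretely, those last two commands look roughly like this. The model repository path comes from the conversion step above; the test script location is an assumption based on the fastertransformer_backend layout, so adjust it to your checkout:

tritonserver --model-repository=${WORKSPACE}/all_models/t5/ &

# in another shell, once the model is reported READY
# (run from the fastertransformer_backend repo root; path may differ in your checkout)
python3 tools/t5_utils/t5_end_to_end_test.py -beam_width 4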

gyin94 commented 1 year ago

I am curious whether the Triton FasterTransformer backend supports T5 v1.1 or not. Does it only work with the original T5 for now?

gyin94 commented 1 year ago

Note: if we run this T5 v1.1 model with PyTorch + FasterTransformer, it works for beam width > 1. The weird behavior only happens with Triton + FasterTransformer.

abdallag commented 1 year ago

We have also faced problems trying to use the T5 v1.1 model, specifically 'google/t5-v1_1-xxl'. I'm not sure whether the problem is with beam search or with the conversion of the v1.1 model. When we switched back to the older T5 model, specifically 't5-11b', everything worked like a charm.

byshiue commented 1 year ago

I cannot reproduce the issue. Can you try to run

FT_DEBUG_LEVEL=DEBUG tritonserver --model-repository=<your_model>

on the main branch again? If there is an error, please also post the GPU and Docker image you use.

denisyarats commented 1 year ago

T5 v1.1 (google/t5-v1_1-base) with Triton + FT produces incorrect outputs for me (the same token over and over again):

matricmatricmatricmatricmatricmatricmatricmatricmatricmatricmatricmatricmatricmatricmatricmatricmatricmatricmatricmatricmatric

The converted weights work fine with the summarization.py script, or with the original T5 model; it looks like something is off on the Triton side. Has anyone faced a similar issue?

Chris113113 commented 1 year ago

Also having issues with T5 v1.1 crashing Triton. For successful requests I just see absolute garbage as the output, but running test_summarize.py actually results in Triton crashing with:

what(): [FT][ERROR] CUDA runtime error: an illegal memory access was encountered /workspace/build/fastertransformer_backend/build/_deps/repo-ft-src/src/fastertransformer/models/t5/T5Decoding.cc:553

Steps to reproduce (a rough shell sketch follows):
1. Build FT with -DSM=80
2. Convert google/t5-v1_1-base with inference_tensor_para_size=4
3. Run this example: https://github.com/triton-inference-server/fastertransformer_backend/blob/main/docs/t5_guide.md#run-t5-v11flan-t5mt5
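
A rough shell sketch of those steps; the build directory, cmake flags, and checkpoint paths are assumptions based on the standard fastertransformer_backend flow and the conversion command earlier in this thread, so adjust them to your environment:

# 1) build fastertransformer_backend for SM 80 (A100)
cd ${WORKSPACE}/fastertransformer_backend/build
cmake -DSM=80 -DCMAKE_BUILD_TYPE=Release .. && make -j$(nproc)

# 2) convert google/t5-v1_1-base with tensor parallelism 4
cd ${WORKSPACE}
git lfs clone https://huggingface.co/google/t5-v1_1-base
python3 FasterTransformer/examples/pytorch/t5/utils/huggingface_t5_ckpt_convert.py \
        -in_file t5-v1_1-base/ \
        -saved_dir ${WORKSPACE}/all_models/t5/fastertransformer/1/ \
        -inference_tensor_para_size 4

# 3) start the server (config.pbtxt must also use tensor_para_size=4) and run the
#    "Run t5-v1.1/flan-t5/mt5" example from the t5_guide linked above
tritonserver --model-repository=${WORKSPACE}/all_models/t5/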

I'm using the Triton base image 22.07 due to an issue with gs:// storage buckets introduced in 22.08.