gyin94 opened 1 year ago
Can you provide the reproduction steps and the results you observe?
In the T5 Triton backend, the beam width is also set on the fly, per request; we don't use the value passed to the constructor. That's why we set it to 0.
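(A minimal client-side sketch of what "on the fly" means here; the tensor name, shape, and dtype are assumptions based on the example clients in this repo, so verify them against your config.pbtxt:)
# Hedged sketch: beam_width is sent as a per-request input tensor instead of
# being fixed at model construction time (hence the 0 in the constructor/config).
import numpy as np
import tritonclient.http as httpclient

beam_width = np.array([[4]], dtype=np.uint32)               # shape [batch, 1] assumed
beam_input = httpclient.InferInput("beam_width", beam_width.shape, "UINT32")
beam_input.set_data_from_numpy(beam_width)
# ...this tensor is appended to the other request inputs before client.infer(...)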
Yes, you are right about that part. And I can run T5 successfully on Triton 22.07 + fastertransformer_backend v1.3 + FasterTransformer v5.2. However, it breaks with a T5 v1.1 model. Try such a model with beam_width 4 and t5_end_to_end_test.py; it will break.
I1219 07:01:47.681304 93764 http_server.cc:178] Started Metrics Service at 0.0.0.0:8002
terminate called after throwing an instance of 'std::runtime_error'
what(): [FT][ERROR] CUDA runtime error: an illegal memory access was encountered /workspace/fastertransformer_backend/build/_deps/repo-ft-src/src/fastertransformer/utils/memory_utils.cu:101
Signal (6) received.
0# 0x0000562A89BA1C19 in /opt/tritonserver/bin/tritonserver
1# 0x00007FBA5AAE9090 in /lib/x86_64-linux-gnu/libc.so.6
2# gsignal in /lib/x86_64-linux-gnu/libc.so.6
3# abort in /lib/x86_64-linux-gnu/libc.so.6
4# 0x00007FBA5AEA2911 in /lib/x86_64-linux-gnu/libstdc++.so.6
5# 0x00007FBA5AEAE38C in /lib/x86_64-linux-gnu/libstdc++.so.6
6# 0x00007FBA5AEAE3F7 in /lib/x86_64-linux-gnu/libstdc++.so.6
7# 0x00007FBA5AEAE6A9 in /lib/x86_64-linux-gnu/libstdc++.so.6
8# 0x00007FB9942C91EC in /opt/tritonserver/backends/fastertransformer/libtransformer-shared.so
9# fastertransformer::T5Decoding<__half>::forward(fastertransformer::TensorMap*, fastertransformer::TensorMap*, fastertransformer::T5DecodingWeight<__half> const*) in /opt/tritonserver/backends/fastertransformer/libtransformer-shared.so
10# T5TritonModelInstance<__half>::forward(std::shared_ptr<std::unordered_map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, triton::Tensor, std::hash<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::equal_to<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, triton::Tensor> > > >) in /opt/tritonserver/backends/fastertransformer/libtransformer-shared.so
11# 0x00007FBA504DB207 in /opt/tritonserver/backends/fastertransformer/libtriton_fastertransformer.so
12# 0x00007FBA5AEDADE4 in /lib/x86_64-linux-gnu/libstdc++.so.6
13# 0x00007FBA5C0D9609 in /lib/x86_64-linux-gnu/libpthread.so.0
14# clone in /lib/x86_64-linux-gnu/libc.so.6
Can you provide the reproduction steps, step by step?
It is exactly the same as this tutorial, except that the model is cloned with git lfs clone https://huggingface.co/Alred/t5-v1_1-small-finetuned-summarization-cnn-ver1.
Then run the conversion:
python3 FasterTransformer/examples/pytorch/t5/utils/huggingface_t5_ckpt_convert.py \
-in_file t5-v1_1-small-finetuned-summarization-cnn-ver1/ \
-saved_dir ${WORKSPACE}/all_models/t5/fastertransformer/1/ \
-inference_tensor_para_size 1
Then start the Triton server and run t5_end_to_end_test.py (or summarization.py), modified to use -beam_width 4. One example input that crashes Triton is "The tower is 324 metres"; a sketch of that request is shown below.
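For reference, a minimal Python client sketch of this repro (the tensor names, dtypes, and the model name "fastertransformer" are assumptions based on the example clients such as t5_end_to_end_test.py; adjust them to match your deployed config.pbtxt):
# Hedged repro sketch: send the crashing input with beam_width 4.
import numpy as np
import tritonclient.http as httpclient
from transformers import T5Tokenizer

# Tokenizer loaded from the locally cloned checkpoint directory.
tokenizer = T5Tokenizer.from_pretrained("t5-v1_1-small-finetuned-summarization-cnn-ver1")
ids = tokenizer("The tower is 324 metres", return_tensors="np").input_ids.astype(np.uint32)

def make_input(name, arr, dtype):
    t = httpclient.InferInput(name, arr.shape, dtype)
    t.set_data_from_numpy(arr)
    return t

inputs = [
    make_input("input_ids", ids, "UINT32"),
    make_input("sequence_length", np.array([[ids.shape[1]]], dtype=np.uint32), "UINT32"),
    make_input("max_output_len", np.array([[64]], dtype=np.uint32), "UINT32"),
    make_input("beam_width", np.array([[4]], dtype=np.uint32), "UINT32"),  # crashes with 4, works with 1
]

client = httpclient.InferenceServerClient("localhost:8000")
result = client.infer("fastertransformer", inputs)
print(result.as_numpy("output_ids"))  # output tensor name assumed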
I am curious whether the Triton FasterTransformer backend supports T5 v1.1 or not. Does it only work with T5 v1 for now?
Note: if we run this T5 v1.1 model with PyTorch + FasterTransformer, it works with beam width > 1. The weird behavior only happens with Triton + FasterTransformer.
We have also faced problems trying to use the T5 v1.1 model, specifically 'google/t5-v1_1-xxl'. I'm not sure whether the problem is with beam search or with the conversion of the v1.1 model. When we switched back to the older T5 model, specifically 't5-11b', everything worked like a charm.
I cannot reproduce the issue. Can you try to run
FT_DEBUG_LEVEL=DEBUG tritonserver --model-repository=<your_model>
on the main branch again? If there is an error, please also post the GPU and Docker image you use.
T5 v1.1 (google/t5-v1_1-base) with Triton + FT produces incorrect output for me (the same token repeated over and over):
matricmatricmatricmatricmatricmatricmatricmatricmatricmatricmatricmatricmatricmatricmatricmatricmatricmatricmatricmatricmatric
The converted weights work fine with the summarization.py script, or with the original T5 model. It looks like something is off on the Triton side. Has anyone faced a similar issue?
Also having issues with T5 v1.1 crashing Triton. For successful requests I just see absolute garbage as the output, but running test_summarize.py actually results in Triton crashing with:
what(): [FT][ERROR] CUDA runtime error: an illegal memory access was encountered /workspace/build/fastertransformer_backend/build/_deps/repo-ft-src/src/fastertransformer/models/t5/T5Decoding.cc:553
Steps for repro:
- Build FT for -DSM=80
- Convert google/t5-v1_1-base for inference_tensor_para_size=4
- Run this example: https://github.com/triton-inference-server/fastertransformer_backend/blob/main/docs/t5_guide.md#run-t5-v11flan-t5mt5
I'm using the Triton base image 22.07 due to an issue with gs storage buckets introduced in 22.08.
Branch/Tag/Commit: v5.2
Docker Image Version: 22.08-py3
GPU name: V100
CUDA Driver: none
Reproduced Steps: