michaelroyzen opened this issue 1 year ago
Can you explain the difference between GPT-J and LLaMA?
+1 for this
They look very similar. HuggingFace's doc page says the implementation is based on the GPT-NeoX codebase, which seems to be supported by FasterTransformer: https://huggingface.co/docs/transformers/main/model_doc/llama.
Do you think it'll work?
+1
@byshiue According to our investigation, it is not difficult to port this model to Megatron as well. But I am not sure whether a single conversion script will work.
Thank you for the suggestion and discussion. We may not have time to work on this issue right now. If you are interested, you can try to add support yourself. You are welcome to ask questions if you encounter any issues, and to merge back into our repo if you get it working.
+1 for this
It seems to be quite a simple implementation @byshiue. All that needs to be done is to implement RMS layer norm in GPT-NeoX, as well as to support the SiLU activation. It seems that both of these features are already implemented elsewhere in FasterTransformer.
I'd be happy to take the lead if you can help me with the general steps.
+1 for this
+1 for this
I compared the GPT-J and LLaMA models in HuggingFace; they have the same attention layer. There are some differences in the FFN: LLaMA uses 3 weight matrices, and the forward function is as follows
def forward(self, x):
    return self.w2(F.silu(self.w1(x)) * self.w3(x))
I checked the relevant code of the FFN layer in the source, and it seems there is no similar structure. Or perhaps such a layer already exists in the current code and I have not found it; I hope to get some tips. @byshiue
It looks like a standard gated SiLU. Can you explain what difference you see?
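For reference, that structure in isolation, as a minimal self-contained PyTorch sketch (the w1/w2/w3 naming follows the snippet above; the 4096/11008 sizes are just llama-7b's, for illustration):

import torch
import torch.nn as nn
import torch.nn.functional as F

class GatedSiluFFN(nn.Module):
    # LLaMA-style FFN: down_proj(silu(gate_proj(x)) * up_proj(x)), no biases.
    def __init__(self, hidden_size, inter_size):
        super().__init__()
        self.w1 = nn.Linear(hidden_size, inter_size, bias=False)  # gate_proj
        self.w3 = nn.Linear(hidden_size, inter_size, bias=False)  # up_proj
        self.w2 = nn.Linear(inter_size, hidden_size, bias=False)  # down_proj

    def forward(self, x):
        return self.w2(F.silu(self.w1(x)) * self.w3(x))

print(GatedSiluFFN(4096, 11008)(torch.randn(2, 4096)).shape)  # torch.Size([2, 4096])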
Thanks for the reminder, I missed this part. I will try to make this work.
Wow, thank you @moonscar. Want any help? What's the status of your PR?
need this too
@moonscar have you started this work? Or I can help with it.
Don't think it's been started yet @Anychnn
Given the interest and activity here, I'd like to offer a bounty of $2,500 USD to whoever can get Llama implemented in FT. Please email me at michael@phind.com if you're interested. @moonscar @AnShengqiang @Anychnn @byshiue
It seems that all that needs to be done is to copy over T5's RMS layer norm (already implemented in FT) and UL2's gated SiLU (also already implemented elsewhere in FT) into GPT-NeoX. As per HuggingFace's implementation of Llama, it is otherwise completely identical to GPT-NeoX (which is already implemented in FT).
The bounty will be $3,000 if a correct and working PR is opened by the end of Friday, April 21st (Pacific Time).
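For anyone picking this up, the RMS layer norm in question is tiny; a minimal PyTorch sketch matching the T5/LLaMA formulation (the eps value is illustrative; see the eps discussion further down):

import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    # RMSNorm: no mean subtraction and no bias, just root-mean-square
    # scaling by a learned per-channel weight.
    def __init__(self, hidden_size, eps=1e-6):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(hidden_size))
        self.eps = eps

    def forward(self, x):
        rms = x.pow(2).mean(-1, keepdim=True).add(self.eps).rsqrt()
        return self.weight * x * rms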
I would be glad to help with part of the work, for example converting the weights to FT.
Made a lot of progress on this, but my current FT model is outputting seemingly random tokens, so there's something wrong with my weight conversion or maybe even the exact layer implementation. If someone wants to pick up the torch (I am done for now 😞), the next step would probably be to compare layer-by-layer the output of the HuggingFace model vs. this FT model:
Weights conversion: https://github.com/cameronfr/FasterTransformer/blob/main/examples/cpp/llama/huggingface_llama_convert.py
FT model: https://github.com/cameronfr/FasterTransformer/tree/main/src/fastertransformer/models/llama
Testing: https://github.com/cameronfr/FasterTransformer/tree/main/examples/cpp/llama
Everything is modified from the respective GPTNeoX versions. LlamaContextDecoder and LlamaDecoder essentially just have the changes of GELU -> gated SiLU and LayerNorm -> LayerNormT5. LlamaDecoderLayerWeight and LlamaWeight set the parameters of these layers.
@cameronfr The default layernorm_eps in llama.h is set to 1e-5, but llama-7b in PyTorch defaults to 1e-6. And the attention module output is also incorrect; I am fixing this.
@cameronfr I think the reshape of qkv here might not be correct: https://github.com/cameronfr/FasterTransformer/blob/45d48f9d06713cd006f7d95d4b2f99a4bd3abb11/examples/cpp/llama/huggingface_llama_convert.py#L97
since the HuggingFace-format qkv projection is permuted for rotary embedding: https://github.com/huggingface/transformers/blob/d04ec99bec8a0b432fc03ed60cea9a1a20ebaf3c/src/transformers/models/llama/convert_llama_weights_to_hf.py#L101
So I tried something like:
qkvArr[:, 0, :, :] = qArr.reshape(n_heads, 2, head_size // 2, hidden_size).transpose((3, 0, 2, 1)).reshape(hidden_size, n_heads, head_size)
and fixed the layernorm_eps, but the output tokens are still seemingly incorrect, not forming a sentence.
Also, I changed start_ids.csv not to use the one from gptneox, since the two models may not share the same token IDs.
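For anyone else debugging this, here is the same idea as a standalone numpy sketch. It assumes the HF converter stored each head's q/k rows as [first rotary half, then second half] while FT's kernel wants the interleaved-pair layout, with the weight stored as [hidden_size, n_heads, head_size]; the function and variable names are mine:

import numpy as np

def unpermute_qk_for_ft(w_hf, n_heads, head_size, hidden_size):
    # w_hf: [hidden_size, hidden_size], rows = output features in HF's
    # half-split rotary order. Re-interleave the two halves of each head
    # and move the input dim first for FT-style storage.
    return (w_hf.reshape(n_heads, 2, head_size // 2, hidden_size)
                .transpose(3, 0, 2, 1)  # -> [hidden, heads, pairs, 2]
                .reshape(hidden_size, n_heads, head_size))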
Great progress @cameronfr @Anychnn @jinluyang. I'm doubling the bounty to $6k to whoever can get this working and merged in.
Hey @michaelroyzen @cameronfr @Anychnn @jinluyang , I got a self-tested working version and opened a pull request with it. Could you guys please take a look? Any chance we could get it merged?
Nice! Works well so far in limited tests and is consistent with the Huggingface output using beam_size 1. One comment is that it should support max_position_embeddings (max_pos_seq_len in FT), but this is likely a simple change. Will continue testing and post the updates here.
@michaelroyzen Does FT support the fine-tuned LLaMA with Lora? Training code is as follows: https://github.com/tloen/alpaca-lora/blob/main/finetune.py
Use the merge_adapter interface to merge LoRA weights into the original linear weights: https://github.com/huggingface/peft/blob/main/src/peft/tuners/lora.py#L279
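A minimal sketch of that merge flow with peft (the model paths and IDs are placeholders; merge_and_unload is the higher-level wrapper that folds every adapter into the base weights):

import torch
from peft import PeftModel
from transformers import LlamaForCausalLM

base = LlamaForCausalLM.from_pretrained("path/to/llama-7b-hf", torch_dtype=torch.float16)
model = PeftModel.from_pretrained(base, "tloen/alpaca-lora-7b")
merged = model.merge_and_unload()  # folds W + B@A * scaling into the base Linear weights
merged.save_pretrained("./llama-7b-merged")  # now convertible with the FT script like any HF llama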
Hey community, here are some updates:
Hey, a tutorial on how to run LLaMA with the FasterTransformer backend would be really helpful! would be happy to contribute
Sure, will provide a step-by-step tutorial on how to run HF llama later.
@void-main I checked the llama-13b inference result on FasterTransformer; it is about 8 seconds per request on A100, and the greedy-search result is consistent with HuggingFace. Good job!
That's quite fast @Anychnn , could you briefly tell me the steps you took to get it running? Thanks!!
I'd love to see quantization such as GPTQ. What amazing work, guys, thank you all! ❤️
Hi @Anychnn , how much time does it take per request on A100 with the HuggingFace implementation?
Does FasterTransformer support quantization?
There seem to be two versions of the HuggingFace LLaMA weights converter; the older one had issues with the BOS and EOS tokens, and the newer converter fixes those issues. Which version of LLaMA (converted using the v1 converter, or converted using the v2 converter) does this PR work with? Thanks
On quantization: I believe the best option we have is INT8 weight-only quantization, which is supported by FT (but not in the Llama implementation).
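For intuition, weight-only INT8 stores each weight matrix as int8 with a per-output-channel scale and dequantizes on the fly inside the GEMM, while activations stay in FP16. A toy numpy sketch of the storage side (not FT's actual kernels):

import numpy as np

def quantize_weight_int8(w):
    # Symmetric per-output-channel quantization: w ~= scale[:, None] * q
    scale = np.maximum(np.abs(w).max(axis=1), 1e-8) / 127.0
    q = np.clip(np.round(w / scale[:, None]), -127, 127).astype(np.int8)
    return q, scale

w = np.random.randn(8, 16).astype(np.float32)
q, scale = quantize_weight_int8(w)
print(np.abs(w - q * scale[:, None]).max())  # small reconstruction error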
@SupreethRao99 could you please point me to these converters?
@void-main , the converters can be found here: https://github.com/huggingface/transformers/commits/main/src/transformers/models/llama/convert_llama_weights_to_hf.py . If we take a look at its commit history, there's a fix to the tokenizer on 3rd April 2023; v1 seems to be the converter before this date and v2 the one after this date.
FT supports the newer converter @SupreethRao99
Hey, I'm trying to run LLaMA with the FasterTransformer backend on a Triton Inference Server. I am closely following this tutorial (https://towardsdatascience.com/deploy-your-local-gpt-server-with-triton-a825d528aa5d) and made the following changes.
I changed the Dockerfile in fastertransformer_backend to:
# Copyright (c) 2021-2022, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
ARG TRITON_VERSION=23.04
ARG BASE_IMAGE=nvcr.io/nvidia/tritonserver:${TRITON_VERSION}-py3
FROM ${BASE_IMAGE}
RUN apt-get update
RUN apt-get install -y --no-install-recommends \
autoconf \
autogen \
clangd \
cmake \
gdb \
git-lfs \
libb64-dev \
libz-dev \
locales-all \
mosh \
openssh-server \
python3-dev \
rapidjson-dev \
sudo \
tmux \
unzip \
xz-utils \
zstd \
zip \
zsh
RUN pip install torch==1.13.1+cu117 --extra-index-url https://download.pytorch.org/whl/cu117 && \
pip3 install --extra-index-url https://pypi.ngc.nvidia.com regex fire ipywidgets tritonclient[all] && \
pip3 install transformers huggingface_hub tokenizers SentencePiece sacrebleu datasets tqdm omegaconf rouge_score && \
pip3 install cmake==3.24.3
RUN apt-get clean && \
rm -rf /var/lib/apt/lists/*
# backend build
ADD . /workspace/build/fastertransformer_backend
RUN mkdir -p /workspace/build/fastertransformer_backend/build
WORKDIR /workspace/build/fastertransformer_backend/build
ARG FORCE_BACKEND_REBUILD=0
RUN cmake \
-D CMAKE_EXPORT_COMPILE_COMMANDS=1 \
-D CMAKE_BUILD_TYPE=Release \
-D ENABLE_FP8=OFF \
-D CMAKE_INSTALL_PREFIX=/opt/tritonserver \
-D TRITON_COMMON_REPO_TAG="r${NVIDIA_TRITON_SERVER_VERSION}" \
-D TRITON_CORE_REPO_TAG="r${NVIDIA_TRITON_SERVER_VERSION}" \
-D TRITON_BACKEND_REPO_TAG="r${NVIDIA_TRITON_SERVER_VERSION}" \
..
RUN cd _deps/repo-ft-src/ && \
git log | head -n 3 2>&1 | tee /workspace/build/fastertransformer_backend/FT_version.txt && \
cd /workspace/build/fastertransformer_backend/build && \
make -j"$(grep -c ^processor /proc/cpuinfo)" install && \
rm /workspace/build/fastertransformer_backend/build/bin/*_example -rf && \
rm /workspace/build/fastertransformer_backend/build/lib/lib*Backend.so -rf
ENV NCCL_LAUNCH_MODE=GROUP
ENV WORKSPACE /workspace
WORKDIR /workspace
RUN sed -i 's/#X11UseLocalhost yes/X11UseLocalhost no/g' /etc/ssh/sshd_config && \
mkdir /var/run/sshd -p
RUN ln -sf /usr/bin/python3.8 /usr/bin/python
I then cloned the FasterTransformer library and pulled the new llama additions from the pull request associated with this issue, #575 .
I then followed the repository till the end and used the following config.pbtxt, adapted from the gpt-j example:
name: "fastertransformer"
backend: "fastertransformer"
default_model_filename: "vicuna-13b"
max_batch_size: 1024
model_transaction_policy {
decoupled: False
}
input [
{
name: "input_ids"
data_type: TYPE_UINT32
dims: [ -1 ]
},
{
name: "start_id"
data_type: TYPE_UINT32
dims: [ 1 ]
reshape: { shape: [ ] }
optional: true
},
{
name: "end_id"
data_type: TYPE_UINT32
dims: [ 1 ]
reshape: { shape: [ ] }
optional: true
},
{
name: "input_lengths"
data_type: TYPE_UINT32
dims: [ 1 ]
reshape: { shape: [ ] }
},
{
name: "request_output_len"
data_type: TYPE_UINT32
dims: [ -1 ]
},
{
name: "runtime_top_k"
data_type: TYPE_UINT32
dims: [ 1 ]
reshape: { shape: [ ] }
optional: true
},
{
name: "runtime_top_p"
data_type: TYPE_FP32
dims: [ 1 ]
reshape: { shape: [ ] }
optional: true
},
{
name: "beam_search_diversity_rate"
data_type: TYPE_FP32
dims: [ 1 ]
reshape: { shape: [ ] }
optional: true
},
{
name: "temperature"
data_type: TYPE_FP32
dims: [ 1 ]
reshape: { shape: [ ] }
optional: true
},
{
name: "len_penalty"
data_type: TYPE_FP32
dims: [ 1 ]
reshape: { shape: [ ] }
optional: true
},
{
name: "repetition_penalty"
data_type: TYPE_FP32
dims: [ 1 ]
reshape: { shape: [ ] }
optional: true
},
{
name: "random_seed"
data_type: TYPE_UINT64
dims: [ 1 ]
reshape: { shape: [ ] }
optional: true
},
{
name: "is_return_log_probs"
data_type: TYPE_BOOL
dims: [ 1 ]
reshape: { shape: [ ] }
optional: true
},
{
name: "beam_width"
data_type: TYPE_UINT32
dims: [ 1 ]
reshape: { shape: [ ] }
optional: true
},
{
name: "bad_words_list"
data_type: TYPE_INT32
dims: [ 2, -1 ]
optional: true
},
{
name: "stop_words_list"
data_type: TYPE_INT32
dims: [ 2, -1 ]
optional: true
},
{
name: "prompt_learning_task_name_ids"
data_type: TYPE_UINT32
dims: [ 1 ]
reshape: { shape: [ ] }
optional: true
},
{
name: "top_p_decay"
data_type: TYPE_FP32
dims: [ 1 ]
reshape: { shape: [ ] }
optional: true
},
{
name: "top_p_min"
data_type: TYPE_FP32
dims: [ 1 ]
reshape: { shape: [ ] }
optional: true
},
{
name: "top_p_reset_ids"
data_type: TYPE_UINT32
dims: [ 1 ]
reshape: { shape: [ ] }
optional: true
}
]
output [
{
name: "output_ids"
data_type: TYPE_UINT32
dims: [ -1, -1 ]
},
{
name: "sequence_length"
data_type: TYPE_UINT32
dims: [ -1 ]
},
{
name: "cum_log_probs"
data_type: TYPE_FP32
dims: [ -1 ]
},
{
name: "output_log_probs"
data_type: TYPE_FP32
dims: [ -1, -1 ]
}
]
instance_group [
{
count: 1
kind: KIND_CPU
}
]
parameters {
key: "tensor_para_size"
value: {
string_value: "4"
}
}
parameters {
key: "pipeline_para_size"
value: {
string_value: "1"
}
}
parameters {
key: "data_type"
value: {
string_value: "fp16"
}
}
parameters {
key: "model_type"
value: {
string_value: "vicuna-13b"
}
}
parameters {
key: "model_checkpoint_path"
value: {
string_value: "vicuna-13b/4-gpu/"
}
}
parameters {
key: "enable_custom_all_reduce"
value: {
string_value: "0"
}
}
I had previously converted the model to the FasterTransformer format with the script provided as part of the PR. I then ran /opt/tritonserver/bin/tritonserver --model-repository=./vicuna-13b-ft
and got the following error:
I0506 03:39:13.088086 14496 pinned_memory_manager.cc:240] Pinned memory pool is created at '0x7f0fe2000000' with size 268435456
I0506 03:39:13.091928 14496 cuda_memory_manager.cc:105] CUDA memory pool is created on device 0 with size 67108864
I0506 03:39:13.091940 14496 cuda_memory_manager.cc:105] CUDA memory pool is created on device 1 with size 67108864
I0506 03:39:13.091948 14496 cuda_memory_manager.cc:105] CUDA memory pool is created on device 2 with size 67108864
I0506 03:39:13.091955 14496 cuda_memory_manager.cc:105] CUDA memory pool is created on device 3 with size 67108864
W0506 03:39:13.556236 14496 server.cc:237] failed to enable peer access for some device pairs
E0506 03:39:13.568928 14496 model_repository_manager.cc:1245] Poll failed for model directory '4-gpu': Invalid model name: Could not determine backend for model '4-gpu' with no backend in model configuration. Expected model name of the form 'model.<backend_name>'.
I0506 03:39:13.568997 14496 server.cc:583]
+------------------+------+
| Repository Agent | Path |
+------------------+------+
+------------------+------+
I0506 03:39:13.569016 14496 server.cc:610]
+---------+------+--------+
| Backend | Path | Config |
+---------+------+--------+
+---------+------+--------+
I0506 03:39:13.569030 14496 server.cc:653]
+-------+---------+--------+
| Model | Version | Status |
+-------+---------+--------+
+-------+---------+--------+
I0506 03:39:13.619927 14496 metrics.cc:808] Collecting metrics for GPU 0: Tesla T4
I0506 03:39:13.619983 14496 metrics.cc:808] Collecting metrics for GPU 1: Tesla T4
I0506 03:39:13.619996 14496 metrics.cc:808] Collecting metrics for GPU 2: Tesla T4
I0506 03:39:13.620008 14496 metrics.cc:808] Collecting metrics for GPU 3: Tesla T4
I0506 03:39:13.621315 14496 metrics.cc:701] Collecting CPU metrics
I0506 03:39:13.621599 14496 tritonserver.cc:2387]
+----------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Option | Value |
+----------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| server_id | triton |
| server_version | 2.33.0 |
| server_extensions | classification sequence model_repository model_repository(unload_dependents) schedule_policy model_configuration system_shared_memory cuda_shared_memory binary_tens |
| | or_data parameters statistics trace logging |
| model_repository_path[0] | ./vicuna-13b-ft |
| model_control_mode | MODE_NONE |
| strict_model_config | 0 |
| rate_limit | OFF |
| pinned_memory_pool_byte_size | 268435456 |
| cuda_memory_pool_byte_size{0} | 67108864 |
| cuda_memory_pool_byte_size{1} | 67108864 |
| cuda_memory_pool_byte_size{2} | 67108864 |
| cuda_memory_pool_byte_size{3} | 67108864 |
| min_supported_compute_capability | 6.0 |
| strict_readiness | 1 |
| exit_timeout | 30 |
| cache_enabled | 0 |
+----------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------+
I0506 03:39:13.621653 14496 server.cc:284] Waiting for in-flight requests to complete.
I0506 03:39:13.621662 14496 server.cc:300] Timeout 30: Found 0 model versions that have in-flight inferences
I0506 03:39:13.621669 14496 server.cc:315] All models are stopped, unloading models
I0506 03:39:13.621674 14496 server.cc:322] Timeout 30: Found 0 live models and 0 in-flight non-inference requests
error: creating server: Internal - failed to load all models
Is my process correct? Could anyone help me? Thank you!
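(For what it's worth, once the repository does load, a minimal smoke test against the config above would look roughly like this tritonclient sketch; the URL, prompt token IDs, and output length are illustrative:)

import numpy as np
import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="localhost:8000")

input_ids = np.array([[1, 15043, 3186]], dtype=np.uint32)  # [batch, seq]
request = {
    "input_ids": input_ids,
    "input_lengths": np.array([[input_ids.shape[1]]], dtype=np.uint32),
    "request_output_len": np.array([[32]], dtype=np.uint32),
}
tensors = []
for name, arr in request.items():
    t = httpclient.InferInput(name, list(arr.shape), "UINT32")
    t.set_data_from_numpy(arr)
    tensors.append(t)

result = client.infer("fastertransformer", tensors)
print(result.as_numpy("output_ids"))  # [batch, beam_width, seq] token ids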
@SupreethRao99
E0506 03:39:13.568928 14496 model_repository_manager.cc:1245] Poll failed for model directory '4-gpu': Invalid model name: Could not determine backend for model '4-gpu' with no backend in model configuration. Expected model name of the form 'model.<backend_name>'.
As the log shows, your directory name may be wrong. Can you show the tree of ./vicuna-13b-ft?
Sure, this is the content of ./vicuna-13b-ft:
4-gpu
config.pbtxt
inside 4-gpu I have these files
config.ini model.layers.27.attention.dense.weight.0.bin
model.final_layernorm.weight.bin model.layers.27.attention.dense.weight.1.bin
model.layers.0.attention.dense.weight.0.bin model.layers.27.attention.dense.weight.2.bin
model.layers.0.attention.dense.weight.1.bin model.layers.27.attention.dense.weight.3.bin
model.layers.0.attention.dense.weight.2.bin model.layers.27.attention.query_key_value.weight.0.bin
model.layers.0.attention.dense.weight.3.bin model.layers.27.attention.query_key_value.weight.1.bin
model.layers.0.attention.query_key_value.weight.0.bin model.layers.27.attention.query_key_value.weight.2.bin
model.layers.0.attention.query_key_value.weight.1.bin model.layers.27.attention.query_key_value.weight.3.bin
model.layers.0.attention.query_key_value.weight.2.bin model.layers.27.input_layernorm.weight.bin
model.layers.0.attention.query_key_value.weight.3.bin model.layers.27.mlp.down_proj.weight.0.bin
model.layers.0.input_layernorm.weight.bin model.layers.27.mlp.down_proj.weight.1.bin
model.layers.0.mlp.down_proj.weight.0.bin model.layers.27.mlp.down_proj.weight.2.bin
model.layers.0.mlp.down_proj.weight.1.bin model.layers.27.mlp.down_proj.weight.3.bin
model.layers.0.mlp.down_proj.weight.2.bin model.layers.27.mlp.gate_proj.weight.0.bin
model.layers.0.mlp.down_proj.weight.3.bin model.layers.27.mlp.gate_proj.weight.1.bin
model.layers.0.mlp.gate_proj.weight.0.bin model.layers.27.mlp.gate_proj.weight.2.bin
model.layers.0.mlp.gate_proj.weight.1.bin model.layers.27.mlp.gate_proj.weight.3.bin
model.layers.0.mlp.gate_proj.weight.2.bin model.layers.27.mlp.up_proj.weight.0.bin
model.layers.0.mlp.gate_proj.weight.3.bin model.layers.27.mlp.up_proj.weight.1.bin
model.layers.0.mlp.up_proj.weight.0.bin model.layers.27.mlp.up_proj.weight.2.bin
model.layers.0.mlp.up_proj.weight.1.bin model.layers.27.mlp.up_proj.weight.3.bin
model.layers.0.mlp.up_proj.weight.2.bin model.layers.27.post_attention_layernorm.weight.bin
model.layers.0.mlp.up_proj.weight.3.bin model.layers.28.attention.dense.weight.0.bin
model.layers.0.post_attention_layernorm.weight.bin model.layers.28.attention.dense.weight.1.bin
model.layers.1.attention.dense.weight.0.bin model.layers.28.attention.dense.weight.2.bin
model.layers.1.attention.dense.weight.1.bin model.layers.28.attention.dense.weight.3.bin
model.layers.1.attention.dense.weight.2.bin model.layers.28.attention.query_key_value.weight.0.bin
model.layers.1.attention.dense.weight.3.bin model.layers.28.attention.query_key_value.weight.1.bin
model.layers.1.attention.query_key_value.weight.0.bin model.layers.28.attention.query_key_value.weight.2.bin
model.layers.1.attention.query_key_value.weight.1.bin model.layers.28.attention.query_key_value.weight.3.bin
model.layers.1.attention.query_key_value.weight.2.bin model.layers.28.input_layernorm.weight.bin
model.layers.1.attention.query_key_value.weight.3.bin model.layers.28.mlp.down_proj.weight.0.bin
model.layers.1.input_layernorm.weight.bin model.layers.28.mlp.down_proj.weight.1.bin
model.layers.1.mlp.down_proj.weight.0.bin model.layers.28.mlp.down_proj.weight.2.bin
model.layers.1.mlp.down_proj.weight.1.bin model.layers.28.mlp.down_proj.weight.3.bin
model.layers.1.mlp.down_proj.weight.2.bin model.layers.28.mlp.gate_proj.weight.0.bin
model.layers.1.mlp.down_proj.weight.3.bin model.layers.28.mlp.gate_proj.weight.1.bin
model.layers.1.mlp.gate_proj.weight.0.bin model.layers.28.mlp.gate_proj.weight.2.bin
model.layers.1.mlp.gate_proj.weight.1.bin model.layers.28.mlp.gate_proj.weight.3.bin
model.layers.1.mlp.gate_proj.weight.2.bin model.layers.28.mlp.up_proj.weight.0.bin
model.layers.1.mlp.gate_proj.weight.3.bin model.layers.28.mlp.up_proj.weight.1.bin
model.layers.1.mlp.up_proj.weight.0.bin model.layers.28.mlp.up_proj.weight.2.bin
model.layers.1.mlp.up_proj.weight.1.bin model.layers.28.mlp.up_proj.weight.3.bin
model.layers.1.mlp.up_proj.weight.2.bin model.layers.28.post_attention_layernorm.weight.bin
model.layers.1.mlp.up_proj.weight.3.bin model.layers.29.attention.dense.weight.0.bin
model.layers.1.post_attention_layernorm.weight.bin model.layers.29.attention.dense.weight.1.bin
model.layers.10.attention.dense.weight.0.bin model.layers.29.attention.dense.weight.2.bin
model.layers.10.attention.dense.weight.1.bin model.layers.29.attention.dense.weight.3.bin
model.layers.10.attention.dense.weight.2.bin model.layers.29.attention.query_key_value.weight.0.bin
model.layers.10.attention.dense.weight.3.bin model.layers.29.attention.query_key_value.weight.1.bin
model.layers.10.attention.query_key_value.weight.0.bin model.layers.29.attention.query_key_value.weight.2.bin
model.layers.10.attention.query_key_value.weight.1.bin model.layers.29.attention.query_key_value.weight.3.bin
model.layers.10.attention.query_key_value.weight.2.bin model.layers.29.input_layernorm.weight.bin
model.layers.10.attention.query_key_value.weight.3.bin model.layers.29.mlp.down_proj.weight.0.bin
model.layers.10.input_layernorm.weight.bin model.layers.29.mlp.down_proj.weight.1.bin
model.layers.10.mlp.down_proj.weight.0.bin model.layers.29.mlp.down_proj.weight.2.bin
model.layers.10.mlp.down_proj.weight.1.bin model.layers.29.mlp.down_proj.weight.3.bin
model.layers.10.mlp.down_proj.weight.2.bin model.layers.29.mlp.gate_proj.weight.0.bin
model.layers.10.mlp.down_proj.weight.3.bin model.layers.29.mlp.gate_proj.weight.1.bin
model.layers.10.mlp.gate_proj.weight.0.bin model.layers.29.mlp.gate_proj.weight.2.bin
model.layers.10.mlp.gate_proj.weight.1.bin model.layers.29.mlp.gate_proj.weight.3.bin
model.layers.10.mlp.gate_proj.weight.2.bin model.layers.29.mlp.up_proj.weight.0.bin
model.layers.10.mlp.gate_proj.weight.3.bin model.layers.29.mlp.up_proj.weight.1.bin
model.layers.10.mlp.up_proj.weight.0.bin model.layers.29.mlp.up_proj.weight.2.bin
model.layers.10.mlp.up_proj.weight.1.bin model.layers.29.mlp.up_proj.weight.3.bin
model.layers.10.mlp.up_proj.weight.2.bin model.layers.29.post_attention_layernorm.weight.bin
model.layers.10.mlp.up_proj.weight.3.bin model.layers.3.attention.dense.weight.0.bin
model.layers.10.post_attention_layernorm.weight.bin model.layers.3.attention.dense.weight.1.bin
model.layers.11.attention.dense.weight.0.bin model.layers.3.attention.dense.weight.2.bin
model.layers.11.attention.dense.weight.1.bin model.layers.3.attention.dense.weight.3.bin
model.layers.11.attention.dense.weight.2.bin model.layers.3.attention.query_key_value.weight.0.bin
model.layers.11.attention.dense.weight.3.bin model.layers.3.attention.query_key_value.weight.1.bin
model.layers.11.attention.query_key_value.weight.0.bin model.layers.3.attention.query_key_value.weight.2.bin
model.layers.11.attention.query_key_value.weight.1.bin model.layers.3.attention.query_key_value.weight.3.bin
model.layers.11.attention.query_key_value.weight.2.bin model.layers.3.input_layernorm.weight.bin
model.layers.11.attention.query_key_value.weight.3.bin model.layers.3.mlp.down_proj.weight.0.bin
model.layers.11.input_layernorm.weight.bin model.layers.3.mlp.down_proj.weight.1.bin
model.layers.11.mlp.down_proj.weight.0.bin model.layers.3.mlp.down_proj.weight.2.bin
model.layers.11.mlp.down_proj.weight.1.bin model.layers.3.mlp.down_proj.weight.3.bin
model.layers.11.mlp.down_proj.weight.2.bin model.layers.3.mlp.gate_proj.weight.0.bin
model.layers.11.mlp.down_proj.weight.3.bin model.layers.3.mlp.gate_proj.weight.1.bin
model.layers.11.mlp.gate_proj.weight.0.bin model.layers.3.mlp.gate_proj.weight.2.bin
model.layers.11.mlp.gate_proj.weight.1.bin model.layers.3.mlp.gate_proj.weight.3.bin
model.layers.11.mlp.gate_proj.weight.2.bin model.layers.3.mlp.up_proj.weight.0.bin
model.layers.11.mlp.gate_proj.weight.3.bin model.layers.3.mlp.up_proj.weight.1.bin
model.layers.11.mlp.up_proj.weight.0.bin model.layers.3.mlp.up_proj.weight.2.bin
model.layers.11.mlp.up_proj.weight.1.bin model.layers.3.mlp.up_proj.weight.3.bin
model.layers.11.mlp.up_proj.weight.2.bin model.layers.3.post_attention_layernorm.weight.bin
model.layers.11.mlp.up_proj.weight.3.bin model.layers.30.attention.dense.weight.0.bin
model.layers.11.post_attention_layernorm.weight.bin model.layers.30.attention.dense.weight.1.bin
model.layers.12.attention.dense.weight.0.bin model.layers.30.attention.dense.weight.2.bin
model.layers.12.attention.dense.weight.1.bin model.layers.30.attention.dense.weight.3.bin
model.layers.12.attention.dense.weight.2.bin model.layers.30.attention.query_key_value.weight.0.bin
model.layers.12.attention.dense.weight.3.bin model.layers.30.attention.query_key_value.weight.1.bin
model.layers.12.attention.query_key_value.weight.0.bin model.layers.30.attention.query_key_value.weight.2.bin
model.layers.12.attention.query_key_value.weight.1.bin model.layers.30.attention.query_key_value.weight.3.bin
model.layers.12.attention.query_key_value.weight.2.bin model.layers.30.input_layernorm.weight.bin
model.layers.12.attention.query_key_value.weight.3.bin model.layers.30.mlp.down_proj.weight.0.bin
model.layers.12.input_layernorm.weight.bin model.layers.30.mlp.down_proj.weight.1.bin
model.layers.12.mlp.down_proj.weight.0.bin model.layers.30.mlp.down_proj.weight.2.bin
model.layers.12.mlp.down_proj.weight.1.bin model.layers.30.mlp.down_proj.weight.3.bin
model.layers.12.mlp.down_proj.weight.2.bin model.layers.30.mlp.gate_proj.weight.0.bin
model.layers.12.mlp.down_proj.weight.3.bin model.layers.30.mlp.gate_proj.weight.1.bin
model.layers.12.mlp.gate_proj.weight.0.bin model.layers.30.mlp.gate_proj.weight.2.bin
model.layers.12.mlp.gate_proj.weight.1.bin model.layers.30.mlp.gate_proj.weight.3.bin
model.layers.12.mlp.gate_proj.weight.2.bin model.layers.30.mlp.up_proj.weight.0.bin
model.layers.12.mlp.gate_proj.weight.3.bin model.layers.30.mlp.up_proj.weight.1.bin
model.layers.12.mlp.up_proj.weight.0.bin model.layers.30.mlp.up_proj.weight.2.bin
model.layers.12.mlp.up_proj.weight.1.bin model.layers.30.mlp.up_proj.weight.3.bin
model.layers.12.mlp.up_proj.weight.2.bin model.layers.30.post_attention_layernorm.weight.bin
model.layers.12.mlp.up_proj.weight.3.bin model.layers.31.attention.dense.weight.0.bin
model.layers.12.post_attention_layernorm.weight.bin model.layers.31.attention.dense.weight.1.bin
model.layers.13.attention.dense.weight.0.bin model.layers.31.attention.dense.weight.2.bin
model.layers.13.attention.dense.weight.1.bin model.layers.31.attention.dense.weight.3.bin
model.layers.13.attention.dense.weight.2.bin model.layers.31.attention.query_key_value.weight.0.bin
model.layers.13.attention.dense.weight.3.bin model.layers.31.attention.query_key_value.weight.1.bin
model.layers.13.attention.query_key_value.weight.0.bin model.layers.31.attention.query_key_value.weight.2.bin
model.layers.13.attention.query_key_value.weight.1.bin model.layers.31.attention.query_key_value.weight.3.bin
model.layers.13.attention.query_key_value.weight.2.bin model.layers.31.input_layernorm.weight.bin
model.layers.13.attention.query_key_value.weight.3.bin model.layers.31.mlp.down_proj.weight.0.bin
model.layers.13.input_layernorm.weight.bin model.layers.31.mlp.down_proj.weight.1.bin
model.layers.13.mlp.down_proj.weight.0.bin model.layers.31.mlp.down_proj.weight.2.bin
model.layers.13.mlp.down_proj.weight.1.bin model.layers.31.mlp.down_proj.weight.3.bin
model.layers.13.mlp.down_proj.weight.2.bin model.layers.31.mlp.gate_proj.weight.0.bin
model.layers.13.mlp.down_proj.weight.3.bin model.layers.31.mlp.gate_proj.weight.1.bin
model.layers.13.mlp.gate_proj.weight.0.bin model.layers.31.mlp.gate_proj.weight.2.bin
model.layers.13.mlp.gate_proj.weight.1.bin model.layers.31.mlp.gate_proj.weight.3.bin
model.layers.13.mlp.gate_proj.weight.2.bin model.layers.31.mlp.up_proj.weight.0.bin
model.layers.13.mlp.gate_proj.weight.3.bin model.layers.31.mlp.up_proj.weight.1.bin
model.layers.13.mlp.up_proj.weight.0.bin model.layers.31.mlp.up_proj.weight.2.bin
model.layers.13.mlp.up_proj.weight.1.bin model.layers.31.mlp.up_proj.weight.3.bin
model.layers.13.mlp.up_proj.weight.2.bin model.layers.31.post_attention_layernorm.weight.bin
model.layers.13.mlp.up_proj.weight.3.bin model.layers.32.attention.dense.weight.0.bin
model.layers.13.post_attention_layernorm.weight.bin model.layers.32.attention.dense.weight.1.bin
model.layers.14.attention.dense.weight.0.bin model.layers.32.attention.dense.weight.2.bin
model.layers.14.attention.dense.weight.1.bin model.layers.32.attention.dense.weight.3.bin
model.layers.14.attention.dense.weight.2.bin model.layers.32.attention.query_key_value.weight.0.bin
model.layers.14.attention.dense.weight.3.bin model.layers.32.attention.query_key_value.weight.1.bin
model.layers.14.attention.query_key_value.weight.0.bin model.layers.32.attention.query_key_value.weight.2.bin
model.layers.14.attention.query_key_value.weight.1.bin model.layers.32.attention.query_key_value.weight.3.bin
model.layers.14.attention.query_key_value.weight.2.bin model.layers.32.input_layernorm.weight.bin
model.layers.14.attention.query_key_value.weight.3.bin model.layers.32.mlp.down_proj.weight.0.bin
model.layers.14.input_layernorm.weight.bin model.layers.32.mlp.down_proj.weight.1.bin
model.layers.14.mlp.down_proj.weight.0.bin model.layers.32.mlp.down_proj.weight.2.bin
model.layers.14.mlp.down_proj.weight.1.bin model.layers.32.mlp.down_proj.weight.3.bin
model.layers.14.mlp.down_proj.weight.2.bin model.layers.32.mlp.gate_proj.weight.0.bin
model.layers.14.mlp.down_proj.weight.3.bin model.layers.32.mlp.gate_proj.weight.1.bin
model.layers.14.mlp.gate_proj.weight.0.bin model.layers.32.mlp.gate_proj.weight.2.bin
model.layers.14.mlp.gate_proj.weight.1.bin model.layers.32.mlp.gate_proj.weight.3.bin
model.layers.14.mlp.gate_proj.weight.2.bin model.layers.32.mlp.up_proj.weight.0.bin
model.layers.14.mlp.gate_proj.weight.3.bin model.layers.32.mlp.up_proj.weight.1.bin
model.layers.14.mlp.up_proj.weight.0.bin model.layers.32.mlp.up_proj.weight.2.bin
model.layers.14.mlp.up_proj.weight.1.bin model.layers.32.mlp.up_proj.weight.3.bin
model.layers.14.mlp.up_proj.weight.2.bin model.layers.32.post_attention_layernorm.weight.bin
model.layers.14.mlp.up_proj.weight.3.bin model.layers.33.attention.dense.weight.0.bin
model.layers.14.post_attention_layernorm.weight.bin model.layers.33.attention.dense.weight.1.bin
model.layers.15.attention.dense.weight.0.bin model.layers.33.attention.dense.weight.2.bin
model.layers.15.attention.dense.weight.1.bin model.layers.33.attention.dense.weight.3.bin
model.layers.15.attention.dense.weight.2.bin model.layers.33.attention.query_key_value.weight.0.bin
model.layers.15.attention.dense.weight.3.bin model.layers.33.attention.query_key_value.weight.1.bin
model.layers.15.attention.query_key_value.weight.0.bin model.layers.33.attention.query_key_value.weight.2.bin
model.layers.15.attention.query_key_value.weight.1.bin model.layers.33.attention.query_key_value.weight.3.bin
model.layers.15.attention.query_key_value.weight.2.bin model.layers.33.input_layernorm.weight.bin
model.layers.15.attention.query_key_value.weight.3.bin model.layers.33.mlp.down_proj.weight.0.bin
model.layers.15.input_layernorm.weight.bin model.layers.33.mlp.down_proj.weight.1.bin
model.layers.15.mlp.down_proj.weight.0.bin model.layers.33.mlp.down_proj.weight.2.bin
model.layers.15.mlp.down_proj.weight.1.bin model.layers.33.mlp.down_proj.weight.3.bin
model.layers.15.mlp.down_proj.weight.2.bin model.layers.33.mlp.gate_proj.weight.0.bin
model.layers.15.mlp.down_proj.weight.3.bin model.layers.33.mlp.gate_proj.weight.1.bin
model.layers.15.mlp.gate_proj.weight.0.bin model.layers.33.mlp.gate_proj.weight.2.bin
model.layers.15.mlp.gate_proj.weight.1.bin model.layers.33.mlp.gate_proj.weight.3.bin
model.layers.15.mlp.gate_proj.weight.2.bin model.layers.33.mlp.up_proj.weight.0.bin
model.layers.15.mlp.gate_proj.weight.3.bin model.layers.33.mlp.up_proj.weight.1.bin
model.layers.15.mlp.up_proj.weight.0.bin model.layers.33.mlp.up_proj.weight.2.bin
model.layers.15.mlp.up_proj.weight.1.bin model.layers.33.mlp.up_proj.weight.3.bin
model.layers.15.mlp.up_proj.weight.2.bin model.layers.33.post_attention_layernorm.weight.bin
model.layers.15.mlp.up_proj.weight.3.bin model.layers.34.attention.dense.weight.0.bin
model.layers.15.post_attention_layernorm.weight.bin model.layers.34.attention.dense.weight.1.bin
model.layers.16.attention.dense.weight.0.bin model.layers.34.attention.dense.weight.2.bin
model.layers.16.attention.dense.weight.1.bin model.layers.34.attention.dense.weight.3.bin
model.layers.16.attention.dense.weight.2.bin model.layers.34.attention.query_key_value.weight.0.bin
model.layers.16.attention.dense.weight.3.bin model.layers.34.attention.query_key_value.weight.1.bin
model.layers.16.attention.query_key_value.weight.0.bin model.layers.34.attention.query_key_value.weight.2.bin
model.layers.16.attention.query_key_value.weight.1.bin model.layers.34.attention.query_key_value.weight.3.bin
model.layers.16.attention.query_key_value.weight.2.bin model.layers.34.input_layernorm.weight.bin
model.layers.16.attention.query_key_value.weight.3.bin model.layers.34.mlp.down_proj.weight.0.bin
model.layers.16.input_layernorm.weight.bin model.layers.34.mlp.down_proj.weight.1.bin
model.layers.16.mlp.down_proj.weight.0.bin model.layers.34.mlp.down_proj.weight.2.bin
model.layers.16.mlp.down_proj.weight.1.bin model.layers.34.mlp.down_proj.weight.3.bin
model.layers.16.mlp.down_proj.weight.2.bin model.layers.34.mlp.gate_proj.weight.0.bin
model.layers.16.mlp.down_proj.weight.3.bin model.layers.34.mlp.gate_proj.weight.1.bin
model.layers.16.mlp.gate_proj.weight.0.bin model.layers.34.mlp.gate_proj.weight.2.bin
model.layers.16.mlp.gate_proj.weight.1.bin model.layers.34.mlp.gate_proj.weight.3.bin
model.layers.16.mlp.gate_proj.weight.2.bin model.layers.34.mlp.up_proj.weight.0.bin
model.layers.16.mlp.gate_proj.weight.3.bin model.layers.34.mlp.up_proj.weight.1.bin
model.layers.16.mlp.up_proj.weight.0.bin model.layers.34.mlp.up_proj.weight.2.bin
model.layers.16.mlp.up_proj.weight.1.bin model.layers.34.mlp.up_proj.weight.3.bin
model.layers.16.mlp.up_proj.weight.2.bin model.layers.34.post_attention_layernorm.weight.bin
model.layers.16.mlp.up_proj.weight.3.bin model.layers.35.attention.dense.weight.0.bin
model.layers.16.post_attention_layernorm.weight.bin model.layers.35.attention.dense.weight.1.bin
model.layers.17.attention.dense.weight.0.bin model.layers.35.attention.dense.weight.2.bin
model.layers.17.attention.dense.weight.1.bin model.layers.35.attention.dense.weight.3.bin
model.layers.17.attention.dense.weight.2.bin model.layers.35.attention.query_key_value.weight.0.bin
model.layers.17.attention.dense.weight.3.bin model.layers.35.attention.query_key_value.weight.1.bin
model.layers.17.attention.query_key_value.weight.0.bin model.layers.35.attention.query_key_value.weight.2.bin
model.layers.17.attention.query_key_value.weight.1.bin model.layers.35.attention.query_key_value.weight.3.bin
model.layers.17.attention.query_key_value.weight.2.bin model.layers.35.input_layernorm.weight.bin
model.layers.17.attention.query_key_value.weight.3.bin model.layers.35.mlp.down_proj.weight.0.bin
model.layers.17.input_layernorm.weight.bin model.layers.35.mlp.down_proj.weight.1.bin
model.layers.17.mlp.down_proj.weight.0.bin model.layers.35.mlp.down_proj.weight.2.bin
model.layers.17.mlp.down_proj.weight.1.bin model.layers.35.mlp.down_proj.weight.3.bin
model.layers.17.mlp.down_proj.weight.2.bin model.layers.35.mlp.gate_proj.weight.0.bin
model.layers.17.mlp.down_proj.weight.3.bin model.layers.35.mlp.gate_proj.weight.1.bin
model.layers.17.mlp.gate_proj.weight.0.bin model.layers.35.mlp.gate_proj.weight.2.bin
model.layers.17.mlp.gate_proj.weight.1.bin model.layers.35.mlp.gate_proj.weight.3.bin
model.layers.17.mlp.gate_proj.weight.2.bin model.layers.35.mlp.up_proj.weight.0.bin
model.layers.17.mlp.gate_proj.weight.3.bin model.layers.35.mlp.up_proj.weight.1.bin
model.layers.17.mlp.up_proj.weight.0.bin model.layers.35.mlp.up_proj.weight.2.bin
model.layers.17.mlp.up_proj.weight.1.bin model.layers.35.mlp.up_proj.weight.3.bin
model.layers.17.mlp.up_proj.weight.2.bin model.layers.35.post_attention_layernorm.weight.bin
model.layers.17.mlp.up_proj.weight.3.bin model.layers.36.attention.dense.weight.0.bin
model.layers.17.post_attention_layernorm.weight.bin model.layers.36.attention.dense.weight.1.bin
model.layers.18.attention.dense.weight.0.bin model.layers.36.attention.dense.weight.2.bin
model.layers.18.attention.dense.weight.1.bin model.layers.36.attention.dense.weight.3.bin
model.layers.18.attention.dense.weight.2.bin model.layers.36.attention.query_key_value.weight.0.bin
model.layers.18.attention.dense.weight.3.bin model.layers.36.attention.query_key_value.weight.1.bin
model.layers.18.attention.query_key_value.weight.0.bin model.layers.36.attention.query_key_value.weight.2.bin
model.layers.18.attention.query_key_value.weight.1.bin model.layers.36.attention.query_key_value.weight.3.bin
model.layers.18.attention.query_key_value.weight.2.bin model.layers.36.input_layernorm.weight.bin
model.layers.18.attention.query_key_value.weight.3.bin model.layers.36.mlp.down_proj.weight.0.bin
model.layers.18.input_layernorm.weight.bin model.layers.36.mlp.down_proj.weight.1.bin
model.layers.18.mlp.down_proj.weight.0.bin model.layers.36.mlp.down_proj.weight.2.bin
model.layers.18.mlp.down_proj.weight.1.bin model.layers.36.mlp.down_proj.weight.3.bin
model.layers.18.mlp.down_proj.weight.2.bin model.layers.36.mlp.gate_proj.weight.0.bin
model.layers.18.mlp.down_proj.weight.3.bin model.layers.36.mlp.gate_proj.weight.1.bin
model.layers.18.mlp.gate_proj.weight.0.bin model.layers.36.mlp.gate_proj.weight.2.bin
model.layers.18.mlp.gate_proj.weight.1.bin model.layers.36.mlp.gate_proj.weight.3.bin
model.layers.18.mlp.gate_proj.weight.2.bin model.layers.36.mlp.up_proj.weight.0.bin
model.layers.18.mlp.gate_proj.weight.3.bin model.layers.36.mlp.up_proj.weight.1.bin
model.layers.18.mlp.up_proj.weight.0.bin model.layers.36.mlp.up_proj.weight.2.bin
model.layers.18.mlp.up_proj.weight.1.bin model.layers.36.mlp.up_proj.weight.3.bin
model.layers.18.mlp.up_proj.weight.2.bin model.layers.36.post_attention_layernorm.weight.bin
model.layers.18.mlp.up_proj.weight.3.bin model.layers.37.attention.dense.weight.0.bin
model.layers.18.post_attention_layernorm.weight.bin model.layers.37.attention.dense.weight.1.bin
model.layers.19.attention.dense.weight.0.bin model.layers.37.attention.dense.weight.2.bin
model.layers.19.attention.dense.weight.1.bin model.layers.37.attention.dense.weight.3.bin
model.layers.19.attention.dense.weight.2.bin model.layers.37.attention.query_key_value.weight.0.bin
model.layers.19.attention.dense.weight.3.bin model.layers.37.attention.query_key_value.weight.1.bin
model.layers.19.attention.query_key_value.weight.0.bin model.layers.37.attention.query_key_value.weight.2.bin
model.layers.19.attention.query_key_value.weight.1.bin model.layers.37.attention.query_key_value.weight.3.bin
model.layers.19.attention.query_key_value.weight.2.bin model.layers.37.input_layernorm.weight.bin
model.layers.19.attention.query_key_value.weight.3.bin model.layers.37.mlp.down_proj.weight.0.bin
model.layers.19.input_layernorm.weight.bin model.layers.37.mlp.down_proj.weight.1.bin
model.layers.19.mlp.down_proj.weight.0.bin model.layers.37.mlp.down_proj.weight.2.bin
model.layers.19.mlp.down_proj.weight.1.bin model.layers.37.mlp.down_proj.weight.3.bin
model.layers.19.mlp.down_proj.weight.2.bin model.layers.37.mlp.gate_proj.weight.0.bin
model.layers.19.mlp.down_proj.weight.3.bin model.layers.37.mlp.gate_proj.weight.1.bin
model.layers.19.mlp.gate_proj.weight.0.bin model.layers.37.mlp.gate_proj.weight.2.bin
model.layers.19.mlp.gate_proj.weight.1.bin model.layers.37.mlp.gate_proj.weight.3.bin
model.layers.19.mlp.gate_proj.weight.2.bin model.layers.37.mlp.up_proj.weight.0.bin
model.layers.19.mlp.gate_proj.weight.3.bin model.layers.37.mlp.up_proj.weight.1.bin
model.layers.19.mlp.up_proj.weight.0.bin model.layers.37.mlp.up_proj.weight.2.bin
model.layers.19.mlp.up_proj.weight.1.bin model.layers.37.mlp.up_proj.weight.3.bin
model.layers.19.mlp.up_proj.weight.2.bin model.layers.37.post_attention_layernorm.weight.bin
model.layers.19.mlp.up_proj.weight.3.bin model.layers.38.attention.dense.weight.0.bin
model.layers.19.post_attention_layernorm.weight.bin model.layers.38.attention.dense.weight.1.bin
model.layers.2.attention.dense.weight.0.bin model.layers.38.attention.dense.weight.2.bin
model.layers.2.attention.dense.weight.1.bin model.layers.38.attention.dense.weight.3.bin
model.layers.2.attention.dense.weight.2.bin model.layers.38.attention.query_key_value.weight.0.bin
model.layers.2.attention.dense.weight.3.bin model.layers.38.attention.query_key_value.weight.1.bin
model.layers.2.attention.query_key_value.weight.0.bin model.layers.38.attention.query_key_value.weight.2.bin
model.layers.2.attention.query_key_value.weight.1.bin model.layers.38.attention.query_key_value.weight.3.bin
model.layers.2.attention.query_key_value.weight.2.bin model.layers.38.input_layernorm.weight.bin
model.layers.2.attention.query_key_value.weight.3.bin model.layers.38.mlp.down_proj.weight.0.bin
model.layers.2.input_layernorm.weight.bin model.layers.38.mlp.down_proj.weight.1.bin
model.layers.2.mlp.down_proj.weight.0.bin model.layers.38.mlp.down_proj.weight.2.bin
model.layers.2.mlp.down_proj.weight.1.bin model.layers.38.mlp.down_proj.weight.3.bin
model.layers.2.mlp.down_proj.weight.2.bin model.layers.38.mlp.gate_proj.weight.0.bin
model.layers.2.mlp.down_proj.weight.3.bin model.layers.38.mlp.gate_proj.weight.1.bin
model.layers.2.mlp.gate_proj.weight.0.bin model.layers.38.mlp.gate_proj.weight.2.bin
model.layers.2.mlp.gate_proj.weight.1.bin model.layers.38.mlp.gate_proj.weight.3.bin
model.layers.2.mlp.gate_proj.weight.2.bin model.layers.38.mlp.up_proj.weight.0.bin
model.layers.2.mlp.gate_proj.weight.3.bin model.layers.38.mlp.up_proj.weight.1.bin
model.layers.2.mlp.up_proj.weight.0.bin model.layers.38.mlp.up_proj.weight.2.bin
model.layers.2.mlp.up_proj.weight.1.bin model.layers.38.mlp.up_proj.weight.3.bin
model.layers.2.mlp.up_proj.weight.2.bin model.layers.38.post_attention_layernorm.weight.bin
model.layers.2.mlp.up_proj.weight.3.bin model.layers.39.attention.dense.weight.0.bin
model.layers.2.post_attention_layernorm.weight.bin model.layers.39.attention.dense.weight.1.bin
model.layers.20.attention.dense.weight.0.bin model.layers.39.attention.dense.weight.2.bin
model.layers.20.attention.dense.weight.1.bin model.layers.39.attention.dense.weight.3.bin
model.layers.20.attention.dense.weight.2.bin model.layers.39.attention.query_key_value.weight.0.bin
model.layers.20.attention.dense.weight.3.bin model.layers.39.attention.query_key_value.weight.1.bin
model.layers.20.attention.query_key_value.weight.0.bin model.layers.39.attention.query_key_value.weight.2.bin
model.layers.20.attention.query_key_value.weight.1.bin model.layers.39.attention.query_key_value.weight.3.bin
model.layers.20.attention.query_key_value.weight.2.bin model.layers.39.input_layernorm.weight.bin
model.layers.20.attention.query_key_value.weight.3.bin model.layers.39.mlp.down_proj.weight.0.bin
model.layers.20.input_layernorm.weight.bin model.layers.39.mlp.down_proj.weight.1.bin
model.layers.20.mlp.down_proj.weight.0.bin model.layers.39.mlp.down_proj.weight.2.bin
model.layers.20.mlp.down_proj.weight.1.bin model.layers.39.mlp.down_proj.weight.3.bin
model.layers.20.mlp.down_proj.weight.2.bin model.layers.39.mlp.gate_proj.weight.0.bin
model.layers.20.mlp.down_proj.weight.3.bin model.layers.39.mlp.gate_proj.weight.1.bin
model.layers.20.mlp.gate_proj.weight.0.bin model.layers.39.mlp.gate_proj.weight.2.bin
model.layers.20.mlp.gate_proj.weight.1.bin model.layers.39.mlp.gate_proj.weight.3.bin
model.layers.20.mlp.gate_proj.weight.2.bin model.layers.39.mlp.up_proj.weight.0.bin
model.layers.20.mlp.gate_proj.weight.3.bin model.layers.39.mlp.up_proj.weight.1.bin
model.layers.20.mlp.up_proj.weight.0.bin model.layers.39.mlp.up_proj.weight.2.bin
model.layers.20.mlp.up_proj.weight.1.bin model.layers.39.mlp.up_proj.weight.3.bin
model.layers.20.mlp.up_proj.weight.2.bin model.layers.39.post_attention_layernorm.weight.bin
model.layers.20.mlp.up_proj.weight.3.bin model.layers.4.attention.dense.weight.0.bin
model.layers.20.post_attention_layernorm.weight.bin model.layers.4.attention.dense.weight.1.bin
model.layers.21.attention.dense.weight.0.bin model.layers.4.attention.dense.weight.2.bin
model.layers.21.attention.dense.weight.1.bin model.layers.4.attention.dense.weight.3.bin
model.layers.21.attention.dense.weight.2.bin model.layers.4.attention.query_key_value.weight.0.bin
model.layers.21.attention.dense.weight.3.bin model.layers.4.attention.query_key_value.weight.1.bin
model.layers.21.attention.query_key_value.weight.0.bin model.layers.4.attention.query_key_value.weight.2.bin
model.layers.21.attention.query_key_value.weight.1.bin model.layers.4.attention.query_key_value.weight.3.bin
model.layers.21.attention.query_key_value.weight.2.bin model.layers.4.input_layernorm.weight.bin
model.layers.21.attention.query_key_value.weight.3.bin model.layers.4.mlp.down_proj.weight.0.bin
model.layers.21.input_layernorm.weight.bin model.layers.4.mlp.down_proj.weight.1.bin
model.layers.21.mlp.down_proj.weight.0.bin model.layers.4.mlp.down_proj.weight.2.bin
model.layers.21.mlp.down_proj.weight.1.bin model.layers.4.mlp.down_proj.weight.3.bin
model.layers.21.mlp.down_proj.weight.2.bin model.layers.4.mlp.gate_proj.weight.0.bin
model.layers.21.mlp.down_proj.weight.3.bin model.layers.4.mlp.gate_proj.weight.1.bin
model.layers.21.mlp.gate_proj.weight.0.bin model.layers.4.mlp.gate_proj.weight.2.bin
model.layers.21.mlp.gate_proj.weight.1.bin model.layers.4.mlp.gate_proj.weight.3.bin
model.layers.21.mlp.gate_proj.weight.2.bin model.layers.4.mlp.up_proj.weight.0.bin
model.layers.21.mlp.gate_proj.weight.3.bin model.layers.4.mlp.up_proj.weight.1.bin
model.layers.21.mlp.up_proj.weight.0.bin model.layers.4.mlp.up_proj.weight.2.bin
model.layers.21.mlp.up_proj.weight.1.bin model.layers.4.mlp.up_proj.weight.3.bin
model.layers.21.mlp.up_proj.weight.2.bin model.layers.4.post_attention_layernorm.weight.bin
model.layers.21.mlp.up_proj.weight.3.bin model.layers.5.attention.dense.weight.0.bin
model.layers.21.post_attention_layernorm.weight.bin model.layers.5.attention.dense.weight.1.bin
model.layers.22.attention.dense.weight.0.bin model.layers.5.attention.dense.weight.2.bin
model.layers.22.attention.dense.weight.1.bin model.layers.5.attention.dense.weight.3.bin
model.layers.22.attention.dense.weight.2.bin model.layers.5.attention.query_key_value.weight.0.bin
model.layers.22.attention.dense.weight.3.bin model.layers.5.attention.query_key_value.weight.1.bin
model.layers.22.attention.query_key_value.weight.0.bin model.layers.5.attention.query_key_value.weight.2.bin
model.layers.22.attention.query_key_value.weight.1.bin model.layers.5.attention.query_key_value.weight.3.bin
model.layers.22.attention.query_key_value.weight.2.bin model.layers.5.input_layernorm.weight.bin
model.layers.22.attention.query_key_value.weight.3.bin model.layers.5.mlp.down_proj.weight.0.bin
model.layers.22.input_layernorm.weight.bin model.layers.5.mlp.down_proj.weight.1.bin
model.layers.22.mlp.down_proj.weight.0.bin model.layers.5.mlp.down_proj.weight.2.bin
model.layers.22.mlp.down_proj.weight.1.bin model.layers.5.mlp.down_proj.weight.3.bin
model.layers.22.mlp.down_proj.weight.2.bin model.layers.5.mlp.gate_proj.weight.0.bin
model.layers.22.mlp.down_proj.weight.3.bin model.layers.5.mlp.gate_proj.weight.1.bin
model.layers.22.mlp.gate_proj.weight.0.bin model.layers.5.mlp.gate_proj.weight.2.bin
model.layers.22.mlp.gate_proj.weight.1.bin model.layers.5.mlp.gate_proj.weight.3.bin
model.layers.22.mlp.gate_proj.weight.2.bin model.layers.5.mlp.up_proj.weight.0.bin
model.layers.22.mlp.gate_proj.weight.3.bin model.layers.5.mlp.up_proj.weight.1.bin
model.layers.22.mlp.up_proj.weight.0.bin model.layers.5.mlp.up_proj.weight.2.bin
model.layers.22.mlp.up_proj.weight.1.bin model.layers.5.mlp.up_proj.weight.3.bin
model.layers.22.mlp.up_proj.weight.2.bin model.layers.5.post_attention_layernorm.weight.bin
model.layers.22.mlp.up_proj.weight.3.bin model.layers.6.attention.dense.weight.0.bin
model.layers.22.post_attention_layernorm.weight.bin model.layers.6.attention.dense.weight.1.bin
model.layers.23.attention.dense.weight.0.bin model.layers.6.attention.dense.weight.2.bin
model.layers.23.attention.dense.weight.1.bin model.layers.6.attention.dense.weight.3.bin
model.layers.23.attention.dense.weight.2.bin model.layers.6.attention.query_key_value.weight.0.bin
model.layers.23.attention.dense.weight.3.bin model.layers.6.attention.query_key_value.weight.1.bin
model.layers.23.attention.query_key_value.weight.0.bin model.layers.6.attention.query_key_value.weight.2.bin
model.layers.23.attention.query_key_value.weight.1.bin model.layers.6.attention.query_key_value.weight.3.bin
model.layers.23.attention.query_key_value.weight.2.bin model.layers.6.input_layernorm.weight.bin
model.layers.23.attention.query_key_value.weight.3.bin model.layers.6.mlp.down_proj.weight.0.bin
model.layers.23.input_layernorm.weight.bin model.layers.6.mlp.down_proj.weight.1.bin
model.layers.23.mlp.down_proj.weight.0.bin model.layers.6.mlp.down_proj.weight.2.bin
model.layers.23.mlp.down_proj.weight.1.bin model.layers.6.mlp.down_proj.weight.3.bin
model.layers.23.mlp.down_proj.weight.2.bin model.layers.6.mlp.gate_proj.weight.0.bin
model.layers.23.mlp.down_proj.weight.3.bin model.layers.6.mlp.gate_proj.weight.1.bin
model.layers.23.mlp.gate_proj.weight.0.bin model.layers.6.mlp.gate_proj.weight.2.bin
model.layers.23.mlp.gate_proj.weight.1.bin model.layers.6.mlp.gate_proj.weight.3.bin
model.layers.23.mlp.gate_proj.weight.2.bin model.layers.6.mlp.up_proj.weight.0.bin
model.layers.23.mlp.gate_proj.weight.3.bin model.layers.6.mlp.up_proj.weight.1.bin
model.layers.23.mlp.up_proj.weight.0.bin model.layers.6.mlp.up_proj.weight.2.bin
model.layers.23.mlp.up_proj.weight.1.bin model.layers.6.mlp.up_proj.weight.3.bin
model.layers.23.mlp.up_proj.weight.2.bin model.layers.6.post_attention_layernorm.weight.bin
model.layers.23.mlp.up_proj.weight.3.bin model.layers.7.attention.dense.weight.0.bin
model.layers.23.post_attention_layernorm.weight.bin model.layers.7.attention.dense.weight.1.bin
model.layers.24.attention.dense.weight.0.bin model.layers.7.attention.dense.weight.2.bin
model.layers.24.attention.dense.weight.1.bin model.layers.7.attention.dense.weight.3.bin
model.layers.24.attention.dense.weight.2.bin model.layers.7.attention.query_key_value.weight.0.bin
model.layers.24.attention.dense.weight.3.bin model.layers.7.attention.query_key_value.weight.1.bin
model.layers.24.attention.query_key_value.weight.0.bin model.layers.7.attention.query_key_value.weight.2.bin
model.layers.24.attention.query_key_value.weight.1.bin model.layers.7.attention.query_key_value.weight.3.bin
model.layers.24.attention.query_key_value.weight.2.bin model.layers.7.input_layernorm.weight.bin
model.layers.24.attention.query_key_value.weight.3.bin model.layers.7.mlp.down_proj.weight.0.bin
model.layers.24.input_layernorm.weight.bin model.layers.7.mlp.down_proj.weight.1.bin
model.layers.24.mlp.down_proj.weight.0.bin model.layers.7.mlp.down_proj.weight.2.bin
model.layers.24.mlp.down_proj.weight.1.bin model.layers.7.mlp.down_proj.weight.3.bin
model.layers.24.mlp.down_proj.weight.2.bin model.layers.7.mlp.gate_proj.weight.0.bin
model.layers.24.mlp.down_proj.weight.3.bin model.layers.7.mlp.gate_proj.weight.1.bin
model.layers.24.mlp.gate_proj.weight.0.bin model.layers.7.mlp.gate_proj.weight.2.bin
model.layers.24.mlp.gate_proj.weight.1.bin model.layers.7.mlp.gate_proj.weight.3.bin
model.layers.24.mlp.gate_proj.weight.2.bin model.layers.7.mlp.up_proj.weight.0.bin
model.layers.24.mlp.gate_proj.weight.3.bin model.layers.7.mlp.up_proj.weight.1.bin
model.layers.24.mlp.up_proj.weight.0.bin model.layers.7.mlp.up_proj.weight.2.bin
model.layers.24.mlp.up_proj.weight.1.bin model.layers.7.mlp.up_proj.weight.3.bin
model.layers.24.mlp.up_proj.weight.2.bin model.layers.7.post_attention_layernorm.weight.bin
model.layers.24.mlp.up_proj.weight.3.bin model.layers.8.attention.dense.weight.0.bin
model.layers.24.post_attention_layernorm.weight.bin model.layers.8.attention.dense.weight.1.bin
model.layers.25.attention.dense.weight.0.bin model.layers.8.attention.dense.weight.2.bin
model.layers.25.attention.dense.weight.1.bin model.layers.8.attention.dense.weight.3.bin
model.layers.25.attention.dense.weight.2.bin model.layers.8.attention.query_key_value.weight.0.bin
model.layers.25.attention.dense.weight.3.bin model.layers.8.attention.query_key_value.weight.1.bin
model.layers.25.attention.query_key_value.weight.0.bin model.layers.8.attention.query_key_value.weight.2.bin
model.layers.25.attention.query_key_value.weight.1.bin model.layers.8.attention.query_key_value.weight.3.bin
model.layers.25.attention.query_key_value.weight.2.bin model.layers.8.input_layernorm.weight.bin
model.layers.25.attention.query_key_value.weight.3.bin model.layers.8.mlp.down_proj.weight.0.bin
model.layers.25.input_layernorm.weight.bin model.layers.8.mlp.down_proj.weight.1.bin
model.layers.25.mlp.down_proj.weight.0.bin model.layers.8.mlp.down_proj.weight.2.bin
model.layers.25.mlp.down_proj.weight.1.bin model.layers.8.mlp.down_proj.weight.3.bin
model.layers.25.mlp.down_proj.weight.2.bin model.layers.8.mlp.gate_proj.weight.0.bin
model.layers.25.mlp.down_proj.weight.3.bin model.layers.8.mlp.gate_proj.weight.1.bin
model.layers.25.mlp.gate_proj.weight.0.bin model.layers.8.mlp.gate_proj.weight.2.bin
model.layers.25.mlp.gate_proj.weight.1.bin model.layers.8.mlp.gate_proj.weight.3.bin
model.layers.25.mlp.gate_proj.weight.2.bin model.layers.8.mlp.up_proj.weight.0.bin
model.layers.25.mlp.gate_proj.weight.3.bin model.layers.8.mlp.up_proj.weight.1.bin
model.layers.25.mlp.up_proj.weight.0.bin model.layers.8.mlp.up_proj.weight.2.bin
model.layers.25.mlp.up_proj.weight.1.bin model.layers.8.mlp.up_proj.weight.3.bin
model.layers.25.mlp.up_proj.weight.2.bin model.layers.8.post_attention_layernorm.weight.bin
model.layers.25.mlp.up_proj.weight.3.bin model.layers.9.attention.dense.weight.0.bin
model.layers.25.post_attention_layernorm.weight.bin model.layers.9.attention.dense.weight.1.bin
model.layers.26.attention.dense.weight.0.bin model.layers.9.attention.dense.weight.2.bin
model.layers.26.attention.dense.weight.1.bin model.layers.9.attention.dense.weight.3.bin
model.layers.26.attention.dense.weight.2.bin model.layers.9.attention.query_key_value.weight.0.bin
model.layers.26.attention.dense.weight.3.bin model.layers.9.attention.query_key_value.weight.1.bin
model.layers.26.attention.query_key_value.weight.0.bin model.layers.9.attention.query_key_value.weight.2.bin
model.layers.26.attention.query_key_value.weight.1.bin model.layers.9.attention.query_key_value.weight.3.bin
model.layers.26.attention.query_key_value.weight.2.bin model.layers.9.input_layernorm.weight.bin
model.layers.26.attention.query_key_value.weight.3.bin model.layers.9.mlp.down_proj.weight.0.bin
model.layers.26.input_layernorm.weight.bin model.layers.9.mlp.down_proj.weight.1.bin
model.layers.26.mlp.down_proj.weight.0.bin model.layers.9.mlp.down_proj.weight.2.bin
model.layers.26.mlp.down_proj.weight.1.bin model.layers.9.mlp.down_proj.weight.3.bin
model.layers.26.mlp.down_proj.weight.2.bin model.layers.9.mlp.gate_proj.weight.0.bin
model.layers.26.mlp.down_proj.weight.3.bin model.layers.9.mlp.gate_proj.weight.1.bin
model.layers.26.mlp.gate_proj.weight.0.bin model.layers.9.mlp.gate_proj.weight.2.bin
model.layers.26.mlp.gate_proj.weight.1.bin model.layers.9.mlp.gate_proj.weight.3.bin
model.layers.26.mlp.gate_proj.weight.2.bin model.layers.9.mlp.up_proj.weight.0.bin
model.layers.26.mlp.gate_proj.weight.3.bin model.layers.9.mlp.up_proj.weight.1.bin
model.layers.26.mlp.up_proj.weight.0.bin model.layers.9.mlp.up_proj.weight.2.bin
model.layers.26.mlp.up_proj.weight.1.bin model.layers.9.mlp.up_proj.weight.3.bin
model.layers.26.mlp.up_proj.weight.2.bin model.layers.9.post_attention_layernorm.weight.bin
model.layers.26.mlp.up_proj.weight.3.bin model.lm_head.weight.bin
model.layers.26.post_attention_layernorm.weight.bin model.wte.weight.bin
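For anyone inspecting these converted weights: each GEMM weight is split into tensor-parallel shards (weight.0.bin through weight.3.bin here, matching tensor_para_size=4), while the layernorm vectors stay unsplit. A minimal sketch of reassembling one shard set, assuming the files are raw fp16 dumps written with numpy's tofile, that gate_proj/up_proj are split along the output (inter_size) dimension, and hypothetical llama-13b shapes:

import numpy as np

hidden_size, inter_size, tp = 5120, 13824, 4  # assumed llama-13b shapes, tensor_para_size=4

# Each shard is assumed to hold a (hidden_size, inter_size // tp) slice of the full matrix.
shards = [
    np.fromfile(f"model.layers.8.mlp.gate_proj.weight.{i}.bin", dtype=np.float16)
      .reshape(hidden_size, inter_size // tp)
    for i in range(tp)
]
full = np.concatenate(shards, axis=-1)  # back to (hidden_size, inter_size)
print(full.shape)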
Is there code to autotune the CUDA kernels like in the GPT-J example? How could we modify the GPT-J kernel optimiser to work with LLaMA models?
@SupreethRao99 you can refer to this directory structure; the Triton server needs a version directory named like 1, 2, ...: https://github.com/triton-inference-server/fastertransformer_backend/tree/main/all_models/gptj
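In other words, the backend expects a layout roughly like this (a sketch based on the gptj example linked above; vicuna-13b is just the repository name used in this thread):

models/vicuna-13b/
  fastertransformer/
    config.pbtxt
    1/
      model.layers.0.attention.dense.weight.0.bin
      ...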
@void-main failed to early stop with end_id
This MR fixes the bug. https://github.com/NVIDIA/FasterTransformer/pull/584/commits/622af28de55a09a253a23945d22f3015def49713
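With that fix in, early stopping can be exercised from a client by passing end_id explicitly. A minimal sketch using tritonclient over HTTP, assuming the server from this thread on localhost:8000 and the input names from the config.pbtxt posted below; the token ids, output length, and the </s> id of 2 are all assumptions here:

import numpy as np
import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="localhost:8000")

input_ids = np.array([[1, 306, 4658]], dtype=np.uint32)   # hypothetical token ids, batch of 1
inputs = [
    httpclient.InferInput("input_ids", list(input_ids.shape), "UINT32"),
    httpclient.InferInput("input_lengths", [1, 1], "UINT32"),
    httpclient.InferInput("request_output_len", [1, 1], "UINT32"),
    httpclient.InferInput("end_id", [1, 1], "UINT32"),
]
inputs[0].set_data_from_numpy(input_ids)
inputs[1].set_data_from_numpy(np.array([[input_ids.shape[1]]], dtype=np.uint32))
inputs[2].set_data_from_numpy(np.array([[128]], dtype=np.uint32))  # generate up to 128 tokens
inputs[3].set_data_from_numpy(np.array([[2]], dtype=np.uint32))    # assumed </s> id; stop early on it

result = client.infer("fastertransformer", inputs)
print(result.as_numpy("output_ids"))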
@Lzhang-hub I followed the instructions and I'm getting the following error now
root@eda821372bac:/workspace# /opt/tritonserver/bin/tritonserver --model-repository=./models/vicuna-13b
I0506 07:54:26.454366 15432 pinned_memory_manager.cc:240] Pinned memory pool is created at '0x7f758a000000' with size 268435456
I0506 07:54:26.458193 15432 cuda_memory_manager.cc:105] CUDA memory pool is created on device 0 with size 67108864
I0506 07:54:26.458208 15432 cuda_memory_manager.cc:105] CUDA memory pool is created on device 1 with size 67108864
I0506 07:54:26.458213 15432 cuda_memory_manager.cc:105] CUDA memory pool is created on device 2 with size 67108864
I0506 07:54:26.458225 15432 cuda_memory_manager.cc:105] CUDA memory pool is created on device 3 with size 67108864
W0506 07:54:26.876510 15432 server.cc:237] failed to enable peer access for some device pairs
I0506 07:54:26.889064 15432 model_lifecycle.cc:459] loading: fastertransformer:1
I0506 07:54:27.149907 15432 libfastertransformer.cc:1828] TRITONBACKEND_Initialize: fastertransformer
I0506 07:54:27.149962 15432 libfastertransformer.cc:1838] Triton TRITONBACKEND API version: 1.12
I0506 07:54:27.149979 15432 libfastertransformer.cc:1844] 'fastertransformer' TRITONBACKEND API version: 1.12
I0506 07:54:27.828205 15432 libfastertransformer.cc:1876] TRITONBACKEND_ModelInitialize: fastertransformer (version 1)
I0506 07:54:27.829096 15432 libfastertransformer.cc:372] Instance group type: KIND_CPU count: 1
I0506 07:54:27.829118 15432 libfastertransformer.cc:402] Sequence Batching: disabled
I0506 07:54:27.829131 15432 libfastertransformer.cc:412] Dynamic Batching: disabled
I0506 07:54:27.829299 15432 libfastertransformer.cc:1899] TRITONBACKEND_ModelFinalize: delete model state
I0506 07:54:27.829311 15432 libfastertransformer.cc:1904] TRITONBACKEND_ModelFinalize: MPI Finalize
E0506 07:54:27.883287 15432 model_lifecycle.cc:597] failed to load 'fastertransformer' version 1: Unsupported: Unknown model "vicuna-13b"
I0506 07:54:27.883453 15432 server.cc:583]
+------------------+------+
| Repository Agent | Path |
+------------------+------+
+------------------+------+
I0506 07:54:27.883517 15432 server.cc:610]
+-------------------+-----------------------------------------------------+-----------------------------------------------------+
| Backend | Path | Config |
+-------------------+-----------------------------------------------------+-----------------------------------------------------+
| fastertransformer | /opt/tritonserver/backends/fastertransformer/libtri | {"cmdline":{"auto-complete-config":"true","backend- |
| | ton_fastertransformer.so | directory":"/opt/tritonserver/backends","min-comput |
| | | e-capability":"6.000000","default-max-batch-size":" |
| | | 4"}} |
| | | |
+-------------------+-----------------------------------------------------+-----------------------------------------------------+
I0506 07:54:27.883574 15432 server.cc:653]
+-------------------+---------+------------------------------------------------------+
| Model | Version | Status |
+-------------------+---------+------------------------------------------------------+
| fastertransformer | 1 | UNAVAILABLE: Unsupported: Unknown model "vicuna-13b" |
+-------------------+---------+------------------------------------------------------+
I0506 07:54:27.933348 15432 metrics.cc:808] Collecting metrics for GPU 0: Tesla T4
I0506 07:54:27.933420 15432 metrics.cc:808] Collecting metrics for GPU 1: Tesla T4
I0506 07:54:27.933434 15432 metrics.cc:808] Collecting metrics for GPU 2: Tesla T4
I0506 07:54:27.933451 15432 metrics.cc:808] Collecting metrics for GPU 3: Tesla T4
I0506 07:54:27.934718 15432 metrics.cc:701] Collecting CPU metrics
I0506 07:54:27.935008 15432 tritonserver.cc:2387]
+----------------------------------+----------------------------------------------------------------------------------------------+
| Option | Value |
+----------------------------------+----------------------------------------------------------------------------------------------+
| server_id | triton |
| server_version | 2.33.0 |
| server_extensions | classification sequence model_repository model_repository(unload_dependents) schedule_policy |
| | model_configuration system_shared_memory cuda_shared_memory binary_tensor_data parameters s |
| | tatistics trace logging |
| model_repository_path[0] | ./models/vicuna-13b |
| model_control_mode | MODE_NONE |
| strict_model_config | 0 |
| rate_limit | OFF |
| pinned_memory_pool_byte_size | 268435456 |
| cuda_memory_pool_byte_size{0} | 67108864 |
| cuda_memory_pool_byte_size{1} | 67108864 |
| cuda_memory_pool_byte_size{2} | 67108864 |
| cuda_memory_pool_byte_size{3} | 67108864 |
| min_supported_compute_capability | 6.0 |
| strict_readiness | 1 |
| exit_timeout | 30 |
| cache_enabled | 0 |
+----------------------------------+----------------------------------------------------------------------------------------------+
I0506 07:54:27.935044 15432 server.cc:284] Waiting for in-flight requests to complete.
I0506 07:54:27.935056 15432 server.cc:300] Timeout 30: Found 0 model versions that have in-flight inferences
I0506 07:54:27.935065 15432 server.cc:315] All models are stopped, unloading models
I0506 07:54:27.935077 15432 server.cc:322] Timeout 30: Found 0 live models and 0 in-flight non-inference requests
error: creating server: Internal - failed to load all models
My models are now stored under workspace/models/vicuna-13b/fastertransformer. That folder contains config.pbtxt and a directory named 1 with all the model weights. My new config.pbtxt contains:
name: "fastertransformer"
backend: "fastertransformer"
default_model_filename: "vicuna-13b"
max_batch_size: 1024
model_transaction_policy {
decoupled: False
}
input [
{
name: "input_ids"
data_type: TYPE_UINT32
dims: [ -1 ]
},
{
name: "start_id"
data_type: TYPE_UINT32
dims: [ 1 ]
reshape: { shape: [ ] }
optional: true
},
{
name: "end_id"
data_type: TYPE_UINT32
dims: [ 1 ]
reshape: { shape: [ ] }
optional: true
},
{
name: "input_lengths"
data_type: TYPE_UINT32
dims: [ 1 ]
reshape: { shape: [ ] }
},
{
name: "request_output_len"
data_type: TYPE_UINT32
dims: [ -1 ]
},
{
name: "runtime_top_k"
data_type: TYPE_UINT32
dims: [ 1 ]
reshape: { shape: [ ] }
optional: true
},
{
name: "runtime_top_p"
data_type: TYPE_FP32
dims: [ 1 ]
reshape: { shape: [ ] }
optional: true
},
{
name: "beam_search_diversity_rate"
data_type: TYPE_FP32
dims: [ 1 ]
reshape: { shape: [ ] }
optional: true
},
{
name: "temperature"
data_type: TYPE_FP32
dims: [ 1 ]
reshape: { shape: [ ] }
optional: true
},
{
name: "len_penalty"
data_type: TYPE_FP32
dims: [ 1 ]
reshape: { shape: [ ] }
optional: true
},
{
name: "repetition_penalty"
data_type: TYPE_FP32
dims: [ 1 ]
reshape: { shape: [ ] }
optional: true
},
{
name: "random_seed"
data_type: TYPE_UINT64
dims: [ 1 ]
reshape: { shape: [ ] }
optional: true
},
{
name: "is_return_log_probs"
data_type: TYPE_BOOL
dims: [ 1 ]
reshape: { shape: [ ] }
optional: true
},
{
name: "beam_width"
data_type: TYPE_UINT32
dims: [ 1 ]
reshape: { shape: [ ] }
optional: true
},
{
name: "bad_words_list"
data_type: TYPE_INT32
dims: [ 2, -1 ]
optional: true
},
{
name: "stop_words_list"
data_type: TYPE_INT32
dims: [ 2, -1 ]
optional: true
},
{
name: "prompt_learning_task_name_ids"
data_type: TYPE_UINT32
dims: [ 1 ]
reshape: { shape: [ ] }
optional: true
},
{
name: "top_p_decay"
data_type: TYPE_FP32
dims: [ 1 ]
reshape: { shape: [ ] }
optional: true
},
{
name: "top_p_min"
data_type: TYPE_FP32
dims: [ 1 ]
reshape: { shape: [ ] }
optional: true
},
{
name: "top_p_reset_ids"
data_type: TYPE_UINT32
dims: [ 1 ]
reshape: { shape: [ ] }
optional: true
}
]
output [
{
name: "output_ids"
data_type: TYPE_UINT32
dims: [ -1, -1 ]
},
{
name: "sequence_length"
data_type: TYPE_UINT32
dims: [ -1 ]
},
{
name: "cum_log_probs"
data_type: TYPE_FP32
dims: [ -1 ]
},
{
name: "output_log_probs"
data_type: TYPE_FP32
dims: [ -1, -1 ]
}
]
instance_group [
{
count: 1
kind: KIND_CPU
}
]
parameters {
key: "tensor_para_size"
value: {
string_value: "4"
}
}
parameters {
key: "pipeline_para_size"
value: {
string_value: "1"
}
}
parameters {
key: "data_type"
value: {
string_value: "fp16"
}
}
parameters {
key: "model_type"
value: {
string_value: "vicuna-13b"
}
}
parameters {
key: "model_checkpoint_path"
value: {
string_value: "/workspace/models/vicuna-13b/fastertransformer/1/"
}
}
parameters {
key: "enable_custom_all_reduce"
value: {
string_value: "0"
}
}
This MR fixes the bug. 622af28
Got it, merging it now.
@SupreethRao99, try changing the model_type to Llama in your vicuna-13b/fastertransformer/config.pbtxt.
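Concretely, only the model_type parameter needs to change; a sketch of the corrected block (assuming the backend matches on the exact string "Llama"):

parameters {
  key: "model_type"
  value: {
    string_value: "Llama"
  }
}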
Given existing support for GPT-J and its rotary embeddings, is LLaMA supported as well? Huggingface just shipped their implementation: https://github.com/huggingface/transformers/commit/464d420775653885760e30d24d3703e14f4e8a14
@byshiue