NVIDIA / FasterTransformer

Transformer related optimization, including BERT, GPT
Apache License 2.0

LLaMA support #506

Open · michaelroyzen opened this issue 1 year ago

michaelroyzen commented 1 year ago

Given existing support for GPT-J and its rotary embeddings, is LLaMA supported as well? Huggingface just shipped their implementation: https://github.com/huggingface/transformers/commit/464d420775653885760e30d24d3703e14f4e8a14

@byshiue

byshiue commented 1 year ago

Can you explain what the differences are between GPT-J and LLaMA?

teknium1 commented 1 year ago

+1 for this

michaelroyzen commented 1 year ago

They look very similar. On HuggingFace's doc page, they say that the implementation is based on the GPT-NeoX codebase, which FasterTransformer already seems to support: https://huggingface.co/docs/transformers/main/model_doc/llama.

Do you think it'll work?

yuikns commented 1 year ago

+1

@byshiue According to our investigation, it is not difficult to port this model to Megatron as well. But I am not sure whether a single conversion script will work.

byshiue commented 1 year ago

Thank you for the suggestion and discussion. We may not have time to work on that issue right now. If you are interested, you can try to add support yourself. You are welcome to ask questions if you encounter any problems, and to merge your work back into our repo if you get it working.

Hap-Zhang commented 1 year ago

+1 for this

michaelroyzen commented 1 year ago

It seems to be quite a simple implementation @byshiue. All that needs to be done is to implement RMS layer norm in GPT-NeoX and support the SiLU activation. It seems that both of these features are already implemented elsewhere in FasterTransformer.

I'd be happy to take the lead if you can help me with the general steps.

ZZR0 commented 1 year ago

+1 for this

troycheng commented 1 year ago

+1 for this

moonscar commented 1 year ago

I compared the GPT-J and LLaMA models in huggingface; they have the same attention layer. There are some differences in the FFN: LLaMA uses three weight matrices, and the forward function is as follows:

    def forward(self, x):
        return self.w2(F.silu(self.w1(x)) * self.w3(x))

I checked the relevant FFN-layer code in the source, and there doesn't seem to be a similar structure. Or maybe such a layer already exists in the current code and I just haven't found it; I'd appreciate any tips. @byshiue

byshiue commented 1 year ago

> I compared the GPT-J and LLaMA models in huggingface; they have the same attention layer. There are some differences in the FFN: LLaMA uses three weight matrices, and the forward function is as follows:
>
>     def forward(self, x):
>         return self.w2(F.silu(self.w1(x)) * self.w3(x))
>
> I checked the relevant FFN-layer code in the source, and there doesn't seem to be a similar structure. Or maybe such a layer already exists in the current code and I just haven't found it; I'd appreciate any tips. @byshiue

It looks like a standard gated SiLU. Can you explain what difference you see?
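
For reference, a minimal PyTorch sketch of the gated-SiLU (SwiGLU) FFN pattern that LLaMA's forward() above implements; the module and layer names here are illustrative, not FT's actual identifiers:

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class GatedSiluFFN(nn.Module):
        """Gated-SiLU FFN: silu(w1 x) gates a parallel up-projection (w3),
        then w2 projects back down, matching LLaMA's three-weight forward()."""
        def __init__(self, hidden_size: int, inter_size: int):
            super().__init__()
            self.w1 = nn.Linear(hidden_size, inter_size, bias=False)  # gate projection
            self.w3 = nn.Linear(hidden_size, inter_size, bias=False)  # up projection
            self.w2 = nn.Linear(inter_size, hidden_size, bias=False)  # down projection

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            return self.w2(F.silu(self.w1(x)) * self.w3(x))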

moonscar commented 1 year ago

Thanks for the reminder, I missed this part. I will try to make this work.

michaelroyzen commented 1 year ago

Wow, thank you @moonscar. Want any help? What's the status of your PR?

AnShengqiang commented 1 year ago

need this too

Anychnn commented 1 year ago

@moonscar have you started this work? Or I can help with it.

michaelroyzen commented 1 year ago

Don't think it's been started yet @Anychnn

michaelroyzen commented 1 year ago

Given the interest and activity here, I'd like to offer a bounty of $2,500 USD to whoever can get Llama implemented in FT. Please email me at michael@phind.com if you're interested. @moonscar @AnShengqiang @Anychnn @byshiue

It seems that all that needs to be done is to copy T5's RMS layer norm (already implemented in FT) and UL2's gated SiLU (also already implemented elsewhere in FT) into GPT-NeoX. As per Huggingface's implementation of Llama, it is otherwise completely identical to GPT-NeoX (which is already implemented in FT).
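
For concreteness, a minimal sketch of T5-style RMS layer norm (scale only, no mean subtraction, no bias); this assumes nothing about FT's actual kernels, and note that LLaMA uses eps = 1e-6:

    import torch
    import torch.nn as nn

    class RMSNorm(nn.Module):
        """T5/LLaMA-style RMS norm: divide by root-mean-square, then scale."""
        def __init__(self, hidden_size: int, eps: float = 1e-6):
            super().__init__()
            self.weight = nn.Parameter(torch.ones(hidden_size))
            self.eps = eps

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            variance = x.pow(2).mean(-1, keepdim=True)  # mean of squares, not centered
            return self.weight * x * torch.rsqrt(variance + self.eps)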

michaelroyzen commented 1 year ago

The bounty will be $3,000 if a correct and working PR is opened by the end of Friday, April 21st (Pacific Time).

jinluyang commented 1 year ago

I'd be glad to help with part of the work, for example converting the weights to FT.

cameronfr commented 1 year ago

Made a lot of progress on this, but my current FT model is outputting seemingly random tokens, so there's something wrong with my weight conversion or maybe even the exact layer implementation. If someone wants to pick up the torch (I am done for now 😞), the next step would probably be to compare, layer by layer, the output of the Huggingface model vs. this FT model:

Weights conversion: https://github.com/cameronfr/FasterTransformer/blob/main/examples/cpp/llama/huggingface_llama_convert.py
FT model: https://github.com/cameronfr/FasterTransformer/tree/main/src/fastertransformer/models/llama
Testing: https://github.com/cameronfr/FasterTransformer/tree/main/examples/cpp/llama

Everything is modified from the respective GPTNeoX versions. LlamaContextDecoder and LlamaDecoder essentially just change Gelu -> gated SiLU and LayerNorm -> LayerNormT5. LlamaDecoderLayerWeight and LlamaWeight set the parameters of these layers.
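
One hedged way to do that layer-by-layer comparison on the Huggingface side (checkpoint path is a placeholder): dump every layer's hidden states and diff them against the tensors the FT model produces at the same depth.

    import torch
    from transformers import LlamaForCausalLM, LlamaTokenizer

    path = "path/to/llama-7b-hf"  # placeholder checkpoint path
    model = LlamaForCausalLM.from_pretrained(path)
    tok = LlamaTokenizer.from_pretrained(path)

    ids = tok("Hello world", return_tensors="pt").input_ids
    with torch.no_grad():
        out = model(ids, output_hidden_states=True)

    # hidden_states[0] is the embedding output; hidden_states[i] is the output
    # of decoder layer i-1. Save each so it can be diffed against FT's
    # per-layer output at the same index.
    for i, h in enumerate(out.hidden_states):
        torch.save(h.float().cpu(), f"hf_hidden_{i}.pt")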

Anychnn commented 1 year ago

@cameronfr The default layernorm eps in llama.h is set to 1e-5, but llama-7b-torch defaults it to 1e-6. The attention module output is also incorrect; I am fixing this.

jinluyang commented 1 year ago

@cameronfr I think the reshape of qkv here might not be correct: https://github.com/cameronfr/FasterTransformer/blob/45d48f9d06713cd006f7d95d4b2f99a4bd3abb11/examples/cpp/llama/huggingface_llama_convert.py#L97. The huggingface-format qkv proj is permuted in preparation for rotary embedding: https://github.com/huggingface/transformers/blob/d04ec99bec8a0b432fc03ed60cea9a1a20ebaf3c/src/transformers/models/llama/convert_llama_weights_to_hf.py#L101. So I tried something like:

    qkvArr[:, 0, :, :] = qArr.reshape(n_heads, 2, head_size//2, hidden_size).transpose((3, 0, 2, 1)).reshape(hidden_size, n_heads, head_size)

and fixed the layernorm_eps, but the output tokens are still seemingly incorrect, not a sentence. I also changed start_ids.csv not to use the one from gptneox, since the two models may not share the same token ids.
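
For what it's worth, HF's converter permutes the q/k weight rows into a half-split layout for its rotary implementation; if FT's kernel expects the original interleaved layout instead, a sketch of the exact inverse of HF's permute() would look like this (numpy, names illustrative, offered as an assumption rather than a verified fix):

    import numpy as np

    def unpermute(w: np.ndarray, n_heads: int, dim: int) -> np.ndarray:
        # HF's permute() is:
        #   w.view(n_heads, dim // n_heads // 2, 2, dim).transpose(1, 2).reshape(dim, dim)
        # so undo it by swapping the (2, head_size // 2) axes back.
        return (w.reshape(n_heads, 2, dim // n_heads // 2, dim)
                 .transpose(0, 2, 1, 3)
                 .reshape(dim, dim))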

michaelroyzen commented 1 year ago

Great progress @cameronfr @Anychnn @jinluyang. I'm doubling the bounty to $6k to whoever can get this working and merged in.

void-main commented 1 year ago

Hey @michaelroyzen @cameronfr @Anychnn @jinluyang, I got a self-tested working version and opened a pull request with it. Could you guys please take a look? Any chance we could get it merged?

michaelroyzen commented 1 year ago

Nice! Works well so far in limited tests and is consistent with the Huggingface output using beam_size 1. One comment is that it should support max_position_embeddings (max_pos_seq_len in FT), but this is likely a simple change. Will continue testing and post the updates here.

ZhuYuJin commented 1 year ago

@michaelroyzen Does FT support LLaMA fine-tuned with LoRA? The training code is as follows: https://github.com/tloen/alpaca-lora/blob/main/finetune.py

ZhuYuJin commented 1 year ago

> @michaelroyzen Does FT support LLaMA fine-tuned with LoRA? The training code is as follows: https://github.com/tloen/alpaca-lora/blob/main/finetune.py

Using the merge_adapter interface, you can merge the LoRA weights into the original linear weights: https://github.com/huggingface/peft/blob/main/src/peft/tuners/lora.py#L279
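
A hedged sketch of that merge using peft's merge_and_unload(), which folds each LoRA delta (B·A times its scaling) into the corresponding base linear weight so the result can be converted to FT like a plain HF checkpoint; the model and adapter paths are placeholders:

    from transformers import LlamaForCausalLM
    from peft import PeftModel

    base = LlamaForCausalLM.from_pretrained("path/to/llama-7b-hf")    # placeholder base model
    lora = PeftModel.from_pretrained(base, "path/to/alpaca-lora")     # placeholder adapter
    merged = lora.merge_and_unload()  # merge LoRA deltas into the base linears
    merged.save_pretrained("path/to/llama-7b-merged")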

void-main commented 1 year ago

Hey community, here are some updates:

SupreethRao99 commented 1 year ago

Hey, a tutorial on how to run LLaMA with the FasterTransformer backend would be really helpful! I would be happy to contribute.

void-main commented 1 year ago

> Hey, a tutorial on how to run LLaMA with the FasterTransformer backend would be really helpful! I would be happy to contribute.

Sure, I will provide a step-by-step tutorial on how to run HF llama later.

Anychnn commented 1 year ago

@void-main I checked the llama-13b inference results on FasterTransformer; it takes about 8 seconds per request on an A100, and the greedy-search result is consistent with huggingface. Good job!

SupreethRao99 commented 1 year ago

That's quite fast @Anychnn. Could you briefly tell me the steps you took to get it running? Thanks!!

fmac2000 commented 1 year ago

I'd love to see quantization such as GPTQ. What amazing work, guys, thank you all! ❤️

void-main commented 1 year ago

> @void-main I checked the llama-13b inference results on FasterTransformer; it takes about 8 seconds per request on an A100, and the greedy-search result is consistent with huggingface. Good job!

Hi @Anychnn, how much time does it take per request on an A100 with the huggingface implementation?

teknium1 commented 1 year ago

Does FasterTransformer support quantization?

SupreethRao99 commented 1 year ago

There seem to be two versions of the huggingface LLaMA weights converter; the older one had issues with the BOS and EOS tokens, which the newer converter fixes. Which version of the LLaMA weights (converted using the v1 converter, or the v2 converter) does this PR work with? Thanks

void-main commented 1 year ago

> Does FasterTransformer support quantization?

I believe the best option we have is INT8 weight-only quantization, which is supported by FT (but not in the Llama implementation).

void-main commented 1 year ago

> There seem to be two versions of the huggingface LLaMA weights converter; the older one had issues with the BOS and EOS tokens, which the newer converter fixes. Which version of the LLaMA weights (converted using the v1 converter, or the v2 converter) does this PR work with? Thanks

@SupreethRao99 could you please point me to these converters?

SupreethRao99 commented 1 year ago

@void-main, the converters can be found here: https://github.com/huggingface/transformers/commits/main/src/transformers/models/llama/convert_llama_weights_to_hf.py. If we take a look at its commit history, there's a fix to the tokenizer on April 3rd, 2023; v1 seems to be the converter before that date and v2 the one after.

michaelroyzen commented 1 year ago

FT supports the newer converter @SupreethRao99

SupreethRao99 commented 1 year ago

Hey, I'm trying to run LLaMA with the fastertransformer backend on a Triton inference server. I am closely following this tutorial (https://towardsdatascience.com/deploy-your-local-gpt-server-with-triton-a825d528aa5d) and made the following changes.

I changed the Dockerfile in the fastertransformer-backend to:

# Copyright (c) 2021-2022, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

ARG TRITON_VERSION=23.04
ARG BASE_IMAGE=nvcr.io/nvidia/tritonserver:${TRITON_VERSION}-py3
FROM ${BASE_IMAGE}

RUN apt-get update
RUN apt-get install -y --no-install-recommends \
        autoconf \
        autogen \
        clangd \
        cmake \
        gdb \
        git-lfs \
        libb64-dev \
        libz-dev \
        locales-all \
        mosh \
        openssh-server \
        python3-dev \
        rapidjson-dev \
        sudo \
        tmux \
        unzip \
        xz-utils \
        zstd \
        zip \
        zsh
RUN pip install torch==1.13.1+cu117 --extra-index-url https://download.pytorch.org/whl/cu117 && \
    pip3 install --extra-index-url https://pypi.ngc.nvidia.com regex fire ipywidgets tritonclient[all] && \
    pip3 install transformers huggingface_hub tokenizers SentencePiece sacrebleu datasets tqdm omegaconf rouge_score && \
    pip3 install cmake==3.24.3

RUN apt-get clean && \
    rm -rf /var/lib/apt/lists/*

# backend build
ADD . /workspace/build/fastertransformer_backend
RUN mkdir -p /workspace/build/fastertransformer_backend/build

WORKDIR /workspace/build/fastertransformer_backend/build
ARG FORCE_BACKEND_REBUILD=0
RUN cmake \
      -D CMAKE_EXPORT_COMPILE_COMMANDS=1 \
      -D CMAKE_BUILD_TYPE=Release \
      -D ENABLE_FP8=OFF \
      -D CMAKE_INSTALL_PREFIX=/opt/tritonserver \
      -D TRITON_COMMON_REPO_TAG="r${NVIDIA_TRITON_SERVER_VERSION}" \
      -D TRITON_CORE_REPO_TAG="r${NVIDIA_TRITON_SERVER_VERSION}" \
      -D TRITON_BACKEND_REPO_TAG="r${NVIDIA_TRITON_SERVER_VERSION}" \
      ..
RUN cd _deps/repo-ft-src/ && \
    git log | head -n 3 2>&1 | tee /workspace/build/fastertransformer_backend/FT_version.txt && \
    cd /workspace/build/fastertransformer_backend/build && \
    make -j"$(grep -c ^processor /proc/cpuinfo)" install && \
    rm /workspace/build/fastertransformer_backend/build/bin/*_example -rf && \
    rm /workspace/build/fastertransformer_backend/build/lib/lib*Backend.so -rf

ENV NCCL_LAUNCH_MODE=GROUP
ENV WORKSPACE /workspace
WORKDIR /workspace

RUN sed -i 's/#X11UseLocalhost yes/X11UseLocalhost no/g' /etc/ssh/sshd_config && \
    mkdir /var/run/sshd -p

RUN ln -sf /usr/bin/python3.8 /usr/bin/python

I then cloned the FasterTransformer library and pulled in the new llama additions from the pull request associated with this issue, #575.

I then followed the repository instructions to the end and used the following config.pbtxt, adapted from the gpt-j example:

name: "fastertransformer"
backend: "fastertransformer"
default_model_filename: "vicuna-13b"
max_batch_size: 1024

model_transaction_policy {
  decoupled: False
}

input [
  {
    name: "input_ids"
    data_type: TYPE_UINT32
    dims: [ -1 ]
  },
  {
    name: "start_id"
    data_type: TYPE_UINT32
    dims: [ 1 ]
    reshape: { shape: [ ] }
    optional: true
  },
  {
    name: "end_id"
    data_type: TYPE_UINT32
    dims: [ 1 ]
    reshape: { shape: [ ] }
    optional: true
  },
  {
    name: "input_lengths"
    data_type: TYPE_UINT32
    dims: [ 1 ]
    reshape: { shape: [ ] }
  },
  {
    name: "request_output_len"
    data_type: TYPE_UINT32
    dims: [ -1 ]
  },
  {
    name: "runtime_top_k"
    data_type: TYPE_UINT32
    dims: [ 1 ]
    reshape: { shape: [ ] }
    optional: true
  },
  {
    name: "runtime_top_p"
    data_type: TYPE_FP32
    dims: [ 1 ]
    reshape: { shape: [ ] }
    optional: true
  },
  {
    name: "beam_search_diversity_rate"
    data_type: TYPE_FP32
    dims: [ 1 ]
    reshape: { shape: [ ] }
    optional: true
  },
  {
    name: "temperature"
    data_type: TYPE_FP32
    dims: [ 1 ]
    reshape: { shape: [ ] }
    optional: true
  },
  {
    name: "len_penalty"
    data_type: TYPE_FP32
    dims: [ 1 ]
    reshape: { shape: [ ] }
    optional: true
  },
  {
    name: "repetition_penalty"
    data_type: TYPE_FP32
    dims: [ 1 ]
    reshape: { shape: [ ] }
    optional: true
  },
  {
    name: "random_seed"
    data_type: TYPE_UINT64
    dims: [ 1 ]
    reshape: { shape: [ ] }
    optional: true
  },
  {
    name: "is_return_log_probs"
    data_type: TYPE_BOOL
    dims: [ 1 ]
    reshape: { shape: [ ] }
    optional: true
  },
  {
    name: "beam_width"
    data_type: TYPE_UINT32
    dims: [ 1 ]
    reshape: { shape: [ ] }
    optional: true
  },
  {
    name: "bad_words_list"
    data_type: TYPE_INT32
    dims: [ 2, -1 ]
    optional: true
  },
  {
    name: "stop_words_list"
    data_type: TYPE_INT32
    dims: [ 2, -1 ]
    optional: true
  },
  {
    name: "prompt_learning_task_name_ids"
    data_type: TYPE_UINT32
    dims: [ 1 ]
    reshape: { shape: [ ] }
    optional: true
  },
  {
    name: "top_p_decay"
    data_type: TYPE_FP32
    dims: [ 1 ]
    reshape: { shape: [ ] }
    optional: true
  },
  {
    name: "top_p_min"
    data_type: TYPE_FP32
    dims: [ 1 ]
    reshape: { shape: [ ] }
    optional: true
  },
  {
    name: "top_p_reset_ids"
    data_type: TYPE_UINT32
    dims: [ 1 ]
    reshape: { shape: [ ] }
    optional: true
  }
]
output [
  {
    name: "output_ids"
    data_type: TYPE_UINT32
    dims: [ -1, -1 ]
  },
  {
    name: "sequence_length"
    data_type: TYPE_UINT32
    dims: [ -1 ]
  },
  {
    name: "cum_log_probs"
    data_type: TYPE_FP32
    dims: [ -1 ]
  },
  {
    name: "output_log_probs"
    data_type: TYPE_FP32
    dims: [ -1, -1 ]
  }
]
instance_group [
  {
    count: 1
    kind: KIND_CPU
  }
]
parameters {
  key: "tensor_para_size"
  value: {
    string_value: "4"
  }
}
parameters {
  key: "pipeline_para_size"
  value: {
    string_value: "1"
  }
}
parameters {
  key: "data_type"
  value: {
    string_value: "fp16"
  }
}
parameters {
  key: "model_type"
  value: {
    string_value: "vicuna-13b"
  }
}
parameters {
  key: "model_checkpoint_path"
  value: {
    string_value: "vicuna-13b/4-gpu/"
  }
}
parameters {
  key: "enable_custom_all_reduce"
  value: {
    string_value: "0"
  }
}

I had previously converted the model to the fastertransformer format with the script provided as part of the PR. I then ran /opt/tritonserver/bin/tritonserver --model-repository=./vicuna-13b-ft and got the following error:

I0506 03:39:13.088086 14496 pinned_memory_manager.cc:240] Pinned memory pool is created at '0x7f0fe2000000' with size 268435456
I0506 03:39:13.091928 14496 cuda_memory_manager.cc:105] CUDA memory pool is created on device 0 with size 67108864
I0506 03:39:13.091940 14496 cuda_memory_manager.cc:105] CUDA memory pool is created on device 1 with size 67108864
I0506 03:39:13.091948 14496 cuda_memory_manager.cc:105] CUDA memory pool is created on device 2 with size 67108864
I0506 03:39:13.091955 14496 cuda_memory_manager.cc:105] CUDA memory pool is created on device 3 with size 67108864
W0506 03:39:13.556236 14496 server.cc:237] failed to enable peer access for some device pairs
E0506 03:39:13.568928 14496 model_repository_manager.cc:1245] Poll failed for model directory '4-gpu': Invalid model name: Could not determine backend for model '4-gpu' with no backend in model configuration. Expected model name of the form 'model.<backend_name>'.
I0506 03:39:13.568997 14496 server.cc:583] 
+------------------+------+
| Repository Agent | Path |
+------------------+------+
+------------------+------+

I0506 03:39:13.569016 14496 server.cc:610] 
+---------+------+--------+
| Backend | Path | Config |
+---------+------+--------+
+---------+------+--------+

I0506 03:39:13.569030 14496 server.cc:653] 
+-------+---------+--------+
| Model | Version | Status |
+-------+---------+--------+
+-------+---------+--------+

I0506 03:39:13.619927 14496 metrics.cc:808] Collecting metrics for GPU 0: Tesla T4
I0506 03:39:13.619983 14496 metrics.cc:808] Collecting metrics for GPU 1: Tesla T4
I0506 03:39:13.619996 14496 metrics.cc:808] Collecting metrics for GPU 2: Tesla T4
I0506 03:39:13.620008 14496 metrics.cc:808] Collecting metrics for GPU 3: Tesla T4
I0506 03:39:13.621315 14496 metrics.cc:701] Collecting CPU metrics
I0506 03:39:13.621599 14496 tritonserver.cc:2387] 
+----------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Option                           | Value                                                                                                                                                                |
+----------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| server_id                        | triton                                                                                                                                                               |
| server_version                   | 2.33.0                                                                                                                                                               |
| server_extensions                | classification sequence model_repository model_repository(unload_dependents) schedule_policy model_configuration system_shared_memory cuda_shared_memory binary_tens |
|                                  | or_data parameters statistics trace logging                                                                                                                          |
| model_repository_path[0]         | ./vicuna-13b-ft                                                                                                                                                      |
| model_control_mode               | MODE_NONE                                                                                                                                                            |
| strict_model_config              | 0                                                                                                                                                                    |
| rate_limit                       | OFF                                                                                                                                                                  |
| pinned_memory_pool_byte_size     | 268435456                                                                                                                                                            |
| cuda_memory_pool_byte_size{0}    | 67108864                                                                                                                                                             |
| cuda_memory_pool_byte_size{1}    | 67108864                                                                                                                                                             |
| cuda_memory_pool_byte_size{2}    | 67108864                                                                                                                                                             |
| cuda_memory_pool_byte_size{3}    | 67108864                                                                                                                                                             |
| min_supported_compute_capability | 6.0                                                                                                                                                                  |
| strict_readiness                 | 1                                                                                                                                                                    |
| exit_timeout                     | 30                                                                                                                                                                   |
| cache_enabled                    | 0                                                                                                                                                                    |
+----------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------+

I0506 03:39:13.621653 14496 server.cc:284] Waiting for in-flight requests to complete.
I0506 03:39:13.621662 14496 server.cc:300] Timeout 30: Found 0 model versions that have in-flight inferences
I0506 03:39:13.621669 14496 server.cc:315] All models are stopped, unloading models
I0506 03:39:13.621674 14496 server.cc:322] Timeout 30: Found 0 live models and 0 in-flight non-inference requests
error: creating server: Internal - failed to load all models

Is my process correct? Could anyone help me? Thank you!

Lzhang-hub commented 1 year ago

@SupreethRao99

> E0506 03:39:13.568928 14496 model_repository_manager.cc:1245] Poll failed for model directory '4-gpu': Invalid model name: Could not determine backend for model '4-gpu' with no backend in model configuration. Expected model name of the form 'model.<backend_name>'.

As the log shows, your directory layout may be wrong. Can you show the tree of ./vicuna-13b-ft?
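
For reference, Triton treats every top-level entry in the model repository as a model directory, each of which needs a config.pbtxt and a numeric version subdirectory. A hedged sketch of a layout consistent with the config.pbtxt above (the exact weight location just has to match model_checkpoint_path):

    vicuna-13b-ft/                 # passed to --model-repository
        fastertransformer/         # one directory per model, named after the model
            config.pbtxt
            1/                     # numeric version directory
                4-gpu/             # FT weights; model_checkpoint_path points here
                    config.ini
                    model.layers.*.bin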

SupreethRao99 commented 1 year ago

Sure, this is the content of ./vicuna-13b-ft:

4-gpu
config.pbtxt

Inside 4-gpu I have these files:

config.ini                                              model.layers.27.attention.dense.weight.0.bin
model.final_layernorm.weight.bin                        model.layers.27.attention.dense.weight.1.bin
model.layers.0.attention.dense.weight.0.bin             model.layers.27.attention.dense.weight.2.bin
model.layers.0.attention.dense.weight.1.bin             model.layers.27.attention.dense.weight.3.bin
model.layers.0.attention.dense.weight.2.bin             model.layers.27.attention.query_key_value.weight.0.bin
model.layers.0.attention.dense.weight.3.bin             model.layers.27.attention.query_key_value.weight.1.bin
model.layers.0.attention.query_key_value.weight.0.bin   model.layers.27.attention.query_key_value.weight.2.bin
model.layers.0.attention.query_key_value.weight.1.bin   model.layers.27.attention.query_key_value.weight.3.bin
model.layers.0.attention.query_key_value.weight.2.bin   model.layers.27.input_layernorm.weight.bin
model.layers.0.attention.query_key_value.weight.3.bin   model.layers.27.mlp.down_proj.weight.0.bin
model.layers.0.input_layernorm.weight.bin               model.layers.27.mlp.down_proj.weight.1.bin
model.layers.0.mlp.down_proj.weight.0.bin               model.layers.27.mlp.down_proj.weight.2.bin
model.layers.0.mlp.down_proj.weight.1.bin               model.layers.27.mlp.down_proj.weight.3.bin
model.layers.0.mlp.down_proj.weight.2.bin               model.layers.27.mlp.gate_proj.weight.0.bin
model.layers.0.mlp.down_proj.weight.3.bin               model.layers.27.mlp.gate_proj.weight.1.bin
model.layers.0.mlp.gate_proj.weight.0.bin               model.layers.27.mlp.gate_proj.weight.2.bin
model.layers.0.mlp.gate_proj.weight.1.bin               model.layers.27.mlp.gate_proj.weight.3.bin
model.layers.0.mlp.gate_proj.weight.2.bin               model.layers.27.mlp.up_proj.weight.0.bin
model.layers.0.mlp.gate_proj.weight.3.bin               model.layers.27.mlp.up_proj.weight.1.bin
model.layers.0.mlp.up_proj.weight.0.bin                 model.layers.27.mlp.up_proj.weight.2.bin
model.layers.0.mlp.up_proj.weight.1.bin                 model.layers.27.mlp.up_proj.weight.3.bin
model.layers.0.mlp.up_proj.weight.2.bin                 model.layers.27.post_attention_layernorm.weight.bin
model.layers.0.mlp.up_proj.weight.3.bin                 model.layers.28.attention.dense.weight.0.bin
model.layers.0.post_attention_layernorm.weight.bin      model.layers.28.attention.dense.weight.1.bin
model.layers.1.attention.dense.weight.0.bin             model.layers.28.attention.dense.weight.2.bin
model.layers.1.attention.dense.weight.1.bin             model.layers.28.attention.dense.weight.3.bin
model.layers.1.attention.dense.weight.2.bin             model.layers.28.attention.query_key_value.weight.0.bin
model.layers.1.attention.dense.weight.3.bin             model.layers.28.attention.query_key_value.weight.1.bin
model.layers.1.attention.query_key_value.weight.0.bin   model.layers.28.attention.query_key_value.weight.2.bin
model.layers.1.attention.query_key_value.weight.1.bin   model.layers.28.attention.query_key_value.weight.3.bin
model.layers.1.attention.query_key_value.weight.2.bin   model.layers.28.input_layernorm.weight.bin
model.layers.1.attention.query_key_value.weight.3.bin   model.layers.28.mlp.down_proj.weight.0.bin
model.layers.1.input_layernorm.weight.bin               model.layers.28.mlp.down_proj.weight.1.bin
model.layers.1.mlp.down_proj.weight.0.bin               model.layers.28.mlp.down_proj.weight.2.bin
model.layers.1.mlp.down_proj.weight.1.bin               model.layers.28.mlp.down_proj.weight.3.bin
model.layers.1.mlp.down_proj.weight.2.bin               model.layers.28.mlp.gate_proj.weight.0.bin
model.layers.1.mlp.down_proj.weight.3.bin               model.layers.28.mlp.gate_proj.weight.1.bin
model.layers.1.mlp.gate_proj.weight.0.bin               model.layers.28.mlp.gate_proj.weight.2.bin
model.layers.1.mlp.gate_proj.weight.1.bin               model.layers.28.mlp.gate_proj.weight.3.bin
model.layers.1.mlp.gate_proj.weight.2.bin               model.layers.28.mlp.up_proj.weight.0.bin
model.layers.1.mlp.gate_proj.weight.3.bin               model.layers.28.mlp.up_proj.weight.1.bin
model.layers.1.mlp.up_proj.weight.0.bin                 model.layers.28.mlp.up_proj.weight.2.bin
model.layers.1.mlp.up_proj.weight.1.bin                 model.layers.28.mlp.up_proj.weight.3.bin
model.layers.1.mlp.up_proj.weight.2.bin                 model.layers.28.post_attention_layernorm.weight.bin
model.layers.1.mlp.up_proj.weight.3.bin                 model.layers.29.attention.dense.weight.0.bin
model.layers.1.post_attention_layernorm.weight.bin      model.layers.29.attention.dense.weight.1.bin
model.layers.10.attention.dense.weight.0.bin            model.layers.29.attention.dense.weight.2.bin
model.layers.10.attention.dense.weight.1.bin            model.layers.29.attention.dense.weight.3.bin
model.layers.10.attention.dense.weight.2.bin            model.layers.29.attention.query_key_value.weight.0.bin
model.layers.10.attention.dense.weight.3.bin            model.layers.29.attention.query_key_value.weight.1.bin
model.layers.10.attention.query_key_value.weight.0.bin  model.layers.29.attention.query_key_value.weight.2.bin
model.layers.10.attention.query_key_value.weight.1.bin  model.layers.29.attention.query_key_value.weight.3.bin
model.layers.10.attention.query_key_value.weight.2.bin  model.layers.29.input_layernorm.weight.bin
model.layers.10.attention.query_key_value.weight.3.bin  model.layers.29.mlp.down_proj.weight.0.bin
model.layers.10.input_layernorm.weight.bin              model.layers.29.mlp.down_proj.weight.1.bin
model.layers.10.mlp.down_proj.weight.0.bin              model.layers.29.mlp.down_proj.weight.2.bin
model.layers.10.mlp.down_proj.weight.1.bin              model.layers.29.mlp.down_proj.weight.3.bin
model.layers.10.mlp.down_proj.weight.2.bin              model.layers.29.mlp.gate_proj.weight.0.bin
model.layers.10.mlp.down_proj.weight.3.bin              model.layers.29.mlp.gate_proj.weight.1.bin
model.layers.10.mlp.gate_proj.weight.0.bin              model.layers.29.mlp.gate_proj.weight.2.bin
model.layers.10.mlp.gate_proj.weight.1.bin              model.layers.29.mlp.gate_proj.weight.3.bin
model.layers.10.mlp.gate_proj.weight.2.bin              model.layers.29.mlp.up_proj.weight.0.bin
model.layers.10.mlp.gate_proj.weight.3.bin              model.layers.29.mlp.up_proj.weight.1.bin
model.layers.10.mlp.up_proj.weight.0.bin                model.layers.29.mlp.up_proj.weight.2.bin
model.layers.10.mlp.up_proj.weight.1.bin                model.layers.29.mlp.up_proj.weight.3.bin
model.layers.10.mlp.up_proj.weight.2.bin                model.layers.29.post_attention_layernorm.weight.bin
model.layers.10.mlp.up_proj.weight.3.bin                model.layers.3.attention.dense.weight.0.bin
model.layers.10.post_attention_layernorm.weight.bin     model.layers.3.attention.dense.weight.1.bin
model.layers.11.attention.dense.weight.0.bin            model.layers.3.attention.dense.weight.2.bin
model.layers.11.attention.dense.weight.1.bin            model.layers.3.attention.dense.weight.3.bin
model.layers.11.attention.dense.weight.2.bin            model.layers.3.attention.query_key_value.weight.0.bin
model.layers.11.attention.dense.weight.3.bin            model.layers.3.attention.query_key_value.weight.1.bin
model.layers.11.attention.query_key_value.weight.0.bin  model.layers.3.attention.query_key_value.weight.2.bin
model.layers.11.attention.query_key_value.weight.1.bin  model.layers.3.attention.query_key_value.weight.3.bin
model.layers.11.attention.query_key_value.weight.2.bin  model.layers.3.input_layernorm.weight.bin
model.layers.11.attention.query_key_value.weight.3.bin  model.layers.3.mlp.down_proj.weight.0.bin
model.layers.11.input_layernorm.weight.bin              model.layers.3.mlp.down_proj.weight.1.bin
model.layers.11.mlp.down_proj.weight.0.bin              model.layers.3.mlp.down_proj.weight.2.bin
model.layers.11.mlp.down_proj.weight.1.bin              model.layers.3.mlp.down_proj.weight.3.bin
model.layers.11.mlp.down_proj.weight.2.bin              model.layers.3.mlp.gate_proj.weight.0.bin
model.layers.11.mlp.down_proj.weight.3.bin              model.layers.3.mlp.gate_proj.weight.1.bin
model.layers.11.mlp.gate_proj.weight.0.bin              model.layers.3.mlp.gate_proj.weight.2.bin
model.layers.11.mlp.gate_proj.weight.1.bin              model.layers.3.mlp.gate_proj.weight.3.bin
model.layers.11.mlp.gate_proj.weight.2.bin              model.layers.3.mlp.up_proj.weight.0.bin
model.layers.11.mlp.gate_proj.weight.3.bin              model.layers.3.mlp.up_proj.weight.1.bin
model.layers.11.mlp.up_proj.weight.0.bin                model.layers.3.mlp.up_proj.weight.2.bin
model.layers.11.mlp.up_proj.weight.1.bin                model.layers.3.mlp.up_proj.weight.3.bin
model.layers.11.mlp.up_proj.weight.2.bin                model.layers.3.post_attention_layernorm.weight.bin
model.layers.11.mlp.up_proj.weight.3.bin                model.layers.30.attention.dense.weight.0.bin
model.layers.11.post_attention_layernorm.weight.bin     model.layers.30.attention.dense.weight.1.bin
model.layers.12.attention.dense.weight.0.bin            model.layers.30.attention.dense.weight.2.bin
model.layers.12.attention.dense.weight.1.bin            model.layers.30.attention.dense.weight.3.bin
model.layers.12.attention.dense.weight.2.bin            model.layers.30.attention.query_key_value.weight.0.bin
model.layers.12.attention.dense.weight.3.bin            model.layers.30.attention.query_key_value.weight.1.bin
model.layers.12.attention.query_key_value.weight.0.bin  model.layers.30.attention.query_key_value.weight.2.bin
model.layers.12.attention.query_key_value.weight.1.bin  model.layers.30.attention.query_key_value.weight.3.bin
model.layers.12.attention.query_key_value.weight.2.bin  model.layers.30.input_layernorm.weight.bin
model.layers.12.attention.query_key_value.weight.3.bin  model.layers.30.mlp.down_proj.weight.0.bin
model.layers.12.input_layernorm.weight.bin              model.layers.30.mlp.down_proj.weight.1.bin
model.layers.12.mlp.down_proj.weight.0.bin              model.layers.30.mlp.down_proj.weight.2.bin
model.layers.12.mlp.down_proj.weight.1.bin              model.layers.30.mlp.down_proj.weight.3.bin
model.layers.12.mlp.down_proj.weight.2.bin              model.layers.30.mlp.gate_proj.weight.0.bin
model.layers.12.mlp.down_proj.weight.3.bin              model.layers.30.mlp.gate_proj.weight.1.bin
model.layers.12.mlp.gate_proj.weight.0.bin              model.layers.30.mlp.gate_proj.weight.2.bin
model.layers.12.mlp.gate_proj.weight.1.bin              model.layers.30.mlp.gate_proj.weight.3.bin
model.layers.12.mlp.gate_proj.weight.2.bin              model.layers.30.mlp.up_proj.weight.0.bin
model.layers.12.mlp.gate_proj.weight.3.bin              model.layers.30.mlp.up_proj.weight.1.bin
model.layers.12.mlp.up_proj.weight.0.bin                model.layers.30.mlp.up_proj.weight.2.bin
model.layers.12.mlp.up_proj.weight.1.bin                model.layers.30.mlp.up_proj.weight.3.bin
model.layers.12.mlp.up_proj.weight.2.bin                model.layers.30.post_attention_layernorm.weight.bin
model.layers.12.mlp.up_proj.weight.3.bin                model.layers.31.attention.dense.weight.0.bin
model.layers.12.post_attention_layernorm.weight.bin     model.layers.31.attention.dense.weight.1.bin
model.layers.13.attention.dense.weight.0.bin            model.layers.31.attention.dense.weight.2.bin
model.layers.13.attention.dense.weight.1.bin            model.layers.31.attention.dense.weight.3.bin
model.layers.13.attention.dense.weight.2.bin            model.layers.31.attention.query_key_value.weight.0.bin
model.layers.13.attention.dense.weight.3.bin            model.layers.31.attention.query_key_value.weight.1.bin
model.layers.13.attention.query_key_value.weight.0.bin  model.layers.31.attention.query_key_value.weight.2.bin
model.layers.13.attention.query_key_value.weight.1.bin  model.layers.31.attention.query_key_value.weight.3.bin
model.layers.13.attention.query_key_value.weight.2.bin  model.layers.31.input_layernorm.weight.bin
model.layers.13.attention.query_key_value.weight.3.bin  model.layers.31.mlp.down_proj.weight.0.bin
model.layers.13.input_layernorm.weight.bin              model.layers.31.mlp.down_proj.weight.1.bin
model.layers.13.mlp.down_proj.weight.0.bin              model.layers.31.mlp.down_proj.weight.2.bin
model.layers.13.mlp.down_proj.weight.1.bin              model.layers.31.mlp.down_proj.weight.3.bin
model.layers.13.mlp.down_proj.weight.2.bin              model.layers.31.mlp.gate_proj.weight.0.bin
model.layers.13.mlp.down_proj.weight.3.bin              model.layers.31.mlp.gate_proj.weight.1.bin
model.layers.13.mlp.gate_proj.weight.0.bin              model.layers.31.mlp.gate_proj.weight.2.bin
model.layers.13.mlp.gate_proj.weight.1.bin              model.layers.31.mlp.gate_proj.weight.3.bin
model.layers.13.mlp.gate_proj.weight.2.bin              model.layers.31.mlp.up_proj.weight.0.bin
model.layers.13.mlp.gate_proj.weight.3.bin              model.layers.31.mlp.up_proj.weight.1.bin
model.layers.13.mlp.up_proj.weight.0.bin                model.layers.31.mlp.up_proj.weight.2.bin
model.layers.13.mlp.up_proj.weight.1.bin                model.layers.31.mlp.up_proj.weight.3.bin
model.layers.13.mlp.up_proj.weight.2.bin                model.layers.31.post_attention_layernorm.weight.bin
model.layers.13.mlp.up_proj.weight.3.bin                model.layers.32.attention.dense.weight.0.bin
model.layers.13.post_attention_layernorm.weight.bin     model.layers.32.attention.dense.weight.1.bin
model.layers.14.attention.dense.weight.0.bin            model.layers.32.attention.dense.weight.2.bin
model.layers.14.attention.dense.weight.1.bin            model.layers.32.attention.dense.weight.3.bin
model.layers.14.attention.dense.weight.2.bin            model.layers.32.attention.query_key_value.weight.0.bin
model.layers.14.attention.dense.weight.3.bin            model.layers.32.attention.query_key_value.weight.1.bin
model.layers.14.attention.query_key_value.weight.0.bin  model.layers.32.attention.query_key_value.weight.2.bin
model.layers.14.attention.query_key_value.weight.1.bin  model.layers.32.attention.query_key_value.weight.3.bin
model.layers.14.attention.query_key_value.weight.2.bin  model.layers.32.input_layernorm.weight.bin
model.layers.14.attention.query_key_value.weight.3.bin  model.layers.32.mlp.down_proj.weight.0.bin
model.layers.14.input_layernorm.weight.bin              model.layers.32.mlp.down_proj.weight.1.bin
model.layers.14.mlp.down_proj.weight.0.bin              model.layers.32.mlp.down_proj.weight.2.bin
model.layers.14.mlp.down_proj.weight.1.bin              model.layers.32.mlp.down_proj.weight.3.bin
model.layers.14.mlp.down_proj.weight.2.bin              model.layers.32.mlp.gate_proj.weight.0.bin
model.layers.14.mlp.down_proj.weight.3.bin              model.layers.32.mlp.gate_proj.weight.1.bin
model.layers.14.mlp.gate_proj.weight.0.bin              model.layers.32.mlp.gate_proj.weight.2.bin
model.layers.14.mlp.gate_proj.weight.1.bin              model.layers.32.mlp.gate_proj.weight.3.bin
model.layers.14.mlp.gate_proj.weight.2.bin              model.layers.32.mlp.up_proj.weight.0.bin
model.layers.14.mlp.gate_proj.weight.3.bin              model.layers.32.mlp.up_proj.weight.1.bin
model.layers.14.mlp.up_proj.weight.0.bin                model.layers.32.mlp.up_proj.weight.2.bin
model.layers.14.mlp.up_proj.weight.1.bin                model.layers.32.mlp.up_proj.weight.3.bin
model.layers.14.mlp.up_proj.weight.2.bin                model.layers.32.post_attention_layernorm.weight.bin
model.layers.14.mlp.up_proj.weight.3.bin                model.layers.33.attention.dense.weight.0.bin
model.layers.14.post_attention_layernorm.weight.bin     model.layers.33.attention.dense.weight.1.bin
model.layers.15.attention.dense.weight.0.bin            model.layers.33.attention.dense.weight.2.bin
model.layers.15.attention.dense.weight.1.bin            model.layers.33.attention.dense.weight.3.bin
model.layers.15.attention.dense.weight.2.bin            model.layers.33.attention.query_key_value.weight.0.bin
model.layers.15.attention.dense.weight.3.bin            model.layers.33.attention.query_key_value.weight.1.bin
model.layers.15.attention.query_key_value.weight.0.bin  model.layers.33.attention.query_key_value.weight.2.bin
model.layers.15.attention.query_key_value.weight.1.bin  model.layers.33.attention.query_key_value.weight.3.bin
model.layers.15.attention.query_key_value.weight.2.bin  model.layers.33.input_layernorm.weight.bin
model.layers.15.attention.query_key_value.weight.3.bin  model.layers.33.mlp.down_proj.weight.0.bin
model.layers.15.input_layernorm.weight.bin              model.layers.33.mlp.down_proj.weight.1.bin
model.layers.15.mlp.down_proj.weight.0.bin              model.layers.33.mlp.down_proj.weight.2.bin
model.layers.15.mlp.down_proj.weight.1.bin              model.layers.33.mlp.down_proj.weight.3.bin
model.layers.15.mlp.down_proj.weight.2.bin              model.layers.33.mlp.gate_proj.weight.0.bin
model.layers.15.mlp.down_proj.weight.3.bin              model.layers.33.mlp.gate_proj.weight.1.bin
model.layers.15.mlp.gate_proj.weight.0.bin              model.layers.33.mlp.gate_proj.weight.2.bin
model.layers.15.mlp.gate_proj.weight.1.bin              model.layers.33.mlp.gate_proj.weight.3.bin
model.layers.15.mlp.gate_proj.weight.2.bin              model.layers.33.mlp.up_proj.weight.0.bin
model.layers.15.mlp.gate_proj.weight.3.bin              model.layers.33.mlp.up_proj.weight.1.bin
model.layers.15.mlp.up_proj.weight.0.bin                model.layers.33.mlp.up_proj.weight.2.bin
model.layers.15.mlp.up_proj.weight.1.bin                model.layers.33.mlp.up_proj.weight.3.bin
model.layers.15.mlp.up_proj.weight.2.bin                model.layers.33.post_attention_layernorm.weight.bin
model.layers.15.mlp.up_proj.weight.3.bin                model.layers.34.attention.dense.weight.0.bin
model.layers.15.post_attention_layernorm.weight.bin     model.layers.34.attention.dense.weight.1.bin
model.layers.16.attention.dense.weight.0.bin            model.layers.34.attention.dense.weight.2.bin
model.layers.16.attention.dense.weight.1.bin            model.layers.34.attention.dense.weight.3.bin
model.layers.16.attention.dense.weight.2.bin            model.layers.34.attention.query_key_value.weight.0.bin
model.layers.16.attention.dense.weight.3.bin            model.layers.34.attention.query_key_value.weight.1.bin
model.layers.16.attention.query_key_value.weight.0.bin  model.layers.34.attention.query_key_value.weight.2.bin
model.layers.16.attention.query_key_value.weight.1.bin  model.layers.34.attention.query_key_value.weight.3.bin
model.layers.16.attention.query_key_value.weight.2.bin  model.layers.34.input_layernorm.weight.bin
model.layers.16.attention.query_key_value.weight.3.bin  model.layers.34.mlp.down_proj.weight.0.bin
model.layers.16.input_layernorm.weight.bin              model.layers.34.mlp.down_proj.weight.1.bin
model.layers.16.mlp.down_proj.weight.0.bin              model.layers.34.mlp.down_proj.weight.2.bin
model.layers.16.mlp.down_proj.weight.1.bin              model.layers.34.mlp.down_proj.weight.3.bin
model.layers.16.mlp.down_proj.weight.2.bin              model.layers.34.mlp.gate_proj.weight.0.bin
model.layers.16.mlp.down_proj.weight.3.bin              model.layers.34.mlp.gate_proj.weight.1.bin
model.layers.16.mlp.gate_proj.weight.0.bin              model.layers.34.mlp.gate_proj.weight.2.bin
model.layers.16.mlp.gate_proj.weight.1.bin              model.layers.34.mlp.gate_proj.weight.3.bin
model.layers.16.mlp.gate_proj.weight.2.bin              model.layers.34.mlp.up_proj.weight.0.bin
model.layers.16.mlp.gate_proj.weight.3.bin              model.layers.34.mlp.up_proj.weight.1.bin
model.layers.16.mlp.up_proj.weight.0.bin                model.layers.34.mlp.up_proj.weight.2.bin
model.layers.16.mlp.up_proj.weight.1.bin                model.layers.34.mlp.up_proj.weight.3.bin
model.layers.16.mlp.up_proj.weight.2.bin                model.layers.34.post_attention_layernorm.weight.bin
model.layers.16.mlp.up_proj.weight.3.bin                model.layers.35.attention.dense.weight.0.bin
model.layers.16.post_attention_layernorm.weight.bin     model.layers.35.attention.dense.weight.1.bin
model.layers.17.attention.dense.weight.0.bin            model.layers.35.attention.dense.weight.2.bin
model.layers.17.attention.dense.weight.1.bin            model.layers.35.attention.dense.weight.3.bin
model.layers.17.attention.dense.weight.2.bin            model.layers.35.attention.query_key_value.weight.0.bin
model.layers.17.attention.dense.weight.3.bin            model.layers.35.attention.query_key_value.weight.1.bin
model.layers.17.attention.query_key_value.weight.0.bin  model.layers.35.attention.query_key_value.weight.2.bin
model.layers.17.attention.query_key_value.weight.1.bin  model.layers.35.attention.query_key_value.weight.3.bin
model.layers.17.attention.query_key_value.weight.2.bin  model.layers.35.input_layernorm.weight.bin
model.layers.17.attention.query_key_value.weight.3.bin  model.layers.35.mlp.down_proj.weight.0.bin
model.layers.17.input_layernorm.weight.bin              model.layers.35.mlp.down_proj.weight.1.bin
model.layers.17.mlp.down_proj.weight.0.bin              model.layers.35.mlp.down_proj.weight.2.bin
model.layers.17.mlp.down_proj.weight.1.bin              model.layers.35.mlp.down_proj.weight.3.bin
model.layers.17.mlp.down_proj.weight.2.bin              model.layers.35.mlp.gate_proj.weight.0.bin
model.layers.17.mlp.down_proj.weight.3.bin              model.layers.35.mlp.gate_proj.weight.1.bin
model.layers.17.mlp.gate_proj.weight.0.bin              model.layers.35.mlp.gate_proj.weight.2.bin
model.layers.17.mlp.gate_proj.weight.1.bin              model.layers.35.mlp.gate_proj.weight.3.bin
model.layers.17.mlp.gate_proj.weight.2.bin              model.layers.35.mlp.up_proj.weight.0.bin
model.layers.17.mlp.gate_proj.weight.3.bin              model.layers.35.mlp.up_proj.weight.1.bin
model.layers.17.mlp.up_proj.weight.0.bin                model.layers.35.mlp.up_proj.weight.2.bin
model.layers.17.mlp.up_proj.weight.1.bin                model.layers.35.mlp.up_proj.weight.3.bin
model.layers.17.mlp.up_proj.weight.2.bin                model.layers.35.post_attention_layernorm.weight.bin
model.layers.17.mlp.up_proj.weight.3.bin                model.layers.36.attention.dense.weight.0.bin
model.layers.17.post_attention_layernorm.weight.bin     model.layers.36.attention.dense.weight.1.bin
model.layers.18.attention.dense.weight.0.bin            model.layers.36.attention.dense.weight.2.bin
model.layers.18.attention.dense.weight.1.bin            model.layers.36.attention.dense.weight.3.bin
model.layers.18.attention.dense.weight.2.bin            model.layers.36.attention.query_key_value.weight.0.bin
model.layers.18.attention.dense.weight.3.bin            model.layers.36.attention.query_key_value.weight.1.bin
model.layers.18.attention.query_key_value.weight.0.bin  model.layers.36.attention.query_key_value.weight.2.bin
model.layers.18.attention.query_key_value.weight.1.bin  model.layers.36.attention.query_key_value.weight.3.bin
model.layers.18.attention.query_key_value.weight.2.bin  model.layers.36.input_layernorm.weight.bin
model.layers.18.attention.query_key_value.weight.3.bin  model.layers.36.mlp.down_proj.weight.0.bin
model.layers.18.input_layernorm.weight.bin              model.layers.36.mlp.down_proj.weight.1.bin
model.layers.18.mlp.down_proj.weight.0.bin              model.layers.36.mlp.down_proj.weight.2.bin
model.layers.18.mlp.down_proj.weight.1.bin              model.layers.36.mlp.down_proj.weight.3.bin
model.layers.18.mlp.down_proj.weight.2.bin              model.layers.36.mlp.gate_proj.weight.0.bin
model.layers.18.mlp.down_proj.weight.3.bin              model.layers.36.mlp.gate_proj.weight.1.bin
model.layers.18.mlp.gate_proj.weight.0.bin              model.layers.36.mlp.gate_proj.weight.2.bin
model.layers.18.mlp.gate_proj.weight.1.bin              model.layers.36.mlp.gate_proj.weight.3.bin
model.layers.18.mlp.gate_proj.weight.2.bin              model.layers.36.mlp.up_proj.weight.0.bin
model.layers.18.mlp.gate_proj.weight.3.bin              model.layers.36.mlp.up_proj.weight.1.bin
model.layers.18.mlp.up_proj.weight.0.bin                model.layers.36.mlp.up_proj.weight.2.bin
model.layers.18.mlp.up_proj.weight.1.bin                model.layers.36.mlp.up_proj.weight.3.bin
model.layers.18.mlp.up_proj.weight.2.bin                model.layers.36.post_attention_layernorm.weight.bin
model.layers.18.mlp.up_proj.weight.3.bin                model.layers.37.attention.dense.weight.0.bin
model.layers.18.post_attention_layernorm.weight.bin     model.layers.37.attention.dense.weight.1.bin
model.layers.19.attention.dense.weight.0.bin            model.layers.37.attention.dense.weight.2.bin
model.layers.19.attention.dense.weight.1.bin            model.layers.37.attention.dense.weight.3.bin
model.layers.19.attention.dense.weight.2.bin            model.layers.37.attention.query_key_value.weight.0.bin
model.layers.19.attention.dense.weight.3.bin            model.layers.37.attention.query_key_value.weight.1.bin
model.layers.19.attention.query_key_value.weight.0.bin  model.layers.37.attention.query_key_value.weight.2.bin
model.layers.19.attention.query_key_value.weight.1.bin  model.layers.37.attention.query_key_value.weight.3.bin
model.layers.19.attention.query_key_value.weight.2.bin  model.layers.37.input_layernorm.weight.bin
model.layers.19.attention.query_key_value.weight.3.bin  model.layers.37.mlp.down_proj.weight.0.bin
model.layers.19.input_layernorm.weight.bin              model.layers.37.mlp.down_proj.weight.1.bin
model.layers.19.mlp.down_proj.weight.0.bin              model.layers.37.mlp.down_proj.weight.2.bin
model.layers.19.mlp.down_proj.weight.1.bin              model.layers.37.mlp.down_proj.weight.3.bin
model.layers.19.mlp.down_proj.weight.2.bin              model.layers.37.mlp.gate_proj.weight.0.bin
model.layers.19.mlp.down_proj.weight.3.bin              model.layers.37.mlp.gate_proj.weight.1.bin
model.layers.19.mlp.gate_proj.weight.0.bin              model.layers.37.mlp.gate_proj.weight.2.bin
model.layers.19.mlp.gate_proj.weight.1.bin              model.layers.37.mlp.gate_proj.weight.3.bin
model.layers.19.mlp.gate_proj.weight.2.bin              model.layers.37.mlp.up_proj.weight.0.bin
model.layers.19.mlp.gate_proj.weight.3.bin              model.layers.37.mlp.up_proj.weight.1.bin
model.layers.19.mlp.up_proj.weight.0.bin                model.layers.37.mlp.up_proj.weight.2.bin
model.layers.19.mlp.up_proj.weight.1.bin                model.layers.37.mlp.up_proj.weight.3.bin
model.layers.19.mlp.up_proj.weight.2.bin                model.layers.37.post_attention_layernorm.weight.bin
model.layers.19.mlp.up_proj.weight.3.bin                model.layers.38.attention.dense.weight.0.bin
model.layers.19.post_attention_layernorm.weight.bin     model.layers.38.attention.dense.weight.1.bin
model.layers.2.attention.dense.weight.0.bin             model.layers.38.attention.dense.weight.2.bin
model.layers.2.attention.dense.weight.1.bin             model.layers.38.attention.dense.weight.3.bin
model.layers.2.attention.dense.weight.2.bin             model.layers.38.attention.query_key_value.weight.0.bin
model.layers.2.attention.dense.weight.3.bin             model.layers.38.attention.query_key_value.weight.1.bin
model.layers.2.attention.query_key_value.weight.0.bin   model.layers.38.attention.query_key_value.weight.2.bin
model.layers.2.attention.query_key_value.weight.1.bin   model.layers.38.attention.query_key_value.weight.3.bin
model.layers.2.attention.query_key_value.weight.2.bin   model.layers.38.input_layernorm.weight.bin
model.layers.2.attention.query_key_value.weight.3.bin   model.layers.38.mlp.down_proj.weight.0.bin
model.layers.2.input_layernorm.weight.bin               model.layers.38.mlp.down_proj.weight.1.bin
model.layers.2.mlp.down_proj.weight.0.bin               model.layers.38.mlp.down_proj.weight.2.bin
model.layers.2.mlp.down_proj.weight.1.bin               model.layers.38.mlp.down_proj.weight.3.bin
model.layers.2.mlp.down_proj.weight.2.bin               model.layers.38.mlp.gate_proj.weight.0.bin
model.layers.2.mlp.down_proj.weight.3.bin               model.layers.38.mlp.gate_proj.weight.1.bin
model.layers.2.mlp.gate_proj.weight.0.bin               model.layers.38.mlp.gate_proj.weight.2.bin
model.layers.2.mlp.gate_proj.weight.1.bin               model.layers.38.mlp.gate_proj.weight.3.bin
model.layers.2.mlp.gate_proj.weight.2.bin               model.layers.38.mlp.up_proj.weight.0.bin
model.layers.2.mlp.gate_proj.weight.3.bin               model.layers.38.mlp.up_proj.weight.1.bin
model.layers.2.mlp.up_proj.weight.0.bin                 model.layers.38.mlp.up_proj.weight.2.bin
model.layers.2.mlp.up_proj.weight.1.bin                 model.layers.38.mlp.up_proj.weight.3.bin
model.layers.2.mlp.up_proj.weight.2.bin                 model.layers.38.post_attention_layernorm.weight.bin
model.layers.2.mlp.up_proj.weight.3.bin                 model.layers.39.attention.dense.weight.0.bin
model.layers.2.post_attention_layernorm.weight.bin      model.layers.39.attention.dense.weight.1.bin
model.layers.20.attention.dense.weight.0.bin            model.layers.39.attention.dense.weight.2.bin
model.layers.20.attention.dense.weight.1.bin            model.layers.39.attention.dense.weight.3.bin
model.layers.20.attention.dense.weight.2.bin            model.layers.39.attention.query_key_value.weight.0.bin
model.layers.20.attention.dense.weight.3.bin            model.layers.39.attention.query_key_value.weight.1.bin
model.layers.20.attention.query_key_value.weight.0.bin  model.layers.39.attention.query_key_value.weight.2.bin
model.layers.20.attention.query_key_value.weight.1.bin  model.layers.39.attention.query_key_value.weight.3.bin
model.layers.20.attention.query_key_value.weight.2.bin  model.layers.39.input_layernorm.weight.bin
model.layers.20.attention.query_key_value.weight.3.bin  model.layers.39.mlp.down_proj.weight.0.bin
model.layers.20.input_layernorm.weight.bin              model.layers.39.mlp.down_proj.weight.1.bin
model.layers.20.mlp.down_proj.weight.0.bin              model.layers.39.mlp.down_proj.weight.2.bin
model.layers.20.mlp.down_proj.weight.1.bin              model.layers.39.mlp.down_proj.weight.3.bin
model.layers.20.mlp.down_proj.weight.2.bin              model.layers.39.mlp.gate_proj.weight.0.bin
model.layers.20.mlp.down_proj.weight.3.bin              model.layers.39.mlp.gate_proj.weight.1.bin
model.layers.20.mlp.gate_proj.weight.0.bin              model.layers.39.mlp.gate_proj.weight.2.bin
model.layers.20.mlp.gate_proj.weight.1.bin              model.layers.39.mlp.gate_proj.weight.3.bin
model.layers.20.mlp.gate_proj.weight.2.bin              model.layers.39.mlp.up_proj.weight.0.bin
model.layers.20.mlp.gate_proj.weight.3.bin              model.layers.39.mlp.up_proj.weight.1.bin
model.layers.20.mlp.up_proj.weight.0.bin                model.layers.39.mlp.up_proj.weight.2.bin
model.layers.20.mlp.up_proj.weight.1.bin                model.layers.39.mlp.up_proj.weight.3.bin
model.layers.20.mlp.up_proj.weight.2.bin                model.layers.39.post_attention_layernorm.weight.bin
model.layers.20.mlp.up_proj.weight.3.bin                model.layers.4.attention.dense.weight.0.bin
model.layers.20.post_attention_layernorm.weight.bin     model.layers.4.attention.dense.weight.1.bin
model.layers.21.attention.dense.weight.0.bin            model.layers.4.attention.dense.weight.2.bin
model.layers.21.attention.dense.weight.1.bin            model.layers.4.attention.dense.weight.3.bin
model.layers.21.attention.dense.weight.2.bin            model.layers.4.attention.query_key_value.weight.0.bin
model.layers.21.attention.dense.weight.3.bin            model.layers.4.attention.query_key_value.weight.1.bin
model.layers.21.attention.query_key_value.weight.0.bin  model.layers.4.attention.query_key_value.weight.2.bin
model.layers.21.attention.query_key_value.weight.1.bin  model.layers.4.attention.query_key_value.weight.3.bin
model.layers.21.attention.query_key_value.weight.2.bin  model.layers.4.input_layernorm.weight.bin
model.layers.21.attention.query_key_value.weight.3.bin  model.layers.4.mlp.down_proj.weight.0.bin
model.layers.21.input_layernorm.weight.bin              model.layers.4.mlp.down_proj.weight.1.bin
model.layers.21.mlp.down_proj.weight.0.bin              model.layers.4.mlp.down_proj.weight.2.bin
model.layers.21.mlp.down_proj.weight.1.bin              model.layers.4.mlp.down_proj.weight.3.bin
model.layers.21.mlp.down_proj.weight.2.bin              model.layers.4.mlp.gate_proj.weight.0.bin
model.layers.21.mlp.down_proj.weight.3.bin              model.layers.4.mlp.gate_proj.weight.1.bin
model.layers.21.mlp.gate_proj.weight.0.bin              model.layers.4.mlp.gate_proj.weight.2.bin
model.layers.21.mlp.gate_proj.weight.1.bin              model.layers.4.mlp.gate_proj.weight.3.bin
model.layers.21.mlp.gate_proj.weight.2.bin              model.layers.4.mlp.up_proj.weight.0.bin
model.layers.21.mlp.gate_proj.weight.3.bin              model.layers.4.mlp.up_proj.weight.1.bin
model.layers.21.mlp.up_proj.weight.0.bin                model.layers.4.mlp.up_proj.weight.2.bin
model.layers.21.mlp.up_proj.weight.1.bin                model.layers.4.mlp.up_proj.weight.3.bin
model.layers.21.mlp.up_proj.weight.2.bin                model.layers.4.post_attention_layernorm.weight.bin
model.layers.21.mlp.up_proj.weight.3.bin                model.layers.5.attention.dense.weight.0.bin
model.layers.21.post_attention_layernorm.weight.bin     model.layers.5.attention.dense.weight.1.bin
model.layers.22.attention.dense.weight.0.bin            model.layers.5.attention.dense.weight.2.bin
model.layers.22.attention.dense.weight.1.bin            model.layers.5.attention.dense.weight.3.bin
model.layers.22.attention.dense.weight.2.bin            model.layers.5.attention.query_key_value.weight.0.bin
model.layers.22.attention.dense.weight.3.bin            model.layers.5.attention.query_key_value.weight.1.bin
model.layers.22.attention.query_key_value.weight.0.bin  model.layers.5.attention.query_key_value.weight.2.bin
model.layers.22.attention.query_key_value.weight.1.bin  model.layers.5.attention.query_key_value.weight.3.bin
model.layers.22.attention.query_key_value.weight.2.bin  model.layers.5.input_layernorm.weight.bin
model.layers.22.attention.query_key_value.weight.3.bin  model.layers.5.mlp.down_proj.weight.0.bin
model.layers.22.input_layernorm.weight.bin              model.layers.5.mlp.down_proj.weight.1.bin
model.layers.22.mlp.down_proj.weight.0.bin              model.layers.5.mlp.down_proj.weight.2.bin
model.layers.22.mlp.down_proj.weight.1.bin              model.layers.5.mlp.down_proj.weight.3.bin
model.layers.22.mlp.down_proj.weight.2.bin              model.layers.5.mlp.gate_proj.weight.0.bin
model.layers.22.mlp.down_proj.weight.3.bin              model.layers.5.mlp.gate_proj.weight.1.bin
model.layers.22.mlp.gate_proj.weight.0.bin              model.layers.5.mlp.gate_proj.weight.2.bin
model.layers.22.mlp.gate_proj.weight.1.bin              model.layers.5.mlp.gate_proj.weight.3.bin
model.layers.22.mlp.gate_proj.weight.2.bin              model.layers.5.mlp.up_proj.weight.0.bin
model.layers.22.mlp.gate_proj.weight.3.bin              model.layers.5.mlp.up_proj.weight.1.bin
model.layers.22.mlp.up_proj.weight.0.bin                model.layers.5.mlp.up_proj.weight.2.bin
model.layers.22.mlp.up_proj.weight.1.bin                model.layers.5.mlp.up_proj.weight.3.bin
model.layers.22.mlp.up_proj.weight.2.bin                model.layers.5.post_attention_layernorm.weight.bin
model.layers.22.mlp.up_proj.weight.3.bin                model.layers.6.attention.dense.weight.0.bin
model.layers.22.post_attention_layernorm.weight.bin     model.layers.6.attention.dense.weight.1.bin
model.layers.23.attention.dense.weight.0.bin            model.layers.6.attention.dense.weight.2.bin
model.layers.23.attention.dense.weight.1.bin            model.layers.6.attention.dense.weight.3.bin
model.layers.23.attention.dense.weight.2.bin            model.layers.6.attention.query_key_value.weight.0.bin
model.layers.23.attention.dense.weight.3.bin            model.layers.6.attention.query_key_value.weight.1.bin
model.layers.23.attention.query_key_value.weight.0.bin  model.layers.6.attention.query_key_value.weight.2.bin
model.layers.23.attention.query_key_value.weight.1.bin  model.layers.6.attention.query_key_value.weight.3.bin
model.layers.23.attention.query_key_value.weight.2.bin  model.layers.6.input_layernorm.weight.bin
model.layers.23.attention.query_key_value.weight.3.bin  model.layers.6.mlp.down_proj.weight.0.bin
model.layers.23.input_layernorm.weight.bin              model.layers.6.mlp.down_proj.weight.1.bin
model.layers.23.mlp.down_proj.weight.0.bin              model.layers.6.mlp.down_proj.weight.2.bin
model.layers.23.mlp.down_proj.weight.1.bin              model.layers.6.mlp.down_proj.weight.3.bin
model.layers.23.mlp.down_proj.weight.2.bin              model.layers.6.mlp.gate_proj.weight.0.bin
model.layers.23.mlp.down_proj.weight.3.bin              model.layers.6.mlp.gate_proj.weight.1.bin
model.layers.23.mlp.gate_proj.weight.0.bin              model.layers.6.mlp.gate_proj.weight.2.bin
model.layers.23.mlp.gate_proj.weight.1.bin              model.layers.6.mlp.gate_proj.weight.3.bin
model.layers.23.mlp.gate_proj.weight.2.bin              model.layers.6.mlp.up_proj.weight.0.bin
model.layers.23.mlp.gate_proj.weight.3.bin              model.layers.6.mlp.up_proj.weight.1.bin
model.layers.23.mlp.up_proj.weight.0.bin                model.layers.6.mlp.up_proj.weight.2.bin
model.layers.23.mlp.up_proj.weight.1.bin                model.layers.6.mlp.up_proj.weight.3.bin
model.layers.23.mlp.up_proj.weight.2.bin                model.layers.6.post_attention_layernorm.weight.bin
model.layers.23.mlp.up_proj.weight.3.bin                model.layers.7.attention.dense.weight.0.bin
model.layers.23.post_attention_layernorm.weight.bin     model.layers.7.attention.dense.weight.1.bin
model.layers.24.attention.dense.weight.0.bin            model.layers.7.attention.dense.weight.2.bin
model.layers.24.attention.dense.weight.1.bin            model.layers.7.attention.dense.weight.3.bin
model.layers.24.attention.dense.weight.2.bin            model.layers.7.attention.query_key_value.weight.0.bin
model.layers.24.attention.dense.weight.3.bin            model.layers.7.attention.query_key_value.weight.1.bin
model.layers.24.attention.query_key_value.weight.0.bin  model.layers.7.attention.query_key_value.weight.2.bin
model.layers.24.attention.query_key_value.weight.1.bin  model.layers.7.attention.query_key_value.weight.3.bin
model.layers.24.attention.query_key_value.weight.2.bin  model.layers.7.input_layernorm.weight.bin
model.layers.24.attention.query_key_value.weight.3.bin  model.layers.7.mlp.down_proj.weight.0.bin
model.layers.24.input_layernorm.weight.bin              model.layers.7.mlp.down_proj.weight.1.bin
model.layers.24.mlp.down_proj.weight.0.bin              model.layers.7.mlp.down_proj.weight.2.bin
model.layers.24.mlp.down_proj.weight.1.bin              model.layers.7.mlp.down_proj.weight.3.bin
model.layers.24.mlp.down_proj.weight.2.bin              model.layers.7.mlp.gate_proj.weight.0.bin
model.layers.24.mlp.down_proj.weight.3.bin              model.layers.7.mlp.gate_proj.weight.1.bin
model.layers.24.mlp.gate_proj.weight.0.bin              model.layers.7.mlp.gate_proj.weight.2.bin
model.layers.24.mlp.gate_proj.weight.1.bin              model.layers.7.mlp.gate_proj.weight.3.bin
model.layers.24.mlp.gate_proj.weight.2.bin              model.layers.7.mlp.up_proj.weight.0.bin
model.layers.24.mlp.gate_proj.weight.3.bin              model.layers.7.mlp.up_proj.weight.1.bin
model.layers.24.mlp.up_proj.weight.0.bin                model.layers.7.mlp.up_proj.weight.2.bin
model.layers.24.mlp.up_proj.weight.1.bin                model.layers.7.mlp.up_proj.weight.3.bin
model.layers.24.mlp.up_proj.weight.2.bin                model.layers.7.post_attention_layernorm.weight.bin
model.layers.24.mlp.up_proj.weight.3.bin                model.layers.8.attention.dense.weight.0.bin
model.layers.24.post_attention_layernorm.weight.bin     model.layers.8.attention.dense.weight.1.bin
model.layers.25.attention.dense.weight.0.bin            model.layers.8.attention.dense.weight.2.bin
model.layers.25.attention.dense.weight.1.bin            model.layers.8.attention.dense.weight.3.bin
model.layers.25.attention.dense.weight.2.bin            model.layers.8.attention.query_key_value.weight.0.bin
model.layers.25.attention.dense.weight.3.bin            model.layers.8.attention.query_key_value.weight.1.bin
model.layers.25.attention.query_key_value.weight.0.bin  model.layers.8.attention.query_key_value.weight.2.bin
model.layers.25.attention.query_key_value.weight.1.bin  model.layers.8.attention.query_key_value.weight.3.bin
model.layers.25.attention.query_key_value.weight.2.bin  model.layers.8.input_layernorm.weight.bin
model.layers.25.attention.query_key_value.weight.3.bin  model.layers.8.mlp.down_proj.weight.0.bin
model.layers.25.input_layernorm.weight.bin              model.layers.8.mlp.down_proj.weight.1.bin
model.layers.25.mlp.down_proj.weight.0.bin              model.layers.8.mlp.down_proj.weight.2.bin
model.layers.25.mlp.down_proj.weight.1.bin              model.layers.8.mlp.down_proj.weight.3.bin
model.layers.25.mlp.down_proj.weight.2.bin              model.layers.8.mlp.gate_proj.weight.0.bin
model.layers.25.mlp.down_proj.weight.3.bin              model.layers.8.mlp.gate_proj.weight.1.bin
model.layers.25.mlp.gate_proj.weight.0.bin              model.layers.8.mlp.gate_proj.weight.2.bin
model.layers.25.mlp.gate_proj.weight.1.bin              model.layers.8.mlp.gate_proj.weight.3.bin
model.layers.25.mlp.gate_proj.weight.2.bin              model.layers.8.mlp.up_proj.weight.0.bin
model.layers.25.mlp.gate_proj.weight.3.bin              model.layers.8.mlp.up_proj.weight.1.bin
model.layers.25.mlp.up_proj.weight.0.bin                model.layers.8.mlp.up_proj.weight.2.bin
model.layers.25.mlp.up_proj.weight.1.bin                model.layers.8.mlp.up_proj.weight.3.bin
model.layers.25.mlp.up_proj.weight.2.bin                model.layers.8.post_attention_layernorm.weight.bin
model.layers.25.mlp.up_proj.weight.3.bin                model.layers.9.attention.dense.weight.0.bin
model.layers.25.post_attention_layernorm.weight.bin     model.layers.9.attention.dense.weight.1.bin
model.layers.26.attention.dense.weight.0.bin            model.layers.9.attention.dense.weight.2.bin
model.layers.26.attention.dense.weight.1.bin            model.layers.9.attention.dense.weight.3.bin
model.layers.26.attention.dense.weight.2.bin            model.layers.9.attention.query_key_value.weight.0.bin
model.layers.26.attention.dense.weight.3.bin            model.layers.9.attention.query_key_value.weight.1.bin
model.layers.26.attention.query_key_value.weight.0.bin  model.layers.9.attention.query_key_value.weight.2.bin
model.layers.26.attention.query_key_value.weight.1.bin  model.layers.9.attention.query_key_value.weight.3.bin
model.layers.26.attention.query_key_value.weight.2.bin  model.layers.9.input_layernorm.weight.bin
model.layers.26.attention.query_key_value.weight.3.bin  model.layers.9.mlp.down_proj.weight.0.bin
model.layers.26.input_layernorm.weight.bin              model.layers.9.mlp.down_proj.weight.1.bin
model.layers.26.mlp.down_proj.weight.0.bin              model.layers.9.mlp.down_proj.weight.2.bin
model.layers.26.mlp.down_proj.weight.1.bin              model.layers.9.mlp.down_proj.weight.3.bin
model.layers.26.mlp.down_proj.weight.2.bin              model.layers.9.mlp.gate_proj.weight.0.bin
model.layers.26.mlp.down_proj.weight.3.bin              model.layers.9.mlp.gate_proj.weight.1.bin
model.layers.26.mlp.gate_proj.weight.0.bin              model.layers.9.mlp.gate_proj.weight.2.bin
model.layers.26.mlp.gate_proj.weight.1.bin              model.layers.9.mlp.gate_proj.weight.3.bin
model.layers.26.mlp.gate_proj.weight.2.bin              model.layers.9.mlp.up_proj.weight.0.bin
model.layers.26.mlp.gate_proj.weight.3.bin              model.layers.9.mlp.up_proj.weight.1.bin
model.layers.26.mlp.up_proj.weight.0.bin                model.layers.9.mlp.up_proj.weight.2.bin
model.layers.26.mlp.up_proj.weight.1.bin                model.layers.9.mlp.up_proj.weight.3.bin
model.layers.26.mlp.up_proj.weight.2.bin                model.layers.9.post_attention_layernorm.weight.bin
model.layers.26.mlp.up_proj.weight.3.bin                model.lm_head.weight.bin
model.layers.26.post_attention_layernorm.weight.bin     model.wte.weight.bin
SupreethRao99 commented 1 year ago

Is there code to autotune the CUDA kernels like in the GPT-J example? How could we modify the GPT-J kernel optimizer to work with LLaMA models?
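
For what it's worth, FasterTransformer's GEMM autotuning is normally done with the standalone gemm binaries built alongside the library rather than per-model code. A minimal sketch, assuming the stock gpt_gemm tool and LLaMA-13B shapes (head_num 40, size_per_head 128, inter_size 13824, vocab_size 32000); the argument order and data_type encoding may differ across FT versions, so check the tool's usage message first:

# Sketch: write gemm_config.in for LLaMA-13B, fp16, tensor_para_size=4
# gpt_gemm <batch> <beam> <max_input_len> <head_num> <size_per_head> \
#          <inter_size> <vocab_size> <data_type: 0=fp32,1=fp16,2=bf16> <tensor_para_size>
./bin/gpt_gemm 8 1 512 40 128 13824 32000 1 4

Run it from the directory you launch inference from, so the resulting gemm_config.in is found at runtime.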

Lzhang-hub commented 1 year ago

@SupreethRao99 You can refer to this directory structure; the Triton server needs a numeric version directory such as 1, 2, ... (a sketch of the expected layout follows): https://github.com/triton-inference-server/fastertransformer_backend/tree/main/all_models/gptj
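
A rough sketch of the layout the backend expects (names are illustrative; the converted checkpoint directory usually also contains a config.ini written by the conversion script):

models/vicuna-13b/
└── fastertransformer/
    ├── config.pbtxt
    └── 1/
        ├── config.ini
        └── model.layers.*.bin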

ZhuYuJin commented 1 year ago

@void-main Generation fails to stop early at end_id.

ZhuYuJin commented 1 year ago

@void-main Generation fails to stop early at end_id.

This MR fixes the bug. https://github.com/NVIDIA/FasterTransformer/pull/584/commits/622af28de55a09a253a23945d22f3015def49713

SupreethRao99 commented 1 year ago

@Lzhang-hub I followed the instructions and I'm getting the following error now:

root@eda821372bac:/workspace# /opt/tritonserver/bin/tritonserver --model-repository=./models/vicuna-13b
I0506 07:54:26.454366 15432 pinned_memory_manager.cc:240] Pinned memory pool is created at '0x7f758a000000' with size 268435456
I0506 07:54:26.458193 15432 cuda_memory_manager.cc:105] CUDA memory pool is created on device 0 with size 67108864
I0506 07:54:26.458208 15432 cuda_memory_manager.cc:105] CUDA memory pool is created on device 1 with size 67108864
I0506 07:54:26.458213 15432 cuda_memory_manager.cc:105] CUDA memory pool is created on device 2 with size 67108864
I0506 07:54:26.458225 15432 cuda_memory_manager.cc:105] CUDA memory pool is created on device 3 with size 67108864
W0506 07:54:26.876510 15432 server.cc:237] failed to enable peer access for some device pairs
I0506 07:54:26.889064 15432 model_lifecycle.cc:459] loading: fastertransformer:1
I0506 07:54:27.149907 15432 libfastertransformer.cc:1828] TRITONBACKEND_Initialize: fastertransformer
I0506 07:54:27.149962 15432 libfastertransformer.cc:1838] Triton TRITONBACKEND API version: 1.12
I0506 07:54:27.149979 15432 libfastertransformer.cc:1844] 'fastertransformer' TRITONBACKEND API version: 1.12
I0506 07:54:27.828205 15432 libfastertransformer.cc:1876] TRITONBACKEND_ModelInitialize: fastertransformer (version 1)
I0506 07:54:27.829096 15432 libfastertransformer.cc:372] Instance group type: KIND_CPU count: 1
I0506 07:54:27.829118 15432 libfastertransformer.cc:402] Sequence Batching: disabled
I0506 07:54:27.829131 15432 libfastertransformer.cc:412] Dynamic Batching: disabled
I0506 07:54:27.829299 15432 libfastertransformer.cc:1899] TRITONBACKEND_ModelFinalize: delete model state
I0506 07:54:27.829311 15432 libfastertransformer.cc:1904] TRITONBACKEND_ModelFinalize: MPI Finalize
E0506 07:54:27.883287 15432 model_lifecycle.cc:597] failed to load 'fastertransformer' version 1: Unsupported: Unknown model "vicuna-13b"
I0506 07:54:27.883453 15432 server.cc:583] 
+------------------+------+
| Repository Agent | Path |
+------------------+------+
+------------------+------+

I0506 07:54:27.883517 15432 server.cc:610] 
+-------------------+-----------------------------------------------------+-----------------------------------------------------+
| Backend           | Path                                                | Config                                              |
+-------------------+-----------------------------------------------------+-----------------------------------------------------+
| fastertransformer | /opt/tritonserver/backends/fastertransformer/libtri | {"cmdline":{"auto-complete-config":"true","backend- |
|                   | ton_fastertransformer.so                            | directory":"/opt/tritonserver/backends","min-comput |
|                   |                                                     | e-capability":"6.000000","default-max-batch-size":" |
|                   |                                                     | 4"}}                                                |
|                   |                                                     |                                                     |
+-------------------+-----------------------------------------------------+-----------------------------------------------------+

I0506 07:54:27.883574 15432 server.cc:653] 
+-------------------+---------+------------------------------------------------------+
| Model             | Version | Status                                               |
+-------------------+---------+------------------------------------------------------+
| fastertransformer | 1       | UNAVAILABLE: Unsupported: Unknown model "vicuna-13b" |
+-------------------+---------+------------------------------------------------------+

I0506 07:54:27.933348 15432 metrics.cc:808] Collecting metrics for GPU 0: Tesla T4
I0506 07:54:27.933420 15432 metrics.cc:808] Collecting metrics for GPU 1: Tesla T4
I0506 07:54:27.933434 15432 metrics.cc:808] Collecting metrics for GPU 2: Tesla T4
I0506 07:54:27.933451 15432 metrics.cc:808] Collecting metrics for GPU 3: Tesla T4
I0506 07:54:27.934718 15432 metrics.cc:701] Collecting CPU metrics
I0506 07:54:27.935008 15432 tritonserver.cc:2387] 
+----------------------------------+----------------------------------------------------------------------------------------------+
| Option                           | Value                                                                                        |
+----------------------------------+----------------------------------------------------------------------------------------------+
| server_id                        | triton                                                                                       |
| server_version                   | 2.33.0                                                                                       |
| server_extensions                | classification sequence model_repository model_repository(unload_dependents) schedule_policy |
|                                  |  model_configuration system_shared_memory cuda_shared_memory binary_tensor_data parameters s |
|                                  | tatistics trace logging                                                                      |
| model_repository_path[0]         | ./models/vicuna-13b                                                                          |
| model_control_mode               | MODE_NONE                                                                                    |
| strict_model_config              | 0                                                                                            |
| rate_limit                       | OFF                                                                                          |
| pinned_memory_pool_byte_size     | 268435456                                                                                    |
| cuda_memory_pool_byte_size{0}    | 67108864                                                                                     |
| cuda_memory_pool_byte_size{1}    | 67108864                                                                                     |
| cuda_memory_pool_byte_size{2}    | 67108864                                                                                     |
| cuda_memory_pool_byte_size{3}    | 67108864                                                                                     |
| min_supported_compute_capability | 6.0                                                                                          |
| strict_readiness                 | 1                                                                                            |
| exit_timeout                     | 30                                                                                           |
| cache_enabled                    | 0                                                                                            |
+----------------------------------+----------------------------------------------------------------------------------------------+

I0506 07:54:27.935044 15432 server.cc:284] Waiting for in-flight requests to complete.
I0506 07:54:27.935056 15432 server.cc:300] Timeout 30: Found 0 model versions that have in-flight inferences
I0506 07:54:27.935065 15432 server.cc:315] All models are stopped, unloading models
I0506 07:54:27.935077 15432 server.cc:322] Timeout 30: Found 0 live models and 0 in-flight non-inference requests
error: creating server: Internal - failed to load all models

My models are now stored under workspace/models/vicuna-13b/fastertransformer. That folder contains config.pbtxt and a directory named 1 with all the model weights. My new config.pbtxt contains:

name: "fastertransformer"
backend: "fastertransformer"
default_model_filename: "vicuna-13b"
max_batch_size: 1024

model_transaction_policy {
  decoupled: False
}

input [
  {
    name: "input_ids"
    data_type: TYPE_UINT32
    dims: [ -1 ]
  },
  {
    name: "start_id"
    data_type: TYPE_UINT32
    dims: [ 1 ]
    reshape: { shape: [ ] }
    optional: true
  },
  {
    name: "end_id"
    data_type: TYPE_UINT32
    dims: [ 1 ]
    reshape: { shape: [ ] }
    optional: true
  },
  {
    name: "input_lengths"
    data_type: TYPE_UINT32
    dims: [ 1 ]
    reshape: { shape: [ ] }
  },
  {
    name: "request_output_len"
    data_type: TYPE_UINT32
    dims: [ -1 ]
  },
  {
    name: "runtime_top_k"
    data_type: TYPE_UINT32
    dims: [ 1 ]
    reshape: { shape: [ ] }
    optional: true
  },
  {
    name: "runtime_top_p"
    data_type: TYPE_FP32
    dims: [ 1 ]
    reshape: { shape: [ ] }
    optional: true
  },
  {
    name: "beam_search_diversity_rate"
    data_type: TYPE_FP32
    dims: [ 1 ]
    reshape: { shape: [ ] }
    optional: true
  },
  {
    name: "temperature"
    data_type: TYPE_FP32
    dims: [ 1 ]
    reshape: { shape: [ ] }
    optional: true
  },
  {
    name: "len_penalty"
    data_type: TYPE_FP32
    dims: [ 1 ]
    reshape: { shape: [ ] }
    optional: true
  },
  {
    name: "repetition_penalty"
    data_type: TYPE_FP32
    dims: [ 1 ]
    reshape: { shape: [ ] }
    optional: true
  },
  {
    name: "random_seed"
    data_type: TYPE_UINT64
    dims: [ 1 ]
    reshape: { shape: [ ] }
    optional: true
  },
  {
    name: "is_return_log_probs"
    data_type: TYPE_BOOL
    dims: [ 1 ]
    reshape: { shape: [ ] }
    optional: true
  },
  {
    name: "beam_width"
    data_type: TYPE_UINT32
    dims: [ 1 ]
    reshape: { shape: [ ] }
    optional: true
  },
  {
    name: "bad_words_list"
    data_type: TYPE_INT32
    dims: [ 2, -1 ]
    optional: true
  },
  {
    name: "stop_words_list"
    data_type: TYPE_INT32
    dims: [ 2, -1 ]
    optional: true
  },
  {
    name: "prompt_learning_task_name_ids"
    data_type: TYPE_UINT32
    dims: [ 1 ]
    reshape: { shape: [ ] }
    optional: true
  },
  {
    name: "top_p_decay"
    data_type: TYPE_FP32
    dims: [ 1 ]
    reshape: { shape: [ ] }
    optional: true
  },
  {
    name: "top_p_min"
    data_type: TYPE_FP32
    dims: [ 1 ]
    reshape: { shape: [ ] }
    optional: true
  },
  {
    name: "top_p_reset_ids"
    data_type: TYPE_UINT32
    dims: [ 1 ]
    reshape: { shape: [ ] }
    optional: true
  }
]
output [
  {
    name: "output_ids"
    data_type: TYPE_UINT32
    dims: [ -1, -1 ]
  },
  {
    name: "sequence_length"
    data_type: TYPE_UINT32
    dims: [ -1 ]
  },
  {
    name: "cum_log_probs"
    data_type: TYPE_FP32
    dims: [ -1 ]
  },
  {
    name: "output_log_probs"
    data_type: TYPE_FP32
    dims: [ -1, -1 ]
  }
]
instance_group [
  {
    count: 1
    kind: KIND_CPU
  }
]
parameters {
  key: "tensor_para_size"
  value: {
    string_value: "4"
  }
}
parameters {
  key: "pipeline_para_size"
  value: {
    string_value: "1"
  }
}
parameters {
  key: "data_type"
  value: {
    string_value: "fp16"
  }
}
parameters {
  key: "model_type"
  value: {
    string_value: "vicuna-13b"
  }
}
parameters {
  key: "model_checkpoint_path"
  value: {
    string_value: "/workspace/models/vicuna-13b/fastertransformer/1/"
  }
}
parameters {
  key: "enable_custom_all_reduce"
  value: {
    string_value: "0"
  }
}
void-main commented 1 year ago

@void-main Generation fails to stop early at end_id.

This MR fixes the bug. 622af28

Got it, merging it now

void-main commented 1 year ago

@Lzhang-hub I followed the instructions and I'm getting the following error now:

E0506 07:54:27.883287 15432 model_lifecycle.cc:597] failed to load 'fastertransformer' version 1: Unsupported: Unknown model "vicuna-13b"

[full log and config.pbtxt quoted above]

@SupreethRao99 Try changing model_type to Llama in your vicuna-13b/fastertransformer/config.pbtxt. The backend selects the model implementation from the model_type parameter, so setting it to "vicuna-13b" is what produces the Unknown model error. A sketch of the corrected block is below.
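
A minimal sketch of the corrected parameter block (everything else in the config can stay as-is):

parameters {
  key: "model_type"
  value: {
    string_value: "Llama"
  }
}

default_model_filename is just a label; it is the model_type string that must match a type the FT backend knows, assuming a build that includes the Llama support discussed in this thread.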