NVIDIA / FasterTransformer

Transformer related optimization, including BERT, GPT
Apache License 2.0

Performance Degradation when using FP16 #175

Closed shimoshida closed 2 years ago

shimoshida commented 2 years ago

Information

I want to run the GPT-J model in FP16 precision (https://huggingface.co/EleutherAI/gpt-j-6B/tree/float16) on FasterTransformer + Triton, but I am having trouble with accuracy. For example, the sentences below are generated when I follow the sample scripts with FasterTransformer.

Environment

To reproduce

  1. Download the FP16 gpt-j-6b checkpoint pytorch_model.bin from https://huggingface.co/EleutherAI/gpt-j-6B/tree/float16, rename it to gpt-j.pt, and store it as gpt-j/gpt-j.pt.

  2. Convert the PyTorch model to FasterTransformer format via the following script, run inside the Docker image nvcr.io/nvidia/pytorch:21.07-py3.

from argparse import ArgumentParser
from io import BytesIO
from os import makedirs
import numpy as np
import torch

torch.set_printoptions(linewidth=130, sci_mode=False)
np.set_printoptions(linewidth=130, suppress=True)

# NOTE: reshard, get_old_shape, and read_shard below are never called in this
# script; they appear to be leftovers from the JAX-shard conversion utilities.
def reshard(x, old_shape):
    import jax.numpy as jnp
    if len(x.shape) == 1:
        out = x[0:1]

    elif len(x.shape) == 2:
        if (x[1:] == x[-1]).all():
            if (x[1:] == 0).all() or (x[1:] == 1).all():
                out = x[0:1]
            else:
                out = x[0:1] * 8  # * x.shape[0] / old_shape[0]
        else:
            out = x.reshape(old_shape)

    elif len(x.shape) == 3:
        if x.shape[0] * x.shape[2] == old_shape[2]:
            out = jnp.transpose(x, (1, 0, 2)).reshape(old_shape)
        elif x.shape[0] * x.shape[1] == old_shape[1]:
            out = x.reshape(old_shape)
        else:
            raise Exception(f"unimplemented, {x.shape}, {old_shape}")
    else:
        raise Exception(f"unimplemented, {x}")
    return out

def get_old_shape(t, dim=2):
    if len(t.shape) == 3:
        shard_shape = t.shape
        if dim == 1:
            return (shard_shape[0] * shard_shape[1], shard_shape[2])
        elif dim == 2:
            return (shard_shape[1], shard_shape[0] * shard_shape[2])
        else:
            raise ValueError(f"unsupported dim {dim}")
    if len(t.shape) == 2:
        return (t.shape[1] * t.shape[0],)
    else:
        raise ValueError(f"unsupported shape {t.shape}")

def read_shard(ckpt_dir, idx):
    out = []
    file_path = ckpt_dir + f"{idx}.npz"
    with open(file_path, "rb") as f:
        buf = f.read()
        f_io = BytesIO(buf)
        deserialized = np.load(f_io)
        for i in deserialized:
            out.append(deserialized[i])
    return out

def savebin(param, save_path):
    if isinstance(param, torch.Tensor):
        param = param.cpu().float().numpy()
    np.squeeze(param).astype(np.float32).tofile(save_path + ".bin")

def param2file(pt_param, layer_id, save_dir, dest_key):
    base_n = save_dir + "/model.layers." + str(layer_id) + "."
    save_path = base_n + dest_key
    savebin(pt_param, save_path)

def param2distributed(
    pt_param,
    layer_id,
    save_dir,
    dest_key,
    n_inference_gpus,
    split_axis,
):
    np_param = pt_param.cpu().float().numpy()
    base_n = save_dir + "/model.layers." + str(layer_id) + "."
    save_path = base_n + dest_key
    split_param = np.split(np_param, n_inference_gpus, axis=split_axis)
    for i, p in enumerate(split_param):
        savebin(p, save_path + f".{i}")

def save(w, save_dir, n_inference_gpus=1, num_layers=28):
    makedirs(save_dir, exist_ok=True)
    savebin(w['transformer.wte.weight'], save_dir + "/model.wte")
    for l in range(num_layers):
        print(f"Saving layer {l} / {num_layers}")
        base_k = "transformer.h." + str(l) + "."
        param2file(
          w[base_k + "ln_1.bias"],
          l, save_dir, "input_layernorm.bias"
        )
        param2file(
          w[base_k + "ln_1.weight"],
          l, save_dir, "input_layernorm.weight"
        )
        param2distributed(
          w[base_k + "mlp.fc_in.weight"].T, # fc_in weight
          l, save_dir, "mlp.dense_h_to_4h.weight",
          n_inference_gpus, split_axis=-1 # split fast indx
        )
        param2distributed(
          w[base_k + "mlp.fc_in.bias"], # fc_in bias
          l, save_dir, "mlp.dense_h_to_4h.bias",
          n_inference_gpus, split_axis=-1 # split fast indx
        )

        param2distributed(
          w[base_k + "mlp.fc_out.weight"].T, # fc_out weight
          l, save_dir, "mlp.dense_4h_to_h.weight",
          n_inference_gpus, split_axis=0  # split slow indx
        )
        param2file(
          w[base_k + "mlp.fc_out.bias"], # fc_out bias
          l, save_dir, "mlp.dense_4h_to_h.bias"
        )
        param2distributed(
          w[base_k + "attn.out_proj.weight"].T,
          l, save_dir, "attention.dense.weight",
          n_inference_gpus, split_axis=0  # split slow indx
        )
        QKV_w = torch.stack([
          w[base_k + "attn.q_proj.weight"],
          w[base_k + "attn.k_proj.weight"],
          w[base_k + "attn.v_proj.weight"],
        ]) # [qkv, n_heads * dim_head, latent_space]
        QKV_w = QKV_w.permute(2, 0, 1)
        param2distributed(
          QKV_w, l, save_dir, "attention.query_key_value.weight",
          n_inference_gpus, split_axis=-1 # split fast indx
        )
        # Other unneeded per-layer params:
        # attn.attention.masked_bias = torch.tensor(-1e9)
        # attn.attention.bias = torch.tril(torch.ones(1, 1, 2048, 2048))
    savebin(w['transformer.ln_f.weight'], save_dir + "/model.final_layernorm.weight")
    savebin(w['transformer.ln_f.bias'], save_dir + "/model.final_layernorm.bias")
    # lm head fast index should be hidden layer size, not vocab:
    savebin(w['lm_head.weight'], save_dir + "/model.lm_head.weight")
    savebin(w['lm_head.bias'], save_dir + "/model.lm_head.bias")

if __name__ == "__main__":
    parser = ArgumentParser(
        description="Convert GPT-J slim checkpoint to FasterTransformer",
    )
    parser.add_argument(
        "--to", default="triton-model-store/fastertransformer/1/gpt-j-6b/"
    )
    parser.add_argument(
        "--f", default="gpt-j/gpt-j.pt"
    )
    args = parser.parse_args()

    print("loading")
    in_path = args.f
    output_dir = args.to

    if in_path.endswith(".pt"):
        checkpoint = torch.load(in_path)
    else:
        raise ValueError("please provide a *.pt checkpoint file")

    print("saving")
    save(checkpoint, output_dir)
    print("done")
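The QKV packing done in save() above can be illustrated on toy shapes. This is a NumPy stand-in for the torch.stack + permute calls, with made-up dimensions (the real model uses 16 heads of size 256 and hidden size 4096):

```python
import numpy as np

hidden, n_heads, dim_head = 8, 2, 4  # toy dimensions, not the real model's

# each projection weight is [n_heads * dim_head, hidden], as in the checkpoint
q = np.zeros((n_heads * dim_head, hidden))
k = np.zeros((n_heads * dim_head, hidden))
v = np.zeros((n_heads * dim_head, hidden))

qkv = np.stack([q, k, v])           # [3, n_heads * dim_head, hidden]
qkv = np.transpose(qkv, (2, 0, 1))  # [hidden, 3, n_heads * dim_head]

print(qkv.shape)  # (8, 3, 8)
```

Splitting the result on the last axis (split_axis=-1, as in param2distributed) then distributes heads across tensor-parallel ranks.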

  3. Add the following config.pbtxt to triton-model-store/fastertransformer/config.pbtxt. Note that I changed the temperature to 0.9 from the sample GPT-J config.
```
name: "fastertransformer"
backend: "fastertransformer"
default_model_filename: "gpt-j-6b"
max_batch_size: 128
input [
  {
    name: "INPUT_ID"
    data_type: TYPE_UINT32
    dims: [ -1, -1 ]
  },
  {
    name: "REQUEST_INPUT_LEN"
    data_type: TYPE_UINT32
    dims: [ 1 ]
  },
  {
    name: "REQUEST_OUTPUT_LEN"
    data_type: TYPE_UINT32
    dims: [ 1 ]
  }
]
output [
  {
    name: "OUTPUT0"
    data_type: TYPE_UINT32
    dims: [ -1, -1 ]
  }
]
instance_group [
  {
    count: 1
    kind: KIND_CPU
  }
]
parameters { key: "top_k" value: { string_value: "1" } }
parameters { key: "top_p" value: { string_value: "0.0" } }
parameters { key: "tensor_para_size" value: { string_value: "1" } }
parameters { key: "pipeline_para_size" value: { string_value: "1" } }
parameters { key: "max_input_len" value: { string_value: "512" } }
parameters { key: "max_seq_len" value: { string_value: "528" } }
parameters { key: "is_half" value: { string_value: "1" } }
parameters { key: "head_num" value: { string_value: "16" } }
parameters { key: "size_per_head" value: { string_value: "256" } }
parameters { key: "inter_size" value: { string_value: "16384" } }
parameters { key: "rotary_embedding" value: { string_value: "64" } }
parameters { key: "vocab_size" value: { string_value: "50400" } }
parameters { key: "start_id" value: { string_value: "50256" } }
parameters { key: "end_id" value: { string_value: "50256" } }
parameters { key: "decoder_layers" value: { string_value: "28" } }
parameters { key: "model_name" value: { string_value: "gpt-j-6b" } }
parameters { key: "beam_width" value: { string_value: "1" } }
parameters { key: "temperature" value: { string_value: "0.9" } }
parameters { key: "repetition_penalty" value: { string_value: "1.0" } }
parameters { key: "len_penalty" value: { string_value: "1.0" } }
parameters { key: "beam_search_diversity_rate" value: { string_value: "0.0" } }
dynamic_batching {
  preferred_batch_size: [4, 8]
  max_queue_delay_microseconds: 200000
}
parameters { key: "model_type" value: { string_value: "GPT-J" } }
```
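One side note on this config: with top_k set to "1", decoding is effectively greedy, so a positive temperature cannot change which token is picked, because softmax(logits / T) is a monotone transform of the logits. A quick self-contained check (my own sketch, not backend code):

```python
import numpy as np

def greedy_token(logits, temperature):
    # softmax(logits / T) preserves the ordering of logits for any T > 0,
    # so the argmax (i.e. top_k = 1 decoding) is independent of temperature
    scaled = logits / temperature
    probs = np.exp(scaled - scaled.max())
    probs /= probs.sum()
    return int(probs.argmax())

logits = np.array([1.0, 3.5, 2.2, -0.7])
print(greedy_token(logits, 1.0), greedy_token(logits, 0.9))  # same index both times
```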
  4. Build and run the following Docker image:
```
FROM nvcr.io/nvidia/tritonserver:21.07-py3

ARG work_dir="/workspace"
ARG lib_dir="/opt/tritonserver"
WORKDIR ${lib_dir}

# settings
RUN apt-get update
RUN apt-get install --yes python3-dev rapidjson-dev
RUN wget https://github.com/Kitware/CMake/releases/download/v3.21.1/cmake-3.21.1-linux-x86_64.tar.gz
RUN tar -axf cmake-3.21.1-linux-x86_64.tar.gz
ENV PATH=${lib_dir}/cmake-3.21.1-linux-x86_64/bin/:$PATH
RUN pip3 install tritonclient[all] fire regex
RUN git clone https://github.com/triton-inference-server/fastertransformer_backend.git -b dev/v1.1_beta
RUN git clone https://github.com/NVIDIA/FasterTransformer.git -b dev/v5.0_beta
RUN git clone https://github.com/triton-inference-server/server.git
# We need some tools when we test this backend
RUN ln -s server/qa/common .
ENV CONTAINER_VERSION=21.07
ENV TRITON_DOCKER_IMAGE=triton_with_ft:${CONTAINER_VERSION}

# install ft backend
RUN mkdir -p fastertransformer_backend/build
WORKDIR /opt/tritonserver/fastertransformer_backend/build
RUN cmake -DCMAKE_EXPORT_COMPILE_COMMANDS=1 \
    -DCMAKE_BUILD_TYPE=Release \
    -DCMAKE_INSTALL_PREFIX=/opt/tritonserver \
    -DTRITON_COMMON_REPO_TAG="r${NVIDIA_TRITON_SERVER_VERSION}" \
    -DTRITON_CORE_REPO_TAG="r${NVIDIA_TRITON_SERVER_VERSION}" \
    -DTRITON_BACKEND_REPO_TAG="r${NVIDIA_TRITON_SERVER_VERSION}" ..
RUN make -j install

# model file settings
WORKDIR ${work_dir}
RUN pip3 install transformers
```

```
docker run --gpus all --rm -it \
    -p 8000:8000 -p 8001:8001 -p 8002:8002 \
    -v $(curr_dir):/workspace \
    $(tag) \
    bash
```

  5. In the above image, run Triton:

```
mpirun -n 1 --allow-run-as-root tritonserver \
    --model-repository=/workspace/triton-model-store &
```

  6. Run the following chat.py via python3 chat.py:

```
#!/usr/bin/python
import argparse
import numpy as np
import os
import re
import sys
import requests as httpreq
from builtins import range
import statistics as s

import tritonclient.http as httpclient
from tritonclient.utils import np_to_triton_dtype
from transformers import AutoTokenizer


def inference(input_data: np.ndarray, fixed_output_len: int) -> np.ndarray:
    """
    input_data: (batch_size, 1, sentence_len)
    """
    model_name = "fastertransformer"
    # shape
    input_len = np.array([[sentence.size] for sentence in input_data], np.uint32)
    output_len = np.ones_like(input_len).astype(np.uint32) * fixed_output_len
    with httpclient.InferenceServerClient(
        "localhost:8000", concurrency=1, verbose=True
    ) as client:
        inputs = [
            httpclient.InferInput("INPUT_ID", input_data.shape,
                                  np_to_triton_dtype(input_data.dtype)),
            httpclient.InferInput("REQUEST_INPUT_LEN", input_len.shape,
                                  np_to_triton_dtype(input_len.dtype)),
            httpclient.InferInput("REQUEST_OUTPUT_LEN", output_len.shape,
                                  np_to_triton_dtype(output_len.dtype)),
        ]
        inputs[0].set_data_from_numpy(input_data)
        inputs[1].set_data_from_numpy(input_len)
        inputs[2].set_data_from_numpy(output_len)
        # requests.append(client.async_infer(model_name, inputs))
        print("send request")
        result = client.infer(model_name, inputs)
        return result.as_numpy("OUTPUT0")


def gpt_j():
    tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-j-6B")
    prompt = "The Belgian national football team "
    tokens = tokenizer(prompt, return_tensors="np").input_ids.astype(np.uint32)
    tokens = tokens.reshape((1, 1, -1))
    FIXED_OUTPUT_LEN = 200
    last_tokens = inference(tokens, FIXED_OUTPUT_LEN)
    generated_text = tokenizer.decode(last_tokens[0][0])
    print("Generated:", generated_text)


def main():
    gpt_j()


if __name__ == '__main__':
    main()
```
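The tensor shapes chat.py sends to Triton can be checked in isolation. A sketch with a dummy 7-token prompt (values are arbitrary):

```python
import numpy as np

tokens = np.arange(7, dtype=np.uint32).reshape(1, 1, -1)     # (batch, 1, seq_len)
input_len = np.array([[s.size] for s in tokens], np.uint32)  # (batch, 1)
output_len = np.ones_like(input_len) * np.uint32(200)        # (batch, 1)

print(tokens.shape, input_len.shape, output_len.shape)  # (1, 1, 7) (1, 1) (1, 1)
```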

Expected Behavior

The reference below produces accurate sentences.

Ref: https://colab.research.google.com/github/NielsRogge/Transformers-Tutorials/blob/master/GPT-J-6B/Inference_with_GPT_J_6B.ipynb#scrollTo=RdOynYcY8jb1

output:

The Belgian national football team  (,, ), known as the Blue and Whites, represents Belgium in international football competitions organised by FIFA and the governing body for football in Belgium, the Royal Belgian Football Association (,, ). It is coached by former Netherlands international and UEFA Euro 1984 winner Dick Advocaat, who was appointed in January 2018 after the departure of Michel Preud'homme. The Belgium team has been a force in international football since the 1960s, winning the 1974 FIFA World Cup and Euro 2000. It also qualified for UEFA Euro 2020. The Belgium national team is based and plays its games in the Antwerp region, with Rupelstad Stadion, home to its first- and second-tier matches, as a regular venue.

Belgium played its first official international match on 21 January 1920, losing 0–2 to the Netherlands in Rotterdam. Belgium and the Netherlands have played each other in 15 matches, with the Dutch winning 10 times and

Related Issue

NVIDIA/FasterTransformer#172

byshiue commented 2 years ago

Do you get the correct result under FP32? I tried to reproduce your problem but failed. Using your prompt, I get the following output on both FP32 and FP16 when the output length is set to 64.

The Belgian national football team  is the official name of a selection made by Belgium's Football Federation (, , ) to play in international matches. The current head coach and manager are Roberto Martínez who took over from Marc Wilmots on 1 June 2019 after he was sacked following their exit at Euro 2020 qualifying Group A stage

I don't use the FP16 checkpoint because the converter will convert it back to FP32.

shimoshida commented 2 years ago

@byshiue Thank you for your quick response!

Do you get correct result under FP32?

I am trying to test accuracy using FP32, but my objective is to use the FP16 checkpoint, because for some models the Hugging Face repository provides only FP16 weights, e.g., https://huggingface.co/NovelAI/genji-jp/tree/main.
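For FP16-only checkpoints, one workaround (an untested sketch; paths in the usage comment are illustrative) is to upcast the state dict to FP32 before running the converter, since the converter assumes FP32 tensors:

```python
import torch

def upcast_state_dict(ckpt):
    """Return a copy of the state dict with FP16 tensors upcast to FP32."""
    return {
        k: v.float() if torch.is_tensor(v) and v.dtype == torch.float16 else v
        for k, v in ckpt.items()
    }

# usage (paths are illustrative):
# ckpt = torch.load("gpt-j/gpt-j.pt", map_location="cpu")
# torch.save(upcast_state_dict(ckpt), "gpt-j/gpt-j-fp32.pt")
```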

I don't use the FP16 checkpoint because the converter will convert it back to FP32.

Yes, I am aware that tensors are converted to FP32 before saving. I also tried saving the tensors in FP16, but FasterTransformer cannot load such tensors...

[WARNING] file /workspace/triton-model-store/fastertransformer/1/gpt-j-6b/model.wte.bin only has 412876800, but request 825753600, loading model fails!
[WARNING] file /workspace/triton-model-store/fastertransformer/1/gpt-j-6b/model.final_layernorm.bias.bin only has 8192, but request 16384, loading model fails!
[WARNING] file /workspace/triton-model-store/fastertransformer/1/gpt-j-6b/model.final_layernorm.weight.bin only has 8192, but request 16384, loading model fails!
[WARNING] file /workspace/triton-model-store/fastertransformer/1/gpt-j-6b/model.lm_head.weight.bin only has 412876800, but request 825753600, loading model fails!
[WARNING] file /workspace/triton-model-store/fastertransformer/1/gpt-j-6b/model.lm_head.bias.bin only has 100800, but request 201600, loading model fails!
[WARNING] file /workspace/triton-model-store/fastertransformer/1/gpt-j-6b/model.layers.0.input_layernorm.bias.bin only has 8192, but request 16384, loading model fails!
[WARNING] file /workspace/triton-model-store/fastertransformer/1/gpt-j-6b/model.layers.0.input_layernorm.weight.bin only has 8192, but request 16384, loading model fails!
[WARNING] file /workspace/triton-model-store/fastertransformer/1/gpt-j-6b/model.layers.0.attention.query_key_value.weight.0.bin only has 100663296, but request 201326592, loading model fails!
[WARNING] file /workspace/triton-model-store/fastertransformer/1/gpt-j-6b/model.layers.0.attention.dense.weight.0.bin only has 33554432, but request 67108864, loading model fails!
[WARNING] file /workspace/triton-model-store/fastertransformer/1/gpt-j-6b/model.layers.0.mlp.dense_h_to_4h.weight.0.bin only has 134217728, but request 268435456, loading model fails!
[WARNING] file /workspace/triton-model-store/fastertransformer/1/gpt-j-6b/model.layers.0.mlp.dense_h_to_4h.bias.0.bin only has 32768, but request 65536, loading model fails!
[WARNING] file /workspace/triton-model-store/fastertransformer/1/gpt-j-6b/model.layers.0.mlp.dense_4h_to_h.weight.0.bin only has 134217728, but request 268435456, loading model fails!
[WARNING] file /workspace/triton-model-store/fastertransformer/1/gpt-j-6b/model.layers.0.mlp.dense_4h_to_h.bias.bin only has 8192, but request 16384, loading model fails!
[WARNING] file /workspace/triton-model-store/fastertransformer/1/gpt-j-6b/model.layers.1.input_layernorm.bias.bin only has 8192, but request 16384, loading model fails!
[WARNING] file /workspace/triton-model-store/fastertransformer/1/gpt-j-6b/model.layers.1.input_layernorm.weight.bin only has 8192, but request 16384, loading model fails!
[WARNING] file /workspace/triton-model-store/fastertransformer/1/gpt-j-6b/model.layers.1.attention.query_key_value.weight.0.bin only has 100663296, but request 201326592, loading model fails!
[WARNING] file /workspace/triton-model-store/fastertransformer/1/gpt-j-6b/model.layers.1.attention.dense.weight.0.bin only has 33554432, but request 67108864, loading model fails!
[WARNING] file /workspace/triton-model-store/fastertransformer/1/gpt-j-6b/model.layers.1.mlp.dense_h_to_4h.weight.0.bin only has 134217728, but request 268435456, loading model fails!
[WARNING] file /workspace/triton-model-store/fastertransformer/1/gpt-j-6b/model.layers.1.mlp.dense_h_to_4h.bias.0.bin only has 32768, but request 65536, loading model fails!
[WARNING] file /workspace/triton-model-store/fastertransformer/1/gpt-j-6b/model.layers.1.mlp.dense_4h_to_h.weight.0.bin only has 134217728, but request 268435456, loading model fails!
[WARNING] file /workspace/triton-model-store/fastertransformer/1/gpt-j-6b/model.layers.1.mlp.dense_4h_to_h.bias.bin only has 8192, but request 16384, loading model fails!
[WARNING] file /workspace/triton-model-store/fastertransformer/1/gpt-j-6b/model.layers.2.input_layernorm.bias.bin only has 8192, but request 16384, loading model fails!
[WARNING] file /workspace/triton-model-store/fastertransformer/1/gpt-j-6b/model.layers.2.input_layernorm.weight.bin only has 8192, but request 16384, loading model fails!
[WARNING] file /workspace/triton-model-store/fastertransformer/1/gpt-j-6b/model.layers.2.attention.query_key_value.weight.0.bin only has 100663296, but request 201326592, loading model fails!
[WARNING] file /workspace/triton-model-store/fastertransformer/1/gpt-j-6b/model.layers.2.attention.dense.weight.0.bin only has 33554432, but request 67108864, loading model fails!
[WARNING] file /workspace/triton-model-store/fastertransformer/1/gpt-j-6b/model.layers.2.mlp.dense_h_to_4h.weight.0.bin only has 134217728, but request 268435456, loading model fails!
[WARNING] file /workspace/triton-model-store/fastertransformer/1/gpt-j-6b/model.layers.2.mlp.dense_h_to_4h.bias.0.bin only has 32768, but request 65536, loading model fails!
[WARNING] file /workspace/triton-model-store/fastertransformer/1/gpt-j-6b/model.layers.2.mlp.dense_4h_to_h.weight.0.bin only has 134217728, but request 268435456, loading model fails!
[WARNING] file /workspace/triton-model-store/fastertransformer/1/gpt-j-6b/model.layers.2.mlp.dense_4h_to_h.bias.bin only has 8192, but request 16384, loading model fails!
[WARNING] file /workspace/triton-model-store/fastertransformer/1/gpt-j-6b/model.layers.3.input_layernorm.bias.bin only has 8192, but request 16384, loading model fails!
[WARNING] file /workspace/triton-model-store/fastertransformer/1/gpt-j-6b/model.layers.3.input_layernorm.weight.bin only has 8192, but request 16384, loading model fails!
[WARNING] file /workspace/triton-model-store/fastertransformer/1/gpt-j-6b/model.layers.3.attention.query_key_value.weight.0.bin only has 100663296, but request 201326592, loading model fails!
[WARNING] file /workspace/triton-model-store/fastertransformer/1/gpt-j-6b/model.layers.3.attention.dense.weight.0.bin only has 33554432, but request 67108864, loading model fails!
[WARNING] file /workspace/triton-model-store/fastertransformer/1/gpt-j-6b/model.layers.3.mlp.dense_h_to_4h.weight.0.bin only has 134217728, but request 268435456, loading model fails!
[WARNING] file /workspace/triton-model-store/fastertransformer/1/gpt-j-6b/model.layers.3.mlp.dense_h_to_4h.bias.0.bin only has 32768, but request 65536, loading model fails!
[WARNING] file /workspace/triton-model-store/fastertransformer/1/gpt-j-6b/model.layers.3.mlp.dense_4h_to_h.weight.0.bin only has 134217728, but request 268435456, loading model fails!
[WARNING] file /workspace/triton-model-store/fastertransformer/1/gpt-j-6b/model.layers.3.mlp.dense_4h_to_h.bias.bin only has 8192, but request 16384, loading model fails!
[WARNING] file /workspace/triton-model-store/fastertransformer/1/gpt-j-6b/model.layers.4.input_layernorm.bias.bin only has 8192, but request 16384, loading model fails!
[WARNING] file /workspace/triton-model-store/fastertransformer/1/gpt-j-6b/model.layers.4.input_layernorm.weight.bin only has 8192, but request 16384, loading model fails!
[WARNING] file /workspace/triton-model-store/fastertransformer/1/gpt-j-6b/model.layers.4.attention.query_key_value.weight.0.bin only has 100663296, but request 201326592, loading model fails!
[WARNING] file /workspace/triton-model-store/fastertransformer/1/gpt-j-6b/model.layers.4.attention.dense.weight.0.bin only has 33554432, but request 67108864, loading model fails!
[WARNING] file /workspace/triton-model-store/fastertransformer/1/gpt-j-6b/model.layers.4.mlp.dense_h_to_4h.weight.0.bin only has 134217728, but request 268435456, loading model fails!
[WARNING] file /workspace/triton-model-store/fastertransformer/1/gpt-j-6b/model.layers.4.mlp.dense_h_to_4h.bias.0.bin only has 32768, but request 65536, loading model fails!
[WARNING] file /workspace/triton-model-store/fastertransformer/1/gpt-j-6b/model.layers.4.mlp.dense_4h_to_h.weight.0.bin only has 134217728, but request 268435456, loading model fails!
[WARNING] file /workspace/triton-model-store/fastertransformer/1/gpt-j-6b/model.layers.4.mlp.dense_4h_to_h.bias.bin only has 8192, but request 16384, loading model fails!
[WARNING] file /workspace/triton-model-store/fastertransformer/1/gpt-j-6b/model.layers.5.input_layernorm.bias.bin only has 8192, but request 16384, loading model fails!
[WARNING] file /workspace/triton-model-store/fastertransformer/1/gpt-j-6b/model.layers.5.input_layernorm.weight.bin only has 8192, but request 16384, loading model fails!
[WARNING] file /workspace/triton-model-store/fastertransformer/1/gpt-j-6b/model.layers.5.attention.query_key_value.weight.0.bin only has 100663296, but request 201326592, loading model fails!
[WARNING] file /workspace/triton-model-store/fastertransformer/1/gpt-j-6b/model.layers.5.attention.dense.weight.0.bin only has 33554432, but request 67108864, loading model fails!
[WARNING] file /workspace/triton-model-store/fastertransformer/1/gpt-j-6b/model.layers.5.mlp.dense_h_to_4h.weight.0.bin only has 134217728, but request 268435456, loading model fails!
[WARNING] file /workspace/triton-model-store/fastertransformer/1/gpt-j-6b/model.layers.5.mlp.dense_h_to_4h.bias.0.bin only has 32768, but request 65536, loading model fails!
[WARNING] file /workspace/triton-model-store/fastertransformer/1/gpt-j-6b/model.layers.5.mlp.dense_4h_to_h.weight.0.bin only has 134217728, but request 268435456, loading model fails!
[WARNING] file /workspace/triton-model-store/fastertransformer/1/gpt-j-6b/model.layers.5.mlp.dense_4h_to_h.bias.bin only has 8192, but request 16384, loading model fails!
[WARNING] file /workspace/triton-model-store/fastertransformer/1/gpt-j-6b/model.layers.6.input_layernorm.bias.bin only has 8192, but request 16384, loading model fails!
[WARNING] file /workspace/triton-model-store/fastertransformer/1/gpt-j-6b/model.layers.6.input_layernorm.weight.bin only has 8192, but request 16384, loading model fails!
[WARNING] file /workspace/triton-model-store/fastertransformer/1/gpt-j-6b/model.layers.6.attention.query_key_value.weight.0.bin only has 100663296, but request 201326592, loading model fails!
[WARNING] file /workspace/triton-model-store/fastertransformer/1/gpt-j-6b/model.layers.6.attention.dense.weight.0.bin only has 33554432, but request 67108864, loading model fails!
[WARNING] file /workspace/triton-model-store/fastertransformer/1/gpt-j-6b/model.layers.6.mlp.dense_h_to_4h.weight.0.bin only has 134217728, but request 268435456, loading model fails!
[WARNING] file /workspace/triton-model-store/fastertransformer/1/gpt-j-6b/model.layers.6.mlp.dense_h_to_4h.bias.0.bin only has 32768, but request 65536, loading model fails!
[WARNING] file /workspace/triton-model-store/fastertransformer/1/gpt-j-6b/model.layers.6.mlp.dense_4h_to_h.weight.0.bin only has 134217728, but request 268435456, loading model fails!
[WARNING] file /workspace/triton-model-store/fastertransformer/1/gpt-j-6b/model.layers.6.mlp.dense_4h_to_h.bias.bin only has 8192, but request 16384, loading model fails!
[WARNING] file /workspace/triton-model-store/fastertransformer/1/gpt-j-6b/model.layers.7.input_layernorm.bias.bin only has 8192, but request 16384, loading model fails!
[WARNING] file /workspace/triton-model-store/fastertransformer/1/gpt-j-6b/model.layers.7.input_layernorm.weight.bin only has 8192, but request 16384, loading model fails!
[WARNING] file /workspace/triton-model-store/fastertransformer/1/gpt-j-6b/model.layers.7.attention.query_key_value.weight.0.bin only has 100663296, but request 201326592, loading model fails!
[WARNING] file /workspace/triton-model-store/fastertransformer/1/gpt-j-6b/model.layers.7.attention.dense.weight.0.bin only has 33554432, but request 67108864, loading model fails!
[WARNING] file /workspace/triton-model-store/fastertransformer/1/gpt-j-6b/model.layers.7.mlp.dense_h_to_4h.weight.0.bin only has 134217728, but request 268435456, loading model fails!
[WARNING] file /workspace/triton-model-store/fastertransformer/1/gpt-j-6b/model.layers.7.mlp.dense_h_to_4h.bias.0.bin only has 32768, but request 65536, loading model fails!
[WARNING] file /workspace/triton-model-store/fastertransformer/1/gpt-j-6b/model.layers.7.mlp.dense_4h_to_h.weight.0.bin only has 134217728, but request 268435456, loading model fails!
[WARNING] file /workspace/triton-model-store/fastertransformer/1/gpt-j-6b/model.layers.7.mlp.dense_4h_to_h.bias.bin only has 8192, but request 16384, loading model fails!
[WARNING] file /workspace/triton-model-store/fastertransformer/1/gpt-j-6b/model.layers.8.input_layernorm.bias.bin only has 8192, but request 16384, loading model fails!
[WARNING] file /workspace/triton-model-store/fastertransformer/1/gpt-j-6b/model.layers.8.input_layernorm.weight.bin only has 8192, but request 16384, loading model fails!
[WARNING] file /workspace/triton-model-store/fastertransformer/1/gpt-j-6b/model.layers.8.attention.query_key_value.weight.0.bin only has 100663296, but request 201326592, loading model fails!
[WARNING] file /workspace/triton-model-store/fastertransformer/1/gpt-j-6b/model.layers.8.attention.dense.weight.0.bin only has 33554432, but request 67108864, loading model fails!
[WARNING] file /workspace/triton-model-store/fastertransformer/1/gpt-j-6b/model.layers.8.mlp.dense_h_to_4h.weight.0.bin only has 134217728, but request 268435456, loading model fails!
[WARNING] file /workspace/triton-model-store/fastertransformer/1/gpt-j-6b/model.layers.8.mlp.dense_h_to_4h.bias.0.bin only has 32768, but request 65536, loading model fails!
[WARNING] file /workspace/triton-model-store/fastertransformer/1/gpt-j-6b/model.layers.8.mlp.dense_4h_to_h.weight.0.bin only has 134217728, but request 268435456, loading model fails!
[WARNING] file /workspace/triton-model-store/fastertransformer/1/gpt-j-6b/model.layers.8.mlp.dense_4h_to_h.bias.bin only has 8192, but request 16384, loading model fails!
[WARNING] file /workspace/triton-model-store/fastertransformer/1/gpt-j-6b/model.layers.9.input_layernorm.bias.bin only has 8192, but request 16384, loading model fails!
[WARNING] file /workspace/triton-model-store/fastertransformer/1/gpt-j-6b/model.layers.9.input_layernorm.weight.bin only has 8192, but request 16384, loading model fails!
[WARNING] file /workspace/triton-model-store/fastertransformer/1/gpt-j-6b/model.layers.9.attention.query_key_value.weight.0.bin only has 100663296, but request 201326592, loading model fails!
[WARNING] file /workspace/triton-model-store/fastertransformer/1/gpt-j-6b/model.layers.9.attention.dense.weight.0.bin only has 33554432, but request 67108864, loading model fails!
[WARNING] file /workspace/triton-model-store/fastertransformer/1/gpt-j-6b/model.layers.9.mlp.dense_h_to_4h.weight.0.bin only has 134217728, but request 268435456, loading model fails!
[WARNING] file /workspace/triton-model-store/fastertransformer/1/gpt-j-6b/model.layers.9.mlp.dense_h_to_4h.bias.0.bin only has 32768, but request 65536, loading model fails!
[WARNING] file /workspace/triton-model-store/fastertransformer/1/gpt-j-6b/model.layers.9.mlp.dense_4h_to_h.weight.0.bin only has 134217728, but request 268435456, loading model fails!
[WARNING] file /workspace/triton-model-store/fastertransformer/1/gpt-j-6b/model.layers.9.mlp.dense_4h_to_h.bias.bin only has 8192, but request 16384, loading model fails!
[WARNING] file /workspace/triton-model-store/fastertransformer/1/gpt-j-6b/model.layers.10.input_layernorm.bias.bin only has 8192, but request 16384, loading model fails!
[WARNING] file /workspace/triton-model-store/fastertransformer/1/gpt-j-6b/model.layers.10.input_layernorm.weight.bin only has 8192, but request 16384, loading model fails!
[WARNING] file /workspace/triton-model-store/fastertransformer/1/gpt-j-6b/model.layers.10.attention.query_key_value.weight.0.bin only has 100663296, but request 201326592, loading model fails!
[WARNING] file /workspace/triton-model-store/fastertransformer/1/gpt-j-6b/model.layers.10.attention.dense.weight.0.bin only has 33554432, but request 67108864, loading model fails!
[WARNING] file /workspace/triton-model-store/fastertransformer/1/gpt-j-6b/model.layers.10.mlp.dense_h_to_4h.weight.0.bin only has 134217728, but request 268435456, loading model fails!
[WARNING] file /workspace/triton-model-store/fastertransformer/1/gpt-j-6b/model.layers.10.mlp.dense_h_to_4h.bias.0.bin only has 32768, but request 65536, loading model fails!
[WARNING] file /workspace/triton-model-store/fastertransformer/1/gpt-j-6b/model.layers.10.mlp.dense_4h_to_h.weight.0.bin only has 134217728, but request 268435456, loading model fails!
[WARNING] file /workspace/triton-model-store/fastertransformer/1/gpt-j-6b/model.layers.10.mlp.dense_4h_to_h.bias.bin only has 8192, but request 16384, loading model fails!
[WARNING] file /workspace/triton-model-store/fastertransformer/1/gpt-j-6b/model.layers.11.input_layernorm.bias.bin only has 8192, but request 16384, loading model fails!
[WARNING] file /workspace/triton-model-store/fastertransformer/1/gpt-j-6b/model.layers.11.input_layernorm.weight.bin only has 8192, but request 16384, loading model fails!
[WARNING] file /workspace/triton-model-store/fastertransformer/1/gpt-j-6b/model.layers.11.attention.query_key_value.weight.0.bin only has 100663296, but request 201326592, loading model fails!
[WARNING] file /workspace/triton-model-store/fastertransformer/1/gpt-j-6b/model.layers.11.attention.dense.weight.0.bin only has 33554432, but request 67108864, loading model fails!
[WARNING] file /workspace/triton-model-store/fastertransformer/1/gpt-j-6b/model.layers.11.mlp.dense_h_to_4h.weight.0.bin only has 134217728, but request 268435456, loading model fails!
[WARNING] file /workspace/triton-model-store/fastertransformer/1/gpt-j-6b/model.layers.11.mlp.dense_h_to_4h.bias.0.bin only has 32768, but request 65536, loading model fails!
[WARNING] file /workspace/triton-model-store/fastertransformer/1/gpt-j-6b/model.layers.11.mlp.dense_4h_to_h.weight.0.bin only has 134217728, but request 268435456, loading model fails!
[WARNING] file /workspace/triton-model-store/fastertransformer/1/gpt-j-6b/model.layers.11.mlp.dense_4h_to_h.bias.bin only has 8192, but request 16384, loading model fails!
[... the same set of warnings repeats for model.layers.12 through model.layers.27 ...]
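Note that every "only has" size in these warnings is exactly half of the requested size, which is consistent with FP16 weight files being read as if they were FP32 (2 bytes per element on disk vs. 4 bytes expected). A quick sanity check of the arithmetic, assuming GPT-J 6B's hidden size of 4096:

```python
import numpy as np

# GPT-J 6B dimensions (hidden size 4096, FFN size 4 * hidden)
hidden = 4096
inter = 4 * hidden

# Element counts of the tensors named in the warnings above
tensors = {
    "attention.query_key_value.weight": 3 * hidden * hidden,
    "attention.dense.weight": hidden * hidden,
    "mlp.dense_h_to_4h.weight": hidden * inter,
    "mlp.dense_h_to_4h.bias": inter,
    "mlp.dense_4h_to_h.bias": hidden,
}

for name, n in tensors.items():
    fp16 = n * np.dtype(np.float16).itemsize  # what the file actually has
    fp32 = n * np.dtype(np.float32).itemsize  # what FT requests
    print(f"{name}: file has {fp16} bytes, requested {fp32} bytes")
```

For example, `query_key_value.weight` gives 100663296 bytes in FP16 versus 201326592 in FP32, matching the first warning exactly.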

Do you mean that FasterTransformer converts FP32 parameters to FP16 dynamically in FP16 inference mode? If so, does FasterTransformer not support FP16 checkpoints?

byshiue commented 2 years ago

Yes. FT assumes that the checkpoint is always in FP32. If you set is_half=1, FT converts the model to FP16 while loading it. As far as I know, the model you linked above provides both FP32 and FP16 weights; we can test them first to make sure that the additional casting during conversion does not affect the results. If the casting does affect the results, you can try modifying the loading function loadWeightFromBin in memory_utils.h to load the FP16 weights directly.
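An alternative to patching memory_utils.h could be to upcast the converted FP16 *.bin files to FP32 on disk so they match what FT expects to read. A minimal sketch (the directory names are hypothetical):

```python
import numpy as np
from pathlib import Path

def upcast_bins(src_dir, dst_dir):
    """Rewrite every FP16 *.bin file in src_dir as FP32 in dst_dir."""
    src, dst = Path(src_dir), Path(dst_dir)
    dst.mkdir(parents=True, exist_ok=True)
    for f in src.glob("*.bin"):
        w = np.fromfile(f, dtype=np.float16)
        w.astype(np.float32).tofile(dst / f.name)  # file size doubles

# e.g. upcast_bins("gpt-j-6b-fp16", "gpt-j-6b-fp32")
```

This sidesteps the size-mismatch warnings at the cost of doubling the on-disk checkpoint size; the FP16-to-FP32 cast itself is lossless.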

shimoshida commented 2 years ago

@byshiue Sure, I'll try modifying memory_utils.h. By the way, I have now finished testing accuracy with the FP32 checkpoint, but it also failed... I used the following FP32 checkpoint, renamed it model.pt, and followed the procedure described above.

https://huggingface.co/EleutherAI/gpt-j-6B/tree/main

Which weights are needed to reproduce the sentences you generated?

byshiue commented 2 years ago

What do you mean by "failed"? Could you not convert the model, or did it not generate correct results? I followed the gptj_guide.md to download and convert the model.

shimoshida commented 2 years ago

@byshiue "failed" means that the generated sentences are not correct, like the following:

Generated: The Belgian national football team is the national football team of Belgium. It is controlled by the Belgian Football Association of the Belgian Football Association of Football Association (Federation of Football Federation (Federation of Wallonia, the Belgian Football Association (Federation (Federation (Federation (Federation (Federation) and the Belgian Football Association (F

I followed the gptj_guide.md to download and convert the model.

Oh, I see. If you don't mind, could you please try the reproduction method I described? FasterTransformer doesn't seem to work well if you simply convert the checkpoint provided by Hugging Face.

byshiue commented 2 years ago

Sorry, I misunderstood something. As you say, the FT converter does not support Hugging Face checkpoints. So if you want to load a Hugging Face model, you need to modify the converter. I therefore think this is not a precision problem but a conversion problem.

bharatv007 commented 2 years ago

GPT-J also has a different architecture from the vanilla decoder transformer. Does FasterTransformer support different architectures as well?

byshiue commented 2 years ago

FT supports GPT-J, the standard encoder-decoder transformer, BERT, Longformer, and T5.

shimoshida commented 2 years ago

@byshiue

So if you want to load a Hugging Face model, you need to modify the converter. I therefore think this is not a precision problem but a conversion problem.

The difference between the Hugging Face and the original gpt-j-6b checkpoints is just the layer names, right? I compared the values after converting the parameters to *.bin format and confirmed that they are almost the same:

huggingface: [-0.00404739  0.01963806 -0.00400543 -0.00257874 -0.00688553  0.02464294  0.01841736 -0.02111816  0.0171814  -0.00888824]
original:    [-0.00405884  0.01960754 -0.00401306 -0.00258636 -0.0068779   0.0246582   0.01843262 -0.02108765  0.01715088 -0.00888824]
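For a more systematic check than eyeballing the first few values, the two sets of converted *.bin files can be compared tensor by tensor. A small sketch (the example paths and the 1e-3 tolerance are my assumptions, the latter chosen to absorb FP16 rounding error):

```python
import numpy as np

def compare_bins(a_path, b_path, dtype=np.float32, atol=1e-3):
    """Return (matches, max_abs_diff) for two converted weight files."""
    a = np.fromfile(a_path, dtype=dtype)
    b = np.fromfile(b_path, dtype=dtype)
    if a.shape != b.shape:
        return False, float("inf")  # element counts differ
    diff = float(np.abs(a - b).max())
    return diff <= atol, diff

# e.g. compare_bins("hf-bins/model.wte.bin", "orig-bins/model.wte.bin")
```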

Do you mean that the following conversion script from *.pt to *.bin is wrong? The name mapping succeeds when using that script.

https://github.com/NVIDIA/FasterTransformer/blob/dev/v5.0_beta/examples/pytorch/gptj/utils/gptj_ckpt_convert.py

byshiue commented 2 years ago

The difference between the Hugging Face and the original gpt-j-6b checkpoints is just the layer names, right?

I don't know. If you think the difference is just the layer names, you can try:

  1. Run the original GPT-J with FT to verify correctness. (We have tested this case, and it should work.)
  2. Convert the Hugging Face GPT-J checkpoint to the original GPT-J format by name mapping, and verify that both checkpoints are the same.

If 1 and 2 are correct, then you should be able to run the Hugging Face GPT-J.

byshiue commented 2 years ago

Hi, can you try the tag dev/v5.0_beta_2021.09_tag?

shimoshida commented 2 years ago

@byshiue Thank you for the information! As you suggested, I obtained accurate outputs by using the tag dev/v5.0_beta_2021.09_tag. Thank you for your help!

byshiue commented 2 years ago

Sorry, it seems there are some bugs in the latest code; I will fix them as soon as possible.