NVIDIA / FasterTransformer

Transformer related optimization, including BERT, GPT

question about gpt_op.cc in tf op #86

Closed gyin94 closed 3 years ago

gyin94 commented 3 years ago

I am trying to change gpt_op.cc to be similar to gpt.h in the torch op, so that it uses start_ids and attention_mask, but I got the following error. Any idea or suggestion?

this->get_tensor(context, 21, &decoding_params.d_attn_mask);

error:

/workspace/FasterTransformer/fastertransformer/tf_op/gpt_op.cc: In instantiation of ‘void tensorflow::{anonymous}::DecodingGPTOp<Device, T>::Compute(tensorflow::OpKernelContext*) [with Device = Eigen::GpuDevice; T = Eigen::half]’:
/workspace/FasterTransformer/fastertransformer/tf_op/gpt_op.cc:106:10:   required from here
/workspace/FasterTransformer/fastertransformer/tf_op/gpt_op.cc:205:9: error: no matching function for call to ‘tensorflow::{anonymous}::DecodingGPTOp<Eigen::GpuDevice, Eigen::half>::get_tensor(tensorflow::OpKernelContext*&, int, __half**)’
  205 |         this->get_tensor(context, 21, &decoding_params.d_attn_mask);
      |         ^~~~
In file included from /workspace/FasterTransformer/fastertransformer/tf_op/gpt_op.cc:23:
/workspace/FasterTransformer/fastertransformer/tf_op/common_op.h:60:8: note: candidate: ‘template<class DataType_> void tensorflow::{anonymous}::CommonOp<T>::get_tensor(tensorflow::OpKernelContext*, int, const DataType_**, int) [with DataType_ = DataType_; T = Eigen::half]’
   60 |   void get_tensor(OpKernelContext *context, int tensor_id, const DataType_** tensor_ptr, int off_set = 0){
      |        ^~~~~~~~~~
/workspace/FasterTransformer/fastertransformer/tf_op/common_op.h:60:8: note:   template argument deduction/substitution failed:
/workspace/FasterTransformer/fastertransformer/tf_op/gpt_op.cc:205:9: note:   types ‘const DataType_’ and ‘__half’ have incompatible cv-qualifiers
  205 |         this->get_tensor(context, 21, &decoding_params.d_attn_mask);
byshiue commented 3 years ago

d_attn_mask is declared as "T*", not "const T*". You can change the declaration to "const T*" directly to fix this problem.
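
For reference, a minimal sketch of that change, assuming the member lives in DecodingInitParam as in the snippet above (not the exact FasterTransformer source):

// Assumed sketch: making the attention-mask pointer const so that
// get_tensor's "const DataType_**" parameter can be deduced.
template <typename T>
struct DecodingInitParam {
    // ... other members ...
    const T *d_attn_mask = nullptr;   // was: T *d_attn_mask;
};

// In gpt_op.cc, inside Compute(), the existing call then matches the
// candidate get_tensor(OpKernelContext*, int, const DataType_**, int):
//   this->get_tensor(context, 21, &decoding_params.d_attn_mask);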

gyin94 commented 3 years ago

Thanks, that solves the problem. @byshiue, can I ask why the transformer weights are initialized in Compute instead of the constructor in the TF op? For the th_op and the C++ interface, the transformer weight initialization is in the constructor.

byshiue commented 3 years ago

Because we cannot get the TensorFlow weights in the constructor.

gyin94 commented 3 years ago

Do you mean we can't declare params and decoding_params as private members of the TF op and initialize them in the constructor, and then retrieve and use them in the Compute section?

torch op:

private:
  ...
  const int max_batch_size_;
  DecoderInitParam<T> *param;
  DecodingInitParam<T> decoding_params;
};
byshiue commented 3 years ago

We cannot be sure that the weights are the same every time.

gyin94 commented 3 years ago

"We cannot be sure that the weights are the same every time."

I am a little confused. params is the model weight, and it will be the same as long as it is the same model. The FasterTransformer torch op and the C++ interface do the weight initialization in the constructor instead of Compute. Why is the TF op different from them?

If I understand correctly, what varies between requests is batch_size, start_ids, and attention_mask, not the model weights (after initialization).

byshiue commented 3 years ago

In TensorFlow, we cannot get the weights in the constructor, so we set the weights in Compute. Users can also change the weights between different computations if they need to, and the setting does not bring overhead.
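
To illustrate why, here is a hedged sketch of a TensorFlow custom kernel (assumed shape, not the actual gpt_op.cc): the weights are op inputs, and inputs only exist in the OpKernelContext passed to Compute(), whereas the constructor only receives an OpKernelConstruction, which exposes attrs but no input tensors.

#include "tensorflow/core/framework/op_kernel.h"

// Assumed sketch: what each phase of a TF kernel can see.
class DecodingGPTOpSketch : public tensorflow::OpKernel {
 public:
  explicit DecodingGPTOpSketch(tensorflow::OpKernelConstruction *ctx)
      : OpKernel(ctx) {
    // Only attrs (e.g. head_num) are available here; the weight tensors are
    // op inputs and do not exist yet at construction time.
    OP_REQUIRES_OK(ctx, ctx->GetAttr("head_num", &head_num_));
  }

  void Compute(tensorflow::OpKernelContext *ctx) override {
    // Inputs, including every weight tensor, are only materialized per run,
    // so this is the first place a raw device pointer can be taken and
    // stored into DecodingInitParam / DecoderInitParam.
    const tensorflow::Tensor &beta = ctx->input(0);
    const void *d_beta = beta.tensor_data().data();
    (void)d_beta;
  }

 private:
  int head_num_ = 0;
};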

gyin94 commented 3 years ago

@byshiue This is weird. The PyTorch op is much faster than the TF op (1.6x). See the two modified python files below to reproduce; they use exactly the same decoding parameters and input. (I also noticed this large latency difference between the context-enabled TF op and torch op.) Any idea or suggestion?

model: openai/gpt2/124M, fp16=true, max_seq_len=8, output_len=7, start_ids=[0], batch_size=1. Latency: PyTorch op 13.9 ms, TF op 22.3 ms.

./bin/gpt_gemm 1 1 12 64 50257 8 1 1

tensorflow/gpt_sample.py


import fire
import json
import os
import numpy as np
import tensorflow as tf

from tensorflow.contrib.training import HParams
import sys
sys.path.append("../sample")
import pytorch.utils.gpt_token_encoder as encoder
from utils.common import TransformerArgument
from utils.common import DecodingGpt2Argument
from utils.common import time_test
from utils.encoder import build_sequence_mask

def sample_model(
    model_name='124M',
    nsamples=1,
    batch_size=1,
    max_seq_len=8,
    temperature=1,
    top_k=1,
    top_p=0,
    models_dir='models',
    data_type='fp32',
    time=True,
):
    """Run the sample_model.

    :model_name=124M : String, which model to use
    :nsamples=0 : Number of samples to return, if 0, continues to
     generate samples indefinitely.
    :batch_size=1 : Number of batches (only affects speed/memory).
    :length=None : Number of tokens in generated text, if None (default), is
     determined by model hyperparameters
    :temperature=1 : Float value controlling randomness in Boltzmann
     distribution. Lower temperature results in less random completions. As the
     temperature approaches zero, the model will become deterministic and
     repetitive. Higher temperature results in more random completions.
    :top_k=4 : Integer value controlling diversity. 1 means only 1 word is
     considered for each step (token), resulting in deterministic completions,
     while 40 means 40 words are considered at each step. 0 (default) is a
     special setting meaning no restrictions. 40 generally is a good value.
     :models_dir : path to parent folder containing model subfolders
     (i.e. contains the <model_name> folder)
    """
    np.random.seed(1)
    tf.set_random_seed(1)

    if data_type == 'fp32':
        tf_data_type = tf.float32
    elif data_type == 'fp16':
        tf_data_type = tf.float16
    else:
        assert False

    vocab_file=os.path.join(models_dir, model_name, 'encoder.json')
    bpe_file=os.path.join(models_dir, model_name, 'vocab.bpe')
    enc = encoder.get_encoder(vocab_file, bpe_file)
    hparams = HParams(n_vocab=0,
                      n_ctx=1024,
                      n_embd=768,
                      n_head=12,
                      n_layer=12)

    with open(os.path.join(models_dir, model_name, 'hparams.json')) as f:
        hparams.override_from_dict(json.load(f))

    if max_seq_len is None:
        max_seq_len = hparams.n_ctx
    elif max_seq_len > hparams.n_ctx:
        raise ValueError("Can't get samples longer than window size: %s" % hparams.n_ctx)

    config = tf.ConfigProto()
    config.gpu_options.allow_growth = True
    with tf.Session(graph=tf.Graph(), config=config) as sess:
        saver = tf.train.import_meta_graph("{}/{}/model.ckpt.meta".format(models_dir, model_name))

        lengths = np.array([1]*batch_size)
        min_start_length = lengths.min()
        max_start_length = lengths.max()
        attention_mask = np.tile(np.tri(min_start_length), (batch_size, 1, 1))

        start_ids = np.ones([batch_size, max_start_length]) * enc.encoder['!']
        print(f"start_ids: {start_ids}")

        sess.run(tf.global_variables_initializer())
        print("[INFO] restore the model {}/{}".format(models_dir, model_name))
        saver.restore(sess, ("{}/{}/model.ckpt".format(models_dir, model_name)))

        decoder_args = TransformerArgument(beam_width=1,
                                           head_num=hparams.n_head,
                                           size_per_head=hparams.n_embd // hparams.n_head,
                                           num_layer=hparams.n_layer,
                                           dtype=tf_data_type,
                                           kernel_init_range=0.00,
                                           bias_init_range=0.00)

        decoding_args = DecodingGpt2Argument(hparams.n_vocab,
                                             enc.encoder['<|endoftext|>'],
                                             enc.encoder['<|endoftext|>'],
                                             max_seq_len,
                                             decoder_args,
                                             top_k,
                                             top_p,
                                             temperature)

        ckpt_dict = {}
        for var in tf.trainable_variables():
            ckpt_dict[var.name] = var
        decoding_vars = tf.trainable_variables()

        op_output = ft_gpt_op(decoding_vars,
                              decoding_args,
                              batch_size,
                              start_ids,
                              min_start_length,
                              max_start_length,
                              attention_mask)

        generated = 0
        num_tokens = 0
        while nsamples == 0 or generated < nsamples:
            op_out = sess.run(op_output)

            for i in range(batch_size):
                generated += 1

                text = enc.decode(op_out[i])
                num_tokens = len(op_out[i])
                print(f"tokens: {op_out[i]}")
                print("=" * 40 + " SAMPLE " + str(generated) + " " + "=" * 40)
                print(text)

        # Measure inference time.
        if time:
            time_cost = time_test(sess, op_output, iterations=10)
            # the first token would always be 0 since the current gpt op doesn't take in context
            print("[INFO] GPT time costs: {:.2f} ms and number of generated tokens: {}".format(time_cost, num_tokens-1))

def preprocess_decoder_var(decoding_vars,
                            num_layer,
                            using_model_var,
                            checkpoint_filename,
                            data_type,
                            fuse_qkv=True):
    '''
    Args:
        decoding_vars: A list of tf.Tensor. The variables of decoding.
        num_layer: An int value. The number of transformer layers of the decoder in decoding.
        using_model_var: A bool value. Whether to use the model variables of TensorFlow.
                         If True, the TensorFlow model variables of the decoding model are put into the decoding op directly.
                            The data type is a TensorFlow tensor in this case.

                         If False, the values of the variables are restored from checkpoint_filename and put
                         into the decoding op.
                            The data type is numpy in this case.
        checkpoint_filename: A string. The name of the checkpoint file storing the values of the model. The checkpoint should be stored as a
                         pickle file, and its name should be xxx.pkl.
                         The model is saved as a dict.
                         The keys of the dict are the names of the variables.
                         The values of the dict are the values of the variables.
                         For example, if decoding_vars[0]=<tf.Variable 'transformer/decoder/layer_0/masked_multi_head/LayerNorm/beta:0' shape=(512,) dtype=float32_ref>,
                         then the key is 'transformer/decoder/layer_0/masked_multi_head/LayerNorm/beta:0' and the value is sess.run(decoding_vars[0]).
        data_type: tf.float32 or tf.float16.
                   Only used when using_model_var is False, to convert the numpy data to the data type of the model.

    Outputs:
        vars_in_diff_layers_dict: A dict that stores the variables by their names.

                                For decoder variables, the key is like 'transformer/decoder/layer/masked_multi_head/LayerNorm/beta:0',
                                which is similar to the variable name except that it uses 'layer' instead of 'layer_x'. The value is a list
                                that contains 'transformer/decoder/layer_%d/masked_multi_head/LayerNorm/beta:0' % i for i in range(num_layer).

                                For other variables, the key is the name of the variable and the value is the corresponding weight.

                                Note that we return the concatenated weights. The concat operation brings extra overhead and should be optimized in
                                the real application. The recommended method is to pre-process the weights into numpy format beforehand, because
                                TensorFlow re-runs these operations on every inference if TensorFlow is used to pre-process the weights.
    '''

    var_dict = {}
    for var in decoding_vars:
        var_dict[var.name] = var

    vars_in_diff_layers_dict = {}
    vars_in_diff_layers_dict["transformer/decoder/LayerNorm/beta:0"] = tf.cast(var_dict["model/ln_f/b:0"], dtype=data_type)
    vars_in_diff_layers_dict["transformer/decoder/LayerNorm/gamma:0"] = tf.cast(var_dict["model/ln_f/g:0"], dtype=data_type)
    vars_in_diff_layers_dict["model/wpe:0"] = tf.cast(var_dict["model/wpe:0"], dtype=data_type)
    vars_in_diff_layers_dict["model/wte:0"] = tf.cast(var_dict["model/wte:0"], dtype=data_type)

    for i in range(num_layer):
        """Handling the names of q, k, v kernel and bias because their names
        are different for fusing the qkv or not."""

        layer_prefix_name = "transformer/decoder/layer_%d/" % i
        gpt2_layer_prefix_name = "model/h%d/" % i

        var_dict[layer_prefix_name + 'masked_multi_head/query/kernel:0'], \
        var_dict[layer_prefix_name + 'masked_multi_head/key/kernel:0'], \
        var_dict[layer_prefix_name + 'masked_multi_head/value/kernel:0'] = tf.split(var_dict[gpt2_layer_prefix_name + 'attn/c_attn/w:0'], 3, axis=-1)

        var_dict[layer_prefix_name + 'masked_multi_head/query/bias:0'], \
        var_dict[layer_prefix_name + 'masked_multi_head/key/bias:0'], \
        var_dict[layer_prefix_name + 'masked_multi_head/value/bias:0'] = tf.split(var_dict[gpt2_layer_prefix_name + 'attn/c_attn/b:0'], 3, axis=-1)

    layer_prefix_name = 'transformer/decoder/layer'
    vars_in_diff_layers_dict[layer_prefix_name + '/masked_multi_head/LayerNorm/beta:0'] = \
        tf.cast(tf.concat([ var_dict['model/h%d/ln_1/b:0' % i] for i in range(num_layer) ], axis=0), dtype=data_type)
    vars_in_diff_layers_dict[layer_prefix_name + '/masked_multi_head/LayerNorm/gamma:0'] = \
        tf.cast(tf.concat([ var_dict['model/h%d/ln_1/g:0' % i] for i in range(num_layer) ], axis=0), dtype=data_type)

    vars_in_diff_layers_dict[layer_prefix_name + '/masked_multi_head/conv1d/kernel:0'] = \
        tf.cast(tf.concat([ var_dict['model/h%d/attn/c_attn/w:0' % i] for i in range(num_layer) ], axis=0), dtype=data_type)
    vars_in_diff_layers_dict[layer_prefix_name + '/masked_multi_head/conv1d/bias:0'] = \
        tf.cast(tf.concat([ var_dict['model/h%d/attn/c_attn/b:0' % i] for i in range(num_layer) ], axis=0), dtype=data_type)
    vars_in_diff_layers_dict[layer_prefix_name + '/masked_multi_head/query/kernel:0'] = \
        tf.cast(tf.concat([ var_dict[layer_prefix_name + '_%d/masked_multi_head/query/kernel:0' % i] for i in range(num_layer) ], axis=0), dtype=data_type)
    vars_in_diff_layers_dict[layer_prefix_name + '/masked_multi_head/query/bias:0'] = \
        tf.cast(tf.concat([ var_dict[layer_prefix_name + '_%d/masked_multi_head/query/bias:0' % i] for i in range(num_layer) ], axis=0), dtype=data_type)
    vars_in_diff_layers_dict[layer_prefix_name + '/masked_multi_head/key/kernel:0'] = \
        tf.cast(tf.concat([ var_dict[layer_prefix_name + '_%d/masked_multi_head/key/kernel:0' % i] for i in range(num_layer) ], axis=0), dtype=data_type)
    vars_in_diff_layers_dict[layer_prefix_name + '/masked_multi_head/key/bias:0'] = \
        tf.cast(tf.concat([ var_dict[layer_prefix_name + '_%d/masked_multi_head/key/bias:0' % i] for i in range(num_layer) ], axis=0), dtype=data_type)
    vars_in_diff_layers_dict[layer_prefix_name + '/masked_multi_head/value/kernel:0'] = \
        tf.cast(tf.concat([ var_dict[layer_prefix_name + '_%d/masked_multi_head/value/kernel:0' % i] for i in range(num_layer) ], axis=0), dtype=data_type)
    vars_in_diff_layers_dict[layer_prefix_name + '/masked_multi_head/value/bias:0'] = \
        tf.cast(tf.concat([ var_dict[layer_prefix_name + '_%d/masked_multi_head/value/bias:0' % i] for i in range(num_layer) ], axis=0), dtype=data_type)

    vars_in_diff_layers_dict[layer_prefix_name + '/masked_multi_head/conv1d_1/kernel:0'] = \
        tf.cast(tf.concat([ var_dict['model/h%d/attn/c_proj/w:0' % i] for i in range(num_layer) ], axis=0), dtype=data_type)
    vars_in_diff_layers_dict[layer_prefix_name + '/masked_multi_head/conv1d_1/bias:0'] = \
        tf.cast(tf.concat([ var_dict['model/h%d/attn/c_proj/b:0' % i] for i in range(num_layer) ], axis=0), dtype=data_type)

    vars_in_diff_layers_dict[layer_prefix_name + '/ffn/LayerNorm/beta:0'] = \
        tf.cast(tf.concat([ var_dict['model/h%d/ln_2/b:0' % i] for i in range(num_layer) ], axis=0), dtype=data_type)
    vars_in_diff_layers_dict[layer_prefix_name + '/ffn/LayerNorm/gamma:0'] = \
        tf.cast(tf.concat([ var_dict['model/h%d/ln_2/g:0' % i] for i in range(num_layer) ], axis=0), dtype=data_type)

    vars_in_diff_layers_dict[layer_prefix_name + '/ffn/conv1d/kernel:0'] = \
        tf.cast(tf.concat([ var_dict['model/h%d/mlp/c_fc/w:0' % i] for i in range(num_layer) ], axis=0), dtype=data_type)
    vars_in_diff_layers_dict[layer_prefix_name + '/ffn/conv1d/bias:0'] = \
        tf.cast(tf.concat([ var_dict['model/h%d/mlp/c_fc/b:0' % i] for i in range(num_layer) ], axis=0), dtype=data_type)
    vars_in_diff_layers_dict[layer_prefix_name + '/ffn/conv1d_1/kernel:0'] = \
        tf.cast(tf.concat([ var_dict['model/h%d/mlp/c_proj/w:0' % i] for i in range(num_layer) ], axis=0), dtype=data_type)
    vars_in_diff_layers_dict[layer_prefix_name + '/ffn/conv1d_1/bias:0'] = \
        tf.cast(tf.concat([ var_dict['model/h%d/mlp/c_proj/b:0' % i] for i in range(num_layer) ], axis=0), dtype=data_type)

    return vars_in_diff_layers_dict

def ft_gpt_op(decoding_vars,
              decoding_args,
              batch_size,
              start_ids,
              min_start_length,
              max_start_length,
              attention_mask):
    """Run the decoding with sampling by FasterTransformer.

    Args:
        decoding_vars: A list of tf.Tensor. The variables for decoding; a list of model variables of the TensorFlow model.
        decoding_args: The arguments for decoding. The details are in the class "DecodingGpt2Argument" of common.py.
    Outputs:
        output_ids: A tf.Tensor with shape [batch_size, max(sequence_lengths)], with int type.
                    The result of decoding. It contains the ids of tokens in the vocabulary.
        sequence_lengths: A tf.Tensor with shape [batch_size], with int type.
    """
    decoder_args = decoding_args.decoder_args
    decoding_op_module = tf.load_op_library(os.path.join('./lib/libtf_gpt.so'))
    data_type = decoder_args.dtype

    vars_dict_in_differ_layers = preprocess_decoder_var(decoding_vars,
                                                        decoder_args.num_layer,
                                                        True,
                                                        None,
                                                        data_type,
                                                        False)
    if decoder_args.fuse_qkv == True:
        masked_multi_head_first_kernel = vars_dict_in_differ_layers['transformer/decoder/layer/masked_multi_head/conv1d/kernel:0']
        masked_multi_head_first_bias = vars_dict_in_differ_layers['transformer/decoder/layer/masked_multi_head/conv1d/bias:0']
    else:
        masked_multi_head_first_kernel = vars_dict_in_differ_layers['transformer/decoder/layer/masked_multi_head/query/kernel:0'] # 4
        masked_multi_head_first_bias = vars_dict_in_differ_layers['transformer/decoder/layer/masked_multi_head/query/bias:0'] # 5

    output_ids = decoding_op_module.decoding_gpt(
        vars_dict_in_differ_layers['transformer/decoder/layer/masked_multi_head/LayerNorm/beta:0'], # 0
        vars_dict_in_differ_layers['transformer/decoder/layer/masked_multi_head/LayerNorm/gamma:0'], # 1
        masked_multi_head_first_kernel,
        masked_multi_head_first_bias,
        vars_dict_in_differ_layers['transformer/decoder/layer/masked_multi_head/key/kernel:0'], # 4
        vars_dict_in_differ_layers['transformer/decoder/layer/masked_multi_head/key/bias:0'], # 5
        vars_dict_in_differ_layers['transformer/decoder/layer/masked_multi_head/value/kernel:0'], # 6
        vars_dict_in_differ_layers['transformer/decoder/layer/masked_multi_head/value/bias:0'], # 7
        vars_dict_in_differ_layers['transformer/decoder/layer/masked_multi_head/conv1d_1/kernel:0'], # 8
        vars_dict_in_differ_layers['transformer/decoder/layer/masked_multi_head/conv1d_1/bias:0'],  # 9
        vars_dict_in_differ_layers['transformer/decoder/layer/ffn/LayerNorm/beta:0'], # 10
        vars_dict_in_differ_layers['transformer/decoder/layer/ffn/LayerNorm/gamma:0'], # 11
        vars_dict_in_differ_layers['transformer/decoder/layer/ffn/conv1d/kernel:0'], # 12
        vars_dict_in_differ_layers['transformer/decoder/layer/ffn/conv1d/bias:0'], # 13
        vars_dict_in_differ_layers['transformer/decoder/layer/ffn/conv1d_1/kernel:0'], # 14
        vars_dict_in_differ_layers['transformer/decoder/layer/ffn/conv1d_1/bias:0'], # 15
        vars_dict_in_differ_layers['transformer/decoder/LayerNorm/beta:0'], # 16
        vars_dict_in_differ_layers['transformer/decoder/LayerNorm/gamma:0'], # 17
        vars_dict_in_differ_layers['model/wte:0'], # 18
        vars_dict_in_differ_layers['model/wte:0'], # 19
        vars_dict_in_differ_layers['model/wpe:0'], # 20
        attention_mask, # 21
        start_ids, # 22
        min_start_length, # 23
        max_start_length, # 24
        batch_size=batch_size,
        candidate_num=decoding_args.top_k,
        probability_threshold=decoding_args.top_p,
        max_seq_len=decoding_args.max_seq_len,
        head_num=decoder_args.head_num, 
        size_per_head=decoder_args.size_per_head,
        num_layer=decoder_args.num_layer,
        start_id=decoding_args.start_id, 
        end_id=decoding_args.end_id,
        temperature=decoding_args.temperature,
        is_fuse_qkv=decoder_args.fuse_qkv
    )

    output_ids = tf.transpose(output_ids, [1, 0])
    return output_ids

if __name__ == '__main__':
    fire.Fire(sample_model)

pytorch/gpt_sample.py

from __future__ import print_function

import os
import argparse
import timeit
import torch
import numpy as np
import utils.gpt_token_encoder as encoder
from torch.nn.utils.rnn import pad_sequence

from utils.gpt import GPT, GPTWeights

def main():
    parser = argparse.ArgumentParser()
    parser.add_argument('--layer_num', type=int, default=12,
                        help='number of layers')
    parser.add_argument('--output_len', type=int, default=7,
                        help='output sequence length to generate.')
    parser.add_argument('--head_num', type=int, default=12,
                        help='head number')
    parser.add_argument('--size_per_head', type=int, default=64,
                        help='size per head')
    parser.add_argument('--vocab_size', type=int, default=50257,
                        help='vocab size')
    parser.add_argument('--top_k', type=int, default=1,
                        help='top k candidate num')
    parser.add_argument('--top_p', type=float, default=0.,
                        help='top p probability threshold')
    parser.add_argument('--temperature', type=float, default=1.,
                        help='temperature')
    parser.add_argument('--is_fuse_QKV', type=bool, default=True,
                        help='whether or not to fuse QKV')
    parser.add_argument('--tensor_para_size', type=int, default=1,
                        help='tensor parallel size')
    parser.add_argument('--layer_para_size', type=int, default=1,
                        help='layer parallel size')
    parser.add_argument('--layer_para_batch_size', type=int, default=1,
                        help='local batch size for pipeline parallel')
    parser.add_argument('--ckpt_path', type=str, default='./models/c-model/124m/1-gpu',
                        help='path to the checkpoint file.')
    parser.add_argument('--lib_path', type=str, default='./lib/libpyt_fastertransformer.so',
                        help='path to the pyt_fastertransformer dynamic lib file.')
    parser.add_argument('--vocab_file', type=str, default="./models/gpt2-vocab.json",
                        help='vocabulary file.')
    parser.add_argument('--merges_file', type=str, default="./models/gpt2-merges.txt",
                        help='merges file.')
    parser.add_argument('--start_id', type=int, default=50256,
                        help='start token id.')
    parser.add_argument('--end_id', type=int, default=50256,
                        help='end token id.')
    parser.add_argument('--max_batch_size', type=int, default=1,
                        help='max batch size.')
    parser.add_argument('--max_seq_len', type=int, default=8,
                        help='max sequence length.')
    parser.add_argument('--fp16', action='store_true',
                        help='whether or not to run in fp16')
    parser.add_argument('--time', action='store_true',
                        help='whether or not to measure time elapsed.')
    parser.add_argument('--sample_input_file', type=str, default=None,
                        help='path to sample input file. If not set, it runs with no context inputs.')
    parser.add_argument('--sample_output_file', type=str, default=None,
                        help='path to sample output file.')

    args = parser.parse_args()

    layer_num = args.layer_num
    output_len = args.output_len
    head_num = args.head_num
    size_per_head = args.size_per_head
    vocab_size = args.vocab_size
    top_k = args.top_k
    top_p = args.top_p
    temperature = args.temperature
    is_fuse_QKV = args.is_fuse_QKV
    tensor_para_size = args.tensor_para_size
    layer_para_size = args.layer_para_size
    layer_para_batch_size = args.layer_para_batch_size
    start_id = args.start_id
    end_id = args.end_id
    max_batch_size = args.max_batch_size
    max_seq_len = args.max_seq_len

    print("\n=============== Arguments ===============")
    for arg in vars(args):
        print ("{}: {}".format(arg, getattr(args, arg)))
    print("=========================================\n")

    enc = encoder.get_encoder(args.vocab_file, args.merges_file)

    # Inputs
    contexts = []
    if args.sample_input_file:  # conditional case
        with open(args.sample_input_file, "r") as f:
            contexts = f.read().splitlines()
            batch_size = min(len(contexts), max_batch_size)
        contexts = contexts[:batch_size]
        start_ids = [torch.IntTensor(enc.encode(c)) for c in contexts]
    else:  # unconditional case
        batch_size = max_batch_size
        contexts = ['!'] * batch_size
        start_ids = [torch.IntTensor([0])] * batch_size

    print("[INFO] batch size: {}".format(batch_size))

    start_lengths = [len(ids) for ids in start_ids]
    input_len = min(start_lengths)

    start_ids = pad_sequence(start_ids, batch_first=True, padding_value=end_id)
    start_lengths = torch.IntTensor(start_lengths)
    attn_mask = torch.ones((batch_size, input_len, input_len)).tril()

    # Prepare model.
    gpt = GPT(head_num, size_per_head, vocab_size, start_id, end_id,
              layer_num, top_k, top_p, temperature, output_len, max_seq_len, 
              tensor_para_size, layer_para_size, layer_para_batch_size, 
              is_fuse_QKV, max_batch_size, lib_path=args.lib_path)
    gpt.load(ckpt_path=args.ckpt_path)
    if args.fp16:
        gpt.half()
    gpt.cuda()

    with torch.no_grad():
        # Generate tokens.
        tokens_batch = gpt(start_ids, start_lengths, attn_mask)
        generated_token = None
        if tokens_batch is not None:  # only a thread (rank 0) gets the output, while the others are supposed to return None.
            outputs = []
            tokens_batch = tokens_batch.cpu().numpy()
            for i, (context, tokens) in enumerate(zip(contexts, tokens_batch)):
                token = tokens[start_lengths[i]:]  # exclude context input from the output
                generated_token = token
                output = enc.decode(tokens[start_lengths[i]:])
                outputs.append(output)
                print("[INFO] batch {}: \n[Context]\n{}\n\n[Output]\n{}".format(i, context, output))

            if args.sample_output_file:
                with open(args.sample_output_file, "w+") as f:
                    outputs = [o.replace("\n","\\n") for o in outputs]
                    f.writelines("\n".join(outputs))

        # Measure inference time.
        if args.time:
            iterations = 10
            for i in range(iterations):
                tokens_batch = gpt(start_ids, start_lengths, attn_mask)

            time = timeit.default_timer()
            for i in range(iterations):
                tokens_batch = gpt(start_ids, start_lengths, attn_mask)
            time_elapsed = timeit.default_timer() - time
            print(f"generated token: {generated_token}")
            print("[INFO] GPT time costs: {:.2f} ms and number of generated tokens {}".format(time_elapsed*1000/iterations, len(generated_token)))

if __name__ == '__main__':
    main()
byshiue commented 3 years ago

From testing on my side, the TF op is a little faster than the PyTorch op. Please run more iterations; 10 iterations are too few, especially with batch size 1. I tested on V100 with the nvcr.io/nvidia/pytorch:20.12 and nvcr.io/nvidia/tensorflow:20.12-tf1-py3 docker images.

tf fp32: 20.79 ms
py fp32: 26.76 ms
tf fp16: 13.80 ms
py fp16: 15.47 ms

gyin94 commented 3 years ago

Can I ask whether you used the modified python files above to keep the parameters the same? The default sampling python scripts would not be appropriate for this test.

Can you also try it on T4? My current numbers come from a T4. Thanks.

byshiue commented 3 years ago

Performance on T4:

py fp32: 21.78 ms
tf fp32: 20.13 ms
py fp16: 17.74 ms
tf fp16: 15.61 ms

I use the scripts you provided above and run:

python pytorch/gpt_sample.py --time
python tensorflow/gpt_sample.py
python pytorch/gpt_sample.py --time --fp16
python tensorflow/gpt_sample.py --data_type=fp16
gyin94 commented 3 years ago

Can I ask how many iterations you use? Thanks

byshiue commented 3 years ago

100.

gyin94 commented 3 years ago

I couldn't reach that number even when I increased the iterations to 200; it stays at about 22 ms for the TF op with fp16. I am using the latest code on the main branch of FasterTransformer; which branch did you use for testing?

T4 variables:

NVIDIA-SMI 450.51.05    Driver Version: 450.51.05    CUDA Version: 11.1
byshiue commented 3 years ago

main branch

gyin94 commented 3 years ago

@byshiue do you mind sharing your nvidia-smi output (inside and outside the docker container, if possible)? I am just wondering whether the driver version, GPU memory size, or some other setting could be affecting this. Thanks.

gyin94 commented 3 years ago

I have also tried V100 and T4. Here is the summary. All experiments use the same docker image and model. It seems the speed difference might be due to the driver version or CUDA version? @byshiue

Fast means ~15 ms and Slow means ~22 ms.

Driver 460.73.01, CUDA 11.2: V100 Fast, T4 Slow
Driver 450.51.05, CUDA 11.1: T4 Slow
FT developer team (unknown driver and CUDA version): T4 Fast

byshiue commented 3 years ago

We run with driver 460.73.01. I don't understand the meaning of "V100: Fast, T4: Slow". If the PyTorch and TensorFlow performances differ, you can generate a profile with the profiling tools (nvprof, Nsight Systems) to check the difference.

gyin94 commented 3 years ago

I can get similar latency to yours on V100, but the number is different on T4: I can only get 22 ms on T4, while your experiment achieves 13 ms, close to V100. Here are the variables we keep the same:

Driver version: 460.73.01
NVIDIA Docker: 20.12-tf1-py3
Hardware: T4
FasterTransformer: main branch
Model and sampling tool: same

I can confirm the latency difference mainly comes from the step of generating the first token in PyTorch vs. TensorFlow; the incremental latency for the 2nd token, 3rd token, etc. is quite close. Something is not working properly during model initialization on T4 for the TensorFlow op.

T4-pytorch
tokens   ms
1        4.23
2        5.56
3        7.03
4        8.7
5        10.18
6        11.71
7        13.25
8        14.8

T4-tensorflow
tokens   ms
1        12.48
2        14.05
3        15.58
4        17.29
5        18.79
6        20.47
7        21.99
8        23.73

V100-Tensorflow
tokens   ms
1        5.39
2        6.36
3        7.31
4        8.31
5        9.32
6        10.34
7        11.33
8        12.27

Can I ask what CUDA driver version nvidia-smi reports for you? Are you using a 32GB T4 or a 16GB T4? Do you mind testing the tensorflow gpt_sample.py on your T4 machine with max_seq_len varied from 2 to 9 and seeing how the latency grows, especially at 2? Thanks.

byshiue commented 3 years ago

The driver reported by nvidia-smi is 460.73.01. The T4 is 16 GB.

gyin94 commented 3 years ago

Is the CUDA version reported by nvidia-smi 11.2, 11.1, or 11.3?

byshiue commented 3 years ago

11.2

byshiue commented 3 years ago

The problem size here is very small, so the performance may be unstable. In general, the PyTorch op performs better than the TF op when the batch size / sequence length are small, because TF needs to concatenate and copy all the weights every time, while the PyTorch op does this before the forward pass. It is normal for V100 to be faster than T4, since its TFLOPS is higher.

gyin94 commented 3 years ago

I am confused about why I couldn't reproduce your results on T4. What could I be missing?

By the way, regarding TF weight initialization, can I ask at which step you are unable to retrieve the weights during initialization? Here is an example of passing a tensor as an attr:

https://stackoverflow.com/questions/44167676/what-python-types-does-tensorflow-accept-for-attrs-of-type-tensor
https://github.com/tensorflow/tensorflow/blob/6ca5f397a7075a8d4a380a7fd0137702246221c9/tensorflow/core/framework/op_def_builder.cc#L175
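
For illustration, a hypothetical sketch of that approach on the C++ side (invented op and class names, not the FasterTransformer op): a "tensor"-typed attr is fixed at graph-construction time and, unlike an input, can be read in the kernel constructor. The trade-off is that the weight is then serialized into the GraphDef as a host-side constant.

#include "tensorflow/core/framework/op.h"
#include "tensorflow/core/framework/op_kernel.h"

// Hypothetical op: a weight passed as an attr instead of an input.
REGISTER_OP("DecodingGptWithWeightAttr")
    .Attr("weight: tensor")        // baked into the graph definition
    .Input("start_ids: int32")
    .Output("output_ids: int32");

class DecodingGptWithWeightAttrOp : public tensorflow::OpKernel {
 public:
  explicit DecodingGptWithWeightAttrOp(tensorflow::OpKernelConstruction *ctx)
      : OpKernel(ctx) {
    // Unlike inputs, attrs are visible in the constructor.
    OP_REQUIRES_OK(ctx, ctx->GetAttr("weight", &weight_));
  }

  void Compute(tensorflow::OpKernelContext *ctx) override {
    // ... run decoding using weight_ here; allocate the output as usual.
    tensorflow::Tensor *output = nullptr;
    OP_REQUIRES_OK(ctx, ctx->allocate_output(0, ctx->input(0).shape(), &output));
  }

 private:
  tensorflow::Tensor weight_;  // host copy of the constant from the graph def
};

REGISTER_KERNEL_BUILDER(
    Name("DecodingGptWithWeightAttr").Device(tensorflow::DEVICE_CPU),
    DecodingGptWithWeightAttrOp);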

gyin94 commented 3 years ago

@byshiue here is more data for your reference on the TF/PyTorch first-token latency comparison, which might be due to model weight initialization. If LightSeq or another paper uses the FasterTransformer TensorFlow version for comparison, this may be one of the reasons they found it to perform better than FasterTransformer at short sequence lengths.

Observation: the latency difference occurs at the first token; from the 2nd token on, the per-token latency is the same.

Per-token latency (V100 fp16), in ms:
Output token position   PyTorch latency   TF latency
1                       2.83              5.39
2                       1.01              0.97
3                       0.99              0.95
4                       1                 1
5                       0.98              1.01
6                       0.94              1.02
7                       0.98              0.99
8                       1.03              0.94

Cumulative latency for all output tokens, in ms:
Output tokens length    PyTorch latency   TF latency
1                       2.83              5.39
2                       3.84              6.36
3                       4.83              7.31
4                       5.83              8.31
5                       6.81              9.32
6                       7.75              10.34
7                       8.73              11.33
8                       9.76              12.27
byshiue commented 3 years ago

For larger batch sizes or longer sequence lengths, the effect is very small. We will try to solve this problem in the next version.