The problem is that d_attn_mask is declared as "T" rather than "const T". You can modify the declaration to "const T*" directly to fix it.
Thanks, that solves the problem. @byshiue, can I ask why we decided to initialize the transformer weights in Compute instead of the constructor in tf_op? The transformer weight initialization is in the constructor for th_op and the C++ interface.
Because we cannot get the TensorFlow weights during the constructor.
Do you mean we can't add params and decoding_params to the private variables section of the TF op and initialize them in the constructor, since the TF op can't retrieve and use params until the Compute section?
torch op:
private:
...
const int max_batch_size_;
DecoderInitParam<T> *param;
DecodingInitParam<T> decoding_params;
};
We cannot be sure that the weights are the same every time.
I am a little confused. params is the model weight; it will be the same as long as it is the same model. The FasterTransformer torch op and C++ interface do the weight initialization in the constructor instead of Compute. Why do we make the TF op different from them?
If I understand it correctly, what varies across requests and compute calls is batch_size, start_ids, and attention_mask, not the model weights (after initialization).
In TensorFlow, we cannot get the weights during the constructor, so we set the weights during Compute. Users can also change the weights between different compute calls if they need to, and the setting does not add overhead.
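For example, a rough sketch (sess, decoding_vars, and op_output are the names used in the sample script below; new_values is a hypothetical list of replacement weight arrays):

# Because the weights are ordinary op inputs, they are re-read on every
# sess.run; assigning new values to the variables between runs is picked up
# by the next decoding call without rebuilding the graph.
assign_ops = [var.assign(val) for var, val in zip(decoding_vars, new_values)]
sess.run(assign_ops)              # update the model weights in place
output_ids = sess.run(op_output)  # this run decodes with the updated weights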
@byshiue This is weird. The pytorch op is much faster than the tf op (1.6x). Check the two modified python files below to reproduce; they use exactly the same decoding parameters and input. (I also noticed this big latency difference between the context-enabled tf op and torch op.) Any idea or suggestion on this?
model: openai/gpt2/124M fp16=true, max_seq_len=8, output_len=7, start_ids=[0] and batch_size=1. latency: pytorch op 13.9ms, tf op 22.3ms
./bin/gpt_gemm 1 1 12 64 50257 8 1 1
tensorflow/gpt_sample.py
import fire
import json
import os
import numpy as np
import tensorflow as tf
from tensorflow.contrib.training import HParams
import sys
sys.path.append("../sample")
import pytorch.utils.gpt_token_encoder as encoder
from utils.common import TransformerArgument
from utils.common import DecodingGpt2Argument
from utils.common import time_test
from utils.encoder import build_sequence_mask
def sample_model(
model_name='124M',
nsamples=1,
batch_size=1,
max_seq_len=8,
temperature=1,
top_k=1,
top_p=0,
models_dir='models',
data_type='fp32',
time=True,
):
"""Run the sample_model.
:model_name=124M : String, which model to use
:nsamples=0 : Number of samples to return, if 0, continues to
generate samples indefinitely.
:batch_size=1 : Number of batches (only affects speed/memory).
:length=None : Number of tokens in generated text, if None (default), is
determined by model hyperparameters
:temperature=1 : Float value controlling randomness in Boltzmann
distribution. Lower temperature results in less random completions. As the
temperature approaches zero, the model will become deterministic and
repetitive. Higher temperature results in more random completions.
:top_k=4 : Integer value controlling diversity. 1 means only 1 word is
considered for each step (token), resulting in deterministic completions,
while 40 means 40 words are considered at each step. 0 (default) is a
special setting meaning no restrictions. 40 generally is a good value.
:models_dir : path to parent folder containing model subfolders
(i.e. contains the <model_name> folder)
"""
np.random.seed(1)
tf.set_random_seed(1)
if data_type == 'fp32':
tf_data_type = tf.float32
elif data_type == 'fp16':
tf_data_type = tf.float16
else:
assert False
vocab_file=os.path.join(models_dir, model_name, 'encoder.json')
bpe_file=os.path.join(models_dir, model_name, 'vocab.bpe')
enc = encoder.get_encoder(vocab_file, bpe_file)
hparams = HParams(n_vocab=0,
n_ctx=1024,
n_embd=768,
n_head=12,
n_layer=12)
with open(os.path.join(models_dir, model_name, 'hparams.json')) as f:
hparams.override_from_dict(json.load(f))
if max_seq_len is None:
max_seq_len = hparams.n_ctx
elif max_seq_len > hparams.n_ctx:
raise ValueError("Can't get samples longer than window size: %s" % hparams.n_ctx)
config = tf.ConfigProto()
config.gpu_options.allow_growth = True
with tf.Session(graph=tf.Graph(), config=config) as sess:
saver = tf.train.import_meta_graph("{}/{}/model.ckpt.meta".format(models_dir, model_name))
lengths = np.array([1]*batch_size)
min_start_length = lengths.min()
max_start_length = lengths.max()
attention_mask = np.tile(np.tri(min_start_length), (batch_size, 1, 1))
start_ids = np.ones([batch_size, max_start_length]) * enc.encoder['!']
print(f"start_ids: {start_ids}")
sess.run(tf.global_variables_initializer())
print("[INFO] restore the model {}/{}".format(models_dir, model_name))
saver.restore(sess, ("{}/{}/model.ckpt".format(models_dir, model_name)))
decoder_args = TransformerArgument(beam_width=1,
head_num=hparams.n_head,
size_per_head=hparams.n_embd // hparams.n_head,
num_layer=hparams.n_layer,
dtype=tf_data_type,
kernel_init_range=0.00,
bias_init_range=0.00)
decoding_args = DecodingGpt2Argument(hparams.n_vocab,
enc.encoder['<|endoftext|>'],
enc.encoder['<|endoftext|>'],
max_seq_len,
decoder_args,
top_k,
top_p,
temperature)
ckpt_dict = {}
for var in tf.trainable_variables():
ckpt_dict[var.name] = var
decoding_vars = tf.trainable_variables()
op_output = ft_gpt_op(decoding_vars,
decoding_args,
batch_size,
start_ids,
min_start_length,
max_start_length,
attention_mask)
generated = 0
num_tokens = 0
while nsamples == 0 or generated < nsamples:
op_out = sess.run(op_output)
for i in range(batch_size):
generated += 1
text = enc.decode(op_out[i])
num_tokens = len(op_out[i])
print(f"tokens: {op_out[i]}")
print("=" * 40 + " SAMPLE " + str(generated) + " " + "=" * 40)
print(text)
# Measure inference time.
if time:
time_cost = time_test(sess, op_output, iterations=10)
# the first token would always be 0 since the current gpt op doesn't take in context
print("[INFO] GPT time costs: {:.2f} ms and number of generated tokens: {}".format(time_cost, num_tokens-1))
def preprocess_decoder_var(decoding_vars,
num_layer,
using_model_var,
checkpoint_filename,
data_type,
fuse_qkv=True):
'''
Args:
decoding_vars: A list of tf.Tensor. The variables of decoding.
num_layer: An int value. The number of transformer layers of the decoder in decoding.
using_model_var: A bool value. Using the model variables of TensorFlow or not.
If True, then putting the model variables of TensorFlow decoding model into decoding op directly.
The data type is tensor of TensorFlow in this case.
If False, then restoring the values of variables from the checkpoint_filename, and putting
the values into decoding op.
The data type is numpy in this case.
checkpoint_filename: A string. The checkpoint file name storing the values of the model. The checkpoint should be stored as
a pickle file, and the name of the checkpoint should be xxx.pkl.
The model is saved as a dict.
The keys of the dict are the names of the variables;
the values of the dict are the values of the variables.
For example, decoding_vars[0]=<tf.Variable 'transformer/decoder/layer_0/masked_multi_head/LayerNorm/beta:0' shape=(512,) dtype=float32_ref>,
then the key is 'transformer/decoder/layer_0/masked_multi_head/LayerNorm/beta:0'; the value is sess.run(decoding_vars[0])
data_type: tf.float32 or tf.float16.
Only used when using_model_var is False. Convert the numpy data to the data type of model.
Outputs:
vars_in_diff_layers_dict: A dict to store the variables by their name.
For decoder variables, the key is like 'transformer/decoder/layer/masked_multi_head/LayerNorm/beta:0',
which is similar to the name of variables, except we use 'layer' but not 'layer_x'. The value is a list,
which contains 'transformer/decoder/layer_%d/masked_multi_head/LayerNorm/beta:0' % i for i in range(num_layer)
For other variables, the key is the name of the variable, and the value is the corresponding weight.
Note that we return the concatenated weights. The concat operation brings extra overhead, and this should be optimized in
the real application. The recommended method is to pre-process the weights into numpy format, because TensorFlow performs
these operations on every inference if TensorFlow is used to pre-process the weights.
'''
var_dict = {}
for var in decoding_vars:
var_dict[var.name] = var
vars_in_diff_layers_dict = {}
vars_in_diff_layers_dict["transformer/decoder/LayerNorm/beta:0"] = tf.cast(var_dict["model/ln_f/b:0"], dtype=data_type)
vars_in_diff_layers_dict["transformer/decoder/LayerNorm/gamma:0"] = tf.cast(var_dict["model/ln_f/g:0"], dtype=data_type)
vars_in_diff_layers_dict["model/wpe:0"] = tf.cast(var_dict["model/wpe:0"], dtype=data_type)
vars_in_diff_layers_dict["model/wte:0"] = tf.cast(var_dict["model/wte:0"], dtype=data_type)
for i in range(num_layer):
"""Handling the names of q, k, v kernel and bias because their names
are different for fusing the qkv or not."""
layer_prefix_name = "transformer/decoder/layer_%d/" % i
gpt2_layer_prefix_namx = "model/h%d/" % i
var_dict[layer_prefix_name + 'masked_multi_head/query/kernel:0'], \
var_dict[layer_prefix_name + 'masked_multi_head/key/kernel:0'], \
var_dict[layer_prefix_name + 'masked_multi_head/value/kernel:0'] = tf.split(var_dict[gpt2_layer_prefix_namx + 'attn/c_attn/w:0'], 3, axis=-1)
var_dict[layer_prefix_name + 'masked_multi_head/query/bias:0'], \
var_dict[layer_prefix_name + 'masked_multi_head/key/bias:0'], \
var_dict[layer_prefix_name + 'masked_multi_head/value/bias:0'] = tf.split(var_dict[gpt2_layer_prefix_namx + 'attn/c_attn/b:0'], 3, axis=-1)
layer_prefix_name = 'transformer/decoder/layer'
vars_in_diff_layers_dict[layer_prefix_name + '/masked_multi_head/LayerNorm/beta:0'] = \
tf.cast(tf.concat([ var_dict['model/h%d/ln_1/b:0' % i] for i in range(num_layer) ], axis=0), dtype=data_type)
vars_in_diff_layers_dict[layer_prefix_name + '/masked_multi_head/LayerNorm/gamma:0'] = \
tf.cast(tf.concat([ var_dict['model/h%d/ln_1/g:0' % i] for i in range(num_layer) ], axis=0), dtype=data_type)
vars_in_diff_layers_dict[layer_prefix_name + '/masked_multi_head/conv1d/kernel:0'] = \
tf.cast(tf.concat([ var_dict['model/h%d/attn/c_attn/w:0' % i] for i in range(num_layer) ], axis=0), dtype=data_type)
vars_in_diff_layers_dict[layer_prefix_name + '/masked_multi_head/conv1d/bias:0'] = \
tf.cast(tf.concat([ var_dict['model/h%d/attn/c_attn/b:0' % i] for i in range(num_layer) ], axis=0), dtype=data_type)
vars_in_diff_layers_dict[layer_prefix_name + '/masked_multi_head/query/kernel:0'] = \
tf.cast(tf.concat([ var_dict[layer_prefix_name + '_%d/masked_multi_head/query/kernel:0' % i] for i in range(num_layer) ], axis=0), dtype=data_type)
vars_in_diff_layers_dict[layer_prefix_name + '/masked_multi_head/query/bias:0'] = \
tf.cast(tf.concat([ var_dict[layer_prefix_name + '_%d/masked_multi_head/query/bias:0' % i] for i in range(num_layer) ], axis=0), dtype=data_type)
vars_in_diff_layers_dict[layer_prefix_name + '/masked_multi_head/key/kernel:0'] = \
tf.cast(tf.concat([ var_dict[layer_prefix_name + '_%d/masked_multi_head/key/kernel:0' % i] for i in range(num_layer) ], axis=0), dtype=data_type)
vars_in_diff_layers_dict[layer_prefix_name + '/masked_multi_head/key/bias:0'] = \
tf.cast(tf.concat([ var_dict[layer_prefix_name + '_%d/masked_multi_head/key/bias:0' % i] for i in range(num_layer) ], axis=0), dtype=data_type)
vars_in_diff_layers_dict[layer_prefix_name + '/masked_multi_head/value/kernel:0'] = \
tf.cast(tf.concat([ var_dict[layer_prefix_name + '_%d/masked_multi_head/value/kernel:0' % i] for i in range(num_layer) ], axis=0), dtype=data_type)
vars_in_diff_layers_dict[layer_prefix_name + '/masked_multi_head/value/bias:0'] = \
tf.cast(tf.concat([ var_dict[layer_prefix_name + '_%d/masked_multi_head/value/bias:0' % i] for i in range(num_layer) ], axis=0), dtype=data_type)
vars_in_diff_layers_dict[layer_prefix_name + '/masked_multi_head/conv1d_1/kernel:0'] = \
tf.cast(tf.concat([ var_dict['model/h%d/attn/c_proj/w:0' % i] for i in range(num_layer) ], axis=0), dtype=data_type)
vars_in_diff_layers_dict[layer_prefix_name + '/masked_multi_head/conv1d_1/bias:0'] = \
tf.cast(tf.concat([ var_dict['model/h%d/attn/c_proj/b:0' % i] for i in range(num_layer) ], axis=0), dtype=data_type)
vars_in_diff_layers_dict[layer_prefix_name + '/ffn/LayerNorm/beta:0'] = \
tf.cast(tf.concat([ var_dict['model/h%d/ln_2/b:0' % i] for i in range(num_layer) ], axis=0), dtype=data_type)
vars_in_diff_layers_dict[layer_prefix_name + '/ffn/LayerNorm/gamma:0'] = \
tf.cast(tf.concat([ var_dict['model/h%d/ln_2/g:0' % i] for i in range(num_layer) ], axis=0), dtype=data_type)
vars_in_diff_layers_dict[layer_prefix_name + '/ffn/conv1d/kernel:0'] = \
tf.cast(tf.concat([ var_dict['model/h%d/mlp/c_fc/w:0' % i] for i in range(num_layer) ], axis=0), dtype=data_type)
vars_in_diff_layers_dict[layer_prefix_name + '/ffn/conv1d/bias:0'] = \
tf.cast(tf.concat([ var_dict['model/h%d/mlp/c_fc/b:0' % i] for i in range(num_layer) ], axis=0), dtype=data_type)
vars_in_diff_layers_dict[layer_prefix_name + '/ffn/conv1d_1/kernel:0'] = \
tf.cast(tf.concat([ var_dict['model/h%d/mlp/c_proj/w:0' % i] for i in range(num_layer) ], axis=0), dtype=data_type)
vars_in_diff_layers_dict[layer_prefix_name + '/ffn/conv1d_1/bias:0'] = \
tf.cast(tf.concat([ var_dict['model/h%d/mlp/c_proj/b:0' % i] for i in range(num_layer) ], axis=0), dtype=data_type)
return vars_in_diff_layers_dict
def ft_gpt_op(decoding_vars,
decoding_args,
batch_size,
start_ids,
min_start_length,
max_start_length,
attention_mask):
"""Run the decoding with sampling by FasterTransformer.
Args:
decoding_vars: A list of tf.Tensor. The variables for decoding; a list of model variables of the TensorFlow model.
decoding_args: The arguments for decoding. The details are in the class "DecodingGpt2Argument" of common.py
Outputs:
output_ids: A tf.Tensor with shape [batch_size, max(sequence_lengths)], with int type.
The results of decoding. It contains the id of token of vocabulary.
sequence_lengths: A tf.Tensor with shape [batch_size], with int type.
"""
decoder_args = decoding_args.decoder_args
decoding_op_module = tf.load_op_library(os.path.join('./lib/libtf_gpt.so'))
data_type = decoder_args.dtype
vars_dict_in_differ_layers = preprocess_decoder_var(decoding_vars,
decoder_args.num_layer,
True,
None,
data_type,
False)
if decoder_args.fuse_qkv == True:
masked_multi_head_first_kernel = vars_dict_in_differ_layers['transformer/decoder/layer/masked_multi_head/conv1d/kernel:0']
masked_multi_head_first_bias = vars_dict_in_differ_layers['transformer/decoder/layer/masked_multi_head/conv1d/bias:0']
else:
masked_multi_head_first_kernel = vars_dict_in_differ_layers['transformer/decoder/layer/masked_multi_head/query/kernel:0'] # 4 (no trailing comma, otherwise this becomes a tuple)
masked_multi_head_first_bias = vars_dict_in_differ_layers['transformer/decoder/layer/masked_multi_head/query/bias:0'] # 5
output_ids = decoding_op_module.decoding_gpt(
vars_dict_in_differ_layers['transformer/decoder/layer/masked_multi_head/LayerNorm/beta:0'], # 0
vars_dict_in_differ_layers['transformer/decoder/layer/masked_multi_head/LayerNorm/gamma:0'], # 1
masked_multi_head_first_kernel,
masked_multi_head_first_bias,
vars_dict_in_differ_layers['transformer/decoder/layer/masked_multi_head/key/kernel:0'], # 4
vars_dict_in_differ_layers['transformer/decoder/layer/masked_multi_head/key/bias:0'], # 5
vars_dict_in_differ_layers['transformer/decoder/layer/masked_multi_head/value/kernel:0'], # 6
vars_dict_in_differ_layers['transformer/decoder/layer/masked_multi_head/value/bias:0'], # 7
vars_dict_in_differ_layers['transformer/decoder/layer/masked_multi_head/conv1d_1/kernel:0'], # 8
vars_dict_in_differ_layers['transformer/decoder/layer/masked_multi_head/conv1d_1/bias:0'], # 9
vars_dict_in_differ_layers['transformer/decoder/layer/ffn/LayerNorm/beta:0'], # 10
vars_dict_in_differ_layers['transformer/decoder/layer/ffn/LayerNorm/gamma:0'], # 11
vars_dict_in_differ_layers['transformer/decoder/layer/ffn/conv1d/kernel:0'], # 12
vars_dict_in_differ_layers['transformer/decoder/layer/ffn/conv1d/bias:0'], # 13
vars_dict_in_differ_layers['transformer/decoder/layer/ffn/conv1d_1/kernel:0'], # 14
vars_dict_in_differ_layers['transformer/decoder/layer/ffn/conv1d_1/bias:0'], # 15
vars_dict_in_differ_layers['transformer/decoder/LayerNorm/beta:0'], # 16
vars_dict_in_differ_layers['transformer/decoder/LayerNorm/gamma:0'], # 17
vars_dict_in_differ_layers['model/wte:0'], # 18
vars_dict_in_differ_layers['model/wte:0'], # 19
vars_dict_in_differ_layers['model/wpe:0'], # 20
attention_mask, # 21
start_ids, # 22
min_start_length, # 23
max_start_length, # 24
batch_size=batch_size,
candidate_num=decoding_args.top_k,
probability_threshold=decoding_args.top_p,
max_seq_len=decoding_args.max_seq_len,
head_num=decoder_args.head_num,
size_per_head=decoder_args.size_per_head,
num_layer=decoder_args.num_layer,
start_id=decoding_args.start_id,
end_id=decoding_args.end_id,
temperature=decoding_args.temperature,
is_fuse_qkv=decoder_args.fuse_qkv
)
output_ids = tf.transpose(output_ids, [1, 0])
return output_ids
if __name__ == '__main__':
fire.Fire(sample_model)
pytorch/gpt_sample.py
from __future__ import print_function
import os
import argparse
import timeit
import torch
import numpy as np
import utils.gpt_token_encoder as encoder
from torch.nn.utils.rnn import pad_sequence
from utils.gpt import GPT, GPTWeights
def main():
parser = argparse.ArgumentParser()
parser.add_argument('--layer_num', type=int, default=12,
help='number of layers')
parser.add_argument('--output_len', type=int, default=7,
help='output sequence length to generate.')
parser.add_argument('--head_num', type=int, default=12,
help='head number')
parser.add_argument('--size_per_head', type=int, default=64,
help='size per head')
parser.add_argument('--vocab_size', type=int, default=50257,
help='vocab size')
parser.add_argument('--top_k', type=int, default=1,
help='top k candidate num')
parser.add_argument('--top_p', type=float, default=0.,
help='top p probability threshold')
parser.add_argument('--temperature', type=float, default=1.,
help='temperature')
parser.add_argument('--is_fuse_QKV', type=bool, default=True,
help='whether or not to fuse QKV')
parser.add_argument('--tensor_para_size', type=int, default=1,
help='tensor parallel size')
parser.add_argument('--layer_para_size', type=int, default=1,
help='layer parallel size')
parser.add_argument('--layer_para_batch_size', type=int, default=1,
help='local batch size for pipeline parallel')
parser.add_argument('--ckpt_path', type=str, default='./models/c-model/124m/1-gpu',
help='path to the checkpoint file.')
parser.add_argument('--lib_path', type=str, default='./lib/libpyt_fastertransformer.so',
help='path to the pyt_fastertransformer dynamic lib file.')
parser.add_argument('--vocab_file', type=str, default="./models/gpt2-vocab.json",
help='vocabulary file.')
parser.add_argument('--merges_file', type=str, default="./models/gpt2-merges.txt",
help='merges file.')
parser.add_argument('--start_id', type=int, default=50256,
help='start token id.')
parser.add_argument('--end_id', type=int, default=50256,
help='end token id.')
parser.add_argument('--max_batch_size', type=int, default=1,
help='max batch size.')
parser.add_argument('--max_seq_len', type=int, default=8,
help='max sequence length.')
parser.add_argument('--fp16', action='store_true',
help='whether or not to run in fp16')
parser.add_argument('--time', action='store_true',
help='whether or not to measure time elapsed.')
parser.add_argument('--sample_input_file', type=str, default=None,
help='path to sample input file. If not set, it runs with no context inputs.')
parser.add_argument('--sample_output_file', type=str, default=None,
help='path to sample output file.')
args = parser.parse_args()
layer_num = args.layer_num
output_len = args.output_len
head_num = args.head_num
size_per_head = args.size_per_head
vocab_size = args.vocab_size
top_k = args.top_k
top_p = args.top_p
temperature = args.temperature
is_fuse_QKV = args.is_fuse_QKV
tensor_para_size = args.tensor_para_size
layer_para_size = args.layer_para_size
layer_para_batch_size = args.layer_para_batch_size
start_id = args.start_id
end_id = args.end_id
max_batch_size = args.max_batch_size
max_seq_len = args.max_seq_len
print("\n=============== Arguments ===============")
for arg in vars(args):
print ("{}: {}".format(arg, getattr(args, arg)))
print("=========================================\n")
enc = encoder.get_encoder(args.vocab_file, args.merges_file)
# Inputs
contexts = []
if args.sample_input_file: # conditional case
with open(args.sample_input_file, "r") as f:
contexts = f.read().splitlines()
batch_size = min(len(contexts), max_batch_size)
contexts = contexts[:batch_size]
start_ids = [torch.IntTensor(enc.encode(c)) for c in contexts]
else: # unconditional case
batch_size = max_batch_size
contexts = ['!'] * batch_size
start_ids = [torch.IntTensor([0])] * batch_size
print("[INFO] batch size: {}".format(batch_size))
start_lengths = [len(ids) for ids in start_ids]
input_len = min(start_lengths)
start_ids = pad_sequence(start_ids, batch_first=True, padding_value=end_id)
start_lengths = torch.IntTensor(start_lengths)
attn_mask = torch.ones((batch_size, input_len, input_len)).tril()
# Prepare model.
gpt = GPT(head_num, size_per_head, vocab_size, start_id, end_id,
layer_num, top_k, top_p, temperature, output_len, max_seq_len,
tensor_para_size, layer_para_size, layer_para_batch_size,
is_fuse_QKV, max_batch_size, lib_path=args.lib_path)
gpt.load(ckpt_path=args.ckpt_path)
if args.fp16:
gpt.half()
gpt.cuda()
with torch.no_grad():
# Generate tokens.
tokens_batch = gpt(start_ids, start_lengths, attn_mask)
generated_token = None
if tokens_batch is not None: # only a thread (rank 0) gets the output, while the others are supposed to return None.
outputs = []
tokens_batch = tokens_batch.cpu().numpy()
for i, (context, tokens) in enumerate(zip(contexts, tokens_batch)):
token = tokens[start_lengths[i]:] # exclude context input from the output
generated_token = token
output = enc.decode(tokens[start_lengths[i]:])
outputs.append(output)
print("[INFO] batch {}: \n[Context]\n{}\n\n[Output]\n{}".format(i, context, output))
if args.sample_output_file:
with open(args.sample_output_file, "w+") as f:
outputs = [o.replace("\n","\\n") for o in outputs]
f.writelines("\n".join(outputs))
# Measure inference time.
if args.time:
iterations = 10
for i in range(iterations):
tokens_batch = gpt(start_ids, start_lengths, attn_mask)
time = timeit.default_timer()
for i in range(iterations):
tokens_batch = gpt(start_ids, start_lengths, attn_mask)
time_elapsed = timeit.default_timer() - time
print(f"generated token: {generated_token}")
print("[INFO] GPT time costs: {:.2f} ms and number of generated tokens {}".format(time_elapsed*1000/iterations, len(generated_token)))
if __name__ == '__main__':
main()
From testing on my side, the tf op is a little faster than the pytorch op. Please run with more iterations; 10 iterations are too few, especially with batch size 1. I tested on V100 with the nvcr.io/nvidia/pytorch:20.12 and nvcr.io/nvidia/tensorflow:20.12-tf1-py3 docker images. tf fp32 time: 20.79 ms, py fp32 time: 26.76 ms, tf fp16 time: 13.80 ms, py fp16 time: 15.47 ms.
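For instance, a small timing helper of my own sketch (the run_once callables reference objects from the two scripts above):

import timeit

def avg_latency_ms(run_once, iterations=100, warmup=10):
    # Warm up first, then average the latency over many iterations.
    for _ in range(warmup):
        run_once()
    start = timeit.default_timer()
    for _ in range(iterations):
        run_once()
    return (timeit.default_timer() - start) * 1000.0 / iterations

# e.g. avg_latency_ms(lambda: sess.run(op_output))                        # tf op
#      avg_latency_ms(lambda: gpt(start_ids, start_lengths, attn_mask))   # pytorch op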
Can I ask whether you used the modified python files to keep the parameters the same? The default sampling python scripts are not appropriate for this test.
Could you also try it on a T4? My current numbers come from a T4. Thanks.
Performance on T4 (ms):
py fp32: 21.78
tf fp32: 20.13
py fp16: 17.74
tf fp16: 15.61
I used the scripts you provided above and ran them with:
python pytorch/gpt_sample.py --time
python tensorflow/gpt_sample.py
python pytorch/gpt_sample.py --time --fp16
python tensorflow/gpt_sample.py --data_type=fp16
Can I ask how many iterations you use? Thanks
100.
I couldn't achieve that number even when I increased the iterations to 200; it stays at 22ms for the tf op in fp16. I am using the latest main-branch code of FasterTransformer and wonder which branch you used for testing.
T4 variables:
NVIDIA-SMI 450.51.05 Driver Version: 450.51.05 CUDA Version: 11.1
main branch
@byshiue do you mind sharing your nvidia-smi output (inside and outside the docker container if possible)? I am just wondering whether the driver version, GPU memory size, or other settings could be affecting this. Thanks.
I have also tried V100 and T4. Here is the summary. All experiments use the same docker image and model. It seems like the speed difference might be due to driver version or CUDA Version? @byshiue
Fast means ~15ms and Slow means ~22ms
Driver 460.73.01, CUDA 11.2: V100 Fast, T4 Slow
Driver 450.51.05, CUDA 11.1: T4 Slow
FT developer team (unknown Driver and CUDA version): T4 Fast
We run on Driver 460.73.01. I don't understand the meaning of "V100: Fast, T4: Slow". If the performance of pytorch and tensorflow differs, you can generate a profile with the profiling tools (nvprof, Nsight Systems) to check where the difference comes from.
I can get latency similar to yours on V100, but the latency number is different on T4. I can only get 22ms on T4, while your experiment achieves 13ms, close to V100. Here are the variables we keep the same:
Driver version: 460.73.01; NVIDIA Docker: 20.12-tf1-py3; Hardware: T4; FasterTransformer: main branch; Model and sampling tool: same.
I can confirm the latency difference between pytorch and tensorflow mainly comes from generating the first token; the incremental latency for the 2nd token, 3rd token, etc. is quite close. Something is not working properly during model initialization on T4 for the tensorflow op.
T4-pytorch

| tokens | ms |
|---|---|
| 1 | 4.23 |
| 2 | 5.56 |
| 3 | 7.03 |
| 4 | 8.7 |
| 5 | 10.18 |
| 6 | 11.71 |
| 7 | 13.25 |
| 8 | 14.8 |
T4-tensorflow

| tokens | ms |
|---|---|
| 1 | 12.48 |
| 2 | 14.05 |
| 3 | 15.58 |
| 4 | 17.29 |
| 5 | 18.79 |
| 6 | 20.47 |
| 7 | 21.99 |
| 8 | 23.73 |
V100-Tensorflow

| tokens | ms |
|---|---|
| 1 | 5.39 |
| 2 | 6.36 |
| 3 | 7.31 |
| 4 | 8.31 |
| 5 | 9.32 |
| 6 | 10.34 |
| 7 | 11.33 |
| 8 | 12.27 |
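For reference, these rows come from re-running the sample with increasing sequence lengths; a rough sketch of the sweep (a hypothetical driver loop around the scripts above):

import subprocess

# Sweep max_seq_len from 2 to 9 so that 1..8 tokens are generated after the
# start token, and read the reported latency of each run from its output.
for seq_len in range(2, 10):
    subprocess.run(["python", "tensorflow/gpt_sample.py",
                    "--max_seq_len", str(seq_len), "--data_type", "fp16"])
# (for pytorch/gpt_sample.py the analogous sweep is over --output_len / --max_seq_len)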
Can I ask what your CUDA driver version is from the nvidia-smi command? Are you using a 32GB T4 or a 16GB T4? Would you mind testing the tensorflow gpt_sample.py on your T4 machine with max_seq_len changed from 2 to 9 and seeing how the latency goes up, especially at 2? Thanks.
CUDA Driver is 460.73.01 on nvidia-smi command. T4 is 16 GB.
Is the CUDA Version 11.2, 11.1, or 11.3 from nvidia-smi?
11.2
The problem size is too small, so the performance may be unstable. In general, the performance of pytorch is better than tf when the batch size / seqlen is small, because tf needs to concat and copy all the weights every time, while the pytorch op does it before forward. It is normal that V100 is faster than T4, because its TFLOPS is higher.
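A minimal sketch of avoiding that repeated work, following the note in the preprocess_decoder_var docstring above (the one-time sess.run materialization is my own addition; the decoding op would then have to be fed these constants instead of the in-graph concat results):

# Materialize the concatenated/cast weights once as numpy, then wrap them in
# tf.constant so the concat/cast work is not repeated on every sess.run.
vars_dict = preprocess_decoder_var(decoding_vars, hparams.n_layer, True, None, tf_data_type, False)
np_weights = sess.run(vars_dict)                                    # one-time concat + cast
const_weights = {name: tf.constant(w) for name, w in np_weights.items()}
# const_weights[...] could then be passed to decoding_op_module.decoding_gpt
# in place of vars_dict_in_differ_layers[...].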
I am confused about why I couldn't reproduce your results on T4. What could I be missing?
Btw, in terms of tf weight initialization, can I ask what your concern is, i.e. at which step we couldn't retrieve the weights during initialization? Here is an example of adding a tensor as an attr.
https://stackoverflow.com/questions/44167676/what-python-types-does-tensorflow-accept-for-attrs-of-type-tensor https://github.com/tensorflow/tensorflow/blob/6ca5f397a7075a8d4a380a7fd0137702246221c9/tensorflow/core/framework/op_def_builder.cc#L175
@byshiue here is more data for your reference on the TF/PyTorch first-token latency comparison, which might be due to model weight initialization. If LightSeq or another paper uses the FasterTransformer TensorFlow version for comparison, this may be one of the reasons they found better performance than FasterTransformer at short sequence lengths.
Observation: the latency difference happens at the first token, and the per-token latency stays the same from the 2nd token onward.
(V100 fp16) Latency in ms

| Output token position | Pytorch Latency | TF Latency |
|---|---|---|
| 1 | 2.83 | 5.39 |
| 2 | 1.01 | 0.97 |
| 3 | 0.99 | 0.95 |
| 4 | 1 | 1 |
| 5 | 0.98 | 1.01 |
| 6 | 0.94 | 1.02 |
| 7 | 0.98 | 0.99 |
| 8 | 1.03 | 0.94 |
Latency for all output tokens (cumulative, ms)

| Output Tokens Length | Pytorch Latency | TF Latency |
|---|---|---|
| 1 | 2.83 | 5.39 |
| 2 | 3.84 | 6.36 |
| 3 | 4.83 | 7.31 |
| 4 | 5.83 | 8.31 |
| 5 | 6.81 | 9.32 |
| 6 | 7.75 | 10.34 |
| 7 | 8.73 | 11.33 |
| 8 | 9.76 | 12.27 |
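The per-position numbers in the first table are just the differences of the cumulative numbers in the second; for example:

import numpy as np

# Cumulative latency (ms) for 1..8 generated tokens, taken from the table above.
pytorch_cum = [2.83, 3.84, 4.83, 5.83, 6.81, 7.75, 8.73, 9.76]
tf_cum = [5.39, 6.36, 7.31, 8.31, 9.32, 10.34, 11.33, 12.27]

# Incremental latency per token position; the first entry is the first-token cost.
print(np.diff(pytorch_cum, prepend=0.0))  # ~[2.83, 1.01, 0.99, 1.00, 0.98, 0.94, 0.98, 1.03]
print(np.diff(tf_cum, prepend=0.0))       # ~[5.39, 0.97, 0.95, 1.00, 1.01, 1.02, 0.99, 0.94]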
For larger batch size or longer sequence length, the effect is very small. We will try to solve this problem in the next version.
I am trying to change gpt_op.cc to be similar to gpt.h in the torch op, so that it also takes the start_ids and attention_mask. But I got the following error. Any idea or suggestion?
error: