NVIDIA / TensorRT

NVIDIA® TensorRT™ is an SDK for high-performance deep learning inference on NVIDIA GPUs. This repository contains the open source components of TensorRT.
https://developer.nvidia.com/tensorrt
Apache License 2.0

How to use an ITensor object as a network.add_slice() parameter? #1057

Closed pogevip closed 3 years ago

pogevip commented 3 years ago

I am trying to build a BERT model for extracting sentence representations, and I use mean_token_embedding.

I want to get the sentence length from the input "input_ids" and use it as a network.add_slice() parameter.

ttyio commented 3 years ago

Hello @pogevip, thanks for reporting. Could you try add_shape and add_gather? The C++ implementation looks like https://github.com/onnx/onnx-tensorrt/blob/master/onnx2trt_utils.cpp#L809; the Python API should be similar.
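
A rough sketch of that pattern with the Python API (untested, just to illustrate the idea; the helper name get_dim_as_tensor is made up, and `network` / `input_ids` / `bert_out` are assumed to exist in your builder script):

import numpy as np

def get_dim_as_tensor(network, tensor, axis):
    # full shape of `tensor` as a 1D INT32 tensor, e.g. [batch, seq_len]
    shape = network.add_shape(tensor).get_output(0)
    # gather out the single dimension we care about
    index = network.add_constant((1,), np.array([axis], dtype=np.int32)).get_output(0)
    return network.add_gather(shape, index, 0).get_output(0)

# The resulting ITensor can then be fed to a slice layer at build time via
# set_input (input index 2 is the "size" input of ISliceLayer), e.g.:
#   seq_len = get_dim_as_tensor(network, input_ids, 1)
#   slice_layer = network.add_slice(bert_out, (0, 0), (1, 1), (1, 1))
#   slice_layer.set_input(2, size_tensor)  # size_tensor assembled from seq_len etc.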

pogevip commented 3 years ago

Sorry, my description was not clear.

  1. I want to take BERT's input, input_ids.
  2. Sum over input_ids to get the valid sequence length A.
  3. Then use add_slice to take the first A rows of bert_out, average them, and output that as the sentence representation.

For example, say the maximum sequence length is 128 and I now input "你好啊", whose valid length is 3. I want to compute the valid sequence length 3 from input_ids and then use it as the add_slice parameter to cut out the first three rows and compute mean_token_embedding.

It seems add_shape and add_gather cannot do this??
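
In plain NumPy terms, the computation I want to express inside the network is roughly this (illustration only, shapes simplified, not TensorRT code; using the mask here, but counting the non-padding input_ids gives the same length):

import numpy as np

max_seq_len, hidden = 128, 768
# "你好啊" -> 3 valid tokens, the rest is padding
input_mask = np.array([1, 1, 1] + [0] * (max_seq_len - 3), dtype=np.int32)
bert_out = np.random.randn(max_seq_len, hidden).astype(np.float32)

valid_len = int(input_mask.sum())                  # the data-dependent length A
sentence_repr = bert_out[:valid_len].mean(axis=0)  # mean_token_embedding over the first A rows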

ttyio commented 3 years ago

@pogevip, Hmm, TensorRT currently has no data-dependent dynamic shape support. For your question, we have a variable-length BERT in https://github.com/NVIDIA/TensorRT/tree/master/demo/BERT#variable-sequence-length; we use an extra input cu_seqlens to record the valid lengths in each input. I will use an example to try to explain the design, hope it helps:

e.g., for input:
Sample #1: AAA
Sample #2: BB
Sample #3: CCCC

the fixed-length input with padding looks like this: AAAX BBXX CCCC

the mask looks like this:

1 1 1 0   1 1 0 0   1 1 1 1

and the variable-length input looks like this: AAABBCCCC

cu_seqlens looks like this: 0 3 5 9
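
A small NumPy illustration of that packing (not the demo's actual code, just the idea):

import numpy as np

# token ids for Sample #1 (AAA), Sample #2 (BB), Sample #3 (CCCC)
seqs = [[1, 1, 1], [2, 2], [3, 3, 3, 3]]

# variable-length input: all samples concatenated, no padding
packed = np.concatenate([np.asarray(s, dtype=np.int32) for s in seqs])

# cu_seqlens: cumulative sequence lengths with a leading 0
cu_seqlens = np.zeros(len(seqs) + 1, dtype=np.int32)
cu_seqlens[1:] = np.cumsum([len(s) for s in seqs])

print(packed)      # [1 1 1 2 2 3 3 3 3]  -> "AAABBCCCC"
print(cu_seqlens)  # [0 3 5 9]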

pogevip commented 3 years ago
root@d28c658d236:/workspace/TensorRT/demo/BERT# python builder_varseqlen.py 
[TensorRT] INFO: Loading/transforming 199 weights
[TensorRT] ERROR: (Unnamed Layer* 0) [PluginV2DynamicExt]: could not find any supported formats consistent with input/output data types
[TensorRT] ERROR: ../builder/cudnnBuilderGraphNodes.cpp (872) - Misc Error in reportPluginError: 0 (could not find any supported formats consistent with input/output data types)
[TensorRT] ERROR: ../builder/cudnnBuilderGraphNodes.cpp (872) - Misc Error in reportPluginError: 0 (could not find any supported formats consistent with input/output data types)

@ttyio I followed your suggestion and used builder_varseqlen.py. I wanted to build the engine with FP32, so I set fp16=False and got the error above; then I tried fp16=True and got the error below:

root@d28c658d236:/workspace/TensorRT/demo/BERT# python builder_varseqlen.py 
[TensorRT] INFO: Using configuration file: /models/config.json
[TensorRT] INFO: Loading/transforming 199 weights
[TensorRT] INFO: Some tactics do not have sufficient workspace memory to run. Increasing workspace size may increase performance, please check verbose output.
[TensorRT] INFO: Detected 4 inputs and 1 output network tensors.
CUDA Error: CUDA_ERROR_NOT_INITIALIZED /home/jenkins/workspace/OSS/L0_MergeRequest/oss/plugin/bertQKVToContextPlugin/fused_multihead_attention.h 232
CUDA Error: CUDA_ERROR_NOT_INITIALIZED /home/jenkins/workspace/OSS/L0_MergeRequest/oss/plugin/bertQKVToContextPlugin/fused_multihead_attention.h 238
CUDA Error: CUDA_ERROR_NOT_INITIALIZED /home/jenkins/workspace/OSS/L0_MergeRequest/oss/plugin/bertQKVToContextPlugin/fused_multihead_attention.h 232
CUDA Error: CUDA_ERROR_NOT_INITIALIZED /home/jenkins/workspace/OSS/L0_MergeRequest/oss/plugin/bertQKVToContextPlugin/fused_multihead_attention.h 238
CUDA Error: CUDA_ERROR_NOT_INITIALIZED /home/jenkins/workspace/OSS/L0_MergeRequest/oss/plugin/bertQKVToContextPlugin/fused_multihead_attention.h 232
CUDA Error: CUDA_ERROR_NOT_INITIALIZED /home/jenkins/workspace/OSS/L0_MergeRequest/oss/plugin/bertQKVToContextPlugin/fused_multihead_attention.h 238
CUDA Error: CUDA_ERROR_NOT_INITIALIZED /home/jenkins/workspace/OSS/L0_MergeRequest/oss/plugin/bertQKVToContextPlugin/fused_multihead_attention.h 238
CUDA Error: CUDA_ERROR_NOT_INITIALIZED /home/jenkins/workspace/OSS/L0_MergeRequest/oss/plugin/bertQKVToContextPlugin/fused_multihead_attention.h 232
CUDA Error: CUDA_ERROR_NOT_INITIALIZED /home/jenkins/workspace/OSS/L0_MergeRequest/oss/plugin/bertQKVToContextPlugin/fused_multihead_attention.h 238
CUDA Error: CUDA_ERROR_NOT_INITIALIZED /home/jenkins/workspace/OSS/L0_MergeRequest/oss/plugin/bertQKVToContextPlugin/fused_multihead_attention.h 238
CUDA Error: CUDA_ERROR_NOT_INITIALIZED /home/jenkins/workspace/OSS/L0_MergeRequest/oss/plugin/bertQKVToContextPlugin/fused_multihead_attention.h 232
CUDA Error: CUDA_ERROR_NOT_INITIALIZED /home/jenkins/workspace/OSS/L0_MergeRequest/oss/plugin/bertQKVToContextPlugin/fused_multihead_attention.h 238
CUDA Error: CUDA_ERROR_NOT_INITIALIZED /home/jenkins/workspace/OSS/L0_MergeRequest/oss/plugin/bertQKVToContextPlugin/fused_multihead_attention.h 243
CUDA Error: CUDA_ERROR_NOT_INITIALIZED /home/jenkins/workspace/OSS/L0_MergeRequest/oss/plugin/bertQKVToContextPlugin/fused_multihead_attention.h 238
CUDA Error: CUDA_ERROR_NOT_INITIALIZED /home/jenkins/workspace/OSS/L0_MergeRequest/oss/plugin/bertQKVToContextPlugin/fused_multihead_attention.h 243
[TensorRT] INFO: build engine in 15.827 Sec
[TensorRT] INFO: Saving Engine to /workspace/rt_demo/demo/BERT/output/pytorch_model.engine
[TensorRT] INFO: Done.

Based on this error I increased max_workspace_size, from 2G to 5G and all the way up to max_workspace_size = 10G, before the build could succeed.
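
(The only change I made in the builder script was the workspace budget, roughly like this, assuming the config object in builder_varseqlen.py is called builder_config:)

# hypothetical one-line change in builder_varseqlen.py (TensorRT 7.x API)
builder_config.max_workspace_size = 10 * (1 << 30)  # 10 GiB instead of the 1 GiB default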

root@d28c658d236:/workspace/rt_demo/demo/BERT# python builder_val.py
[TensorRT] INFO: Using configuration file: /models/config.json
[TensorRT] INFO: Loading/transforming 199 weights
[TensorRT] INFO: Detected 4 inputs and 1 output network tensors.
[TensorRT] INFO: build engine in 42.014 Sec
[TensorRT] INFO: Saving Engine to /workspace/TensorRT/demo/BERT/output/pytorch_model.engine
[TensorRT] INFO: Done.

I would like to ask whether this is normal, given that the default max_workspace_size is only 1G. I also see GPU utilization close to 100%.

Also, how can the FP32 error be resolved?

This looks similar to this issue: https://github.com/NVIDIA/TensorRT/issues/726, but neither nvcr.io/nvidia/tensorrt:20.10-py3 nor nvcr.io/nvidia/tensorrt:20.11-py3 solved the problem.

I am using TensorRT package version '7.2.1.6'; the BERT demo is from the 20.10 branch, also version 7.2.1.6. GPU: RTX 6000, CUDA 11.1, GPU driver 455.

pogevip commented 3 years ago

@ttyio Aha, I see this now: FP32 is not supported, and only Xavier+ GPUs are supported...

Note this is an experimental feature because we only support Xavier+ GPUs, also there is neither FP32 support nor INT8 PTQ calibration.

pogevip commented 3 years ago

@ttyio Happy Spring Festival! It seems that FasterTransformer 3.1, used as a TensorRT plugin, can also remove padding.

ttyio commented 3 years ago

Happy New Year @pogevip ~

For the large workspace issue: it seems that CUDA_ERROR_NOT_INITIALIZED is returned when calling cuModuleLoadData (loading the SASS code to the GPU) in https://github.com/NVIDIA/TensorRT/blob/release/7.2/plugin/bertQKVToContextPlugin/fused_multihead_attention.h#L232. This piece of code has a known issue when running with multiple GPUs; it is fixed internally and will be released publicly later. I know that for the fixed-length version of BERT, when you use a non-optimal sequence length config, we fall back to the non-fused multi-head kernel, which requires a larger workspace. I cannot tell how the workspace relates to this error in the variable sequence length kernel. Could you try again to see how this CUDA_ERROR_NOT_INITIALIZED was fixed? Did you only change the workspace size? Thanks!

And "Xavier+ GPU" means the GPU has SM version >= 72, so you can run on your RTX 6000, which is Turing (sm75).
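
If you want to double-check the SM version, a quick pycuda check works (pycuda is already used by the demo scripts):

import pycuda.driver as cuda
import pycuda.autoinit

# (major, minor) compute capability; anything >= (7, 2) counts as "Xavier+"
major, minor = cuda.Device(0).compute_capability()
print("sm%d%d" % (major, minor))  # RTX 6000 (Turing) should print sm75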

For FasterTransformer 3.1, I am not familiar with it; glad to know it works for you~

pogevip commented 3 years ago

@ttyio

[TensorRT] INFO: Some tactics do not have sufficient workspace memory to run. Increasing workspace size may increase performance, please check verbose output.

As soon as I use -fp16 I get this message, and it takes a max_workspace_size of more than 8G to make it go away.

I had not tried quantization before this. FP32 is fine and only needs a max_workspace_size of 500M.

So I printed the verbose log and found that this may be the cause, but I do not know what it means (line 1527):

[TensorRT] VERBOSE: --------------- Timing Runner: (Unnamed Layer* 1) [Fully Connected] (CudaConvolution)
[TensorRT] VERBOSE: CudaConvolution has no valid tactics for this config, skipping
[TensorRT] VERBOSE: --------------- Timing Runner: (Unnamed Layer* 1) [Fully Connected] (CudaDepthwiseConvolution)
[TensorRT] VERBOSE: CudaDepthwiseConvolution has no valid tactics for this config, skipping
[TensorRT] VERBOSE: --------------- Timing Runner: (Unnamed Layer* 1) [Fully Connected] (CublasConvolution)
[TensorRT] VERBOSE: CublasConvolution has no valid tactics for this config, skipping

log.txt

pogevip commented 3 years ago

@ttyio It looks to me like the cause of this is that the best fusion tactic was not found. How can I fix it? I tried loading the models for inference: the fp32 model costs 6 ms, but the fp16 model costs 12 ms.

ttyio commented 3 years ago

Hello @pogevip, TensorRT queries the available tactics according to your workspace memory budget, and when a tactic is rejected because of the workspace limitation, you will see that warning.

How did you get the fp32/fp16 performance data? Can an 8G workspace fix the fp16 performance drop? If you run trtexec with the --dumpProfile --separateProfileRun flags, we can see the layer-wise runtime, which is helpful for debugging the fp16 performance issue.

pogevip commented 3 years ago

@ttyio I get the fp32/fp16 performance data from the modified inference.py.

import time
import json
import ctypes
import argparse
import collections
import numpy as np
import tensorrt as trt
import pycuda.driver as cuda
import pycuda.autoinit

import helper_chinese.tokenization as tokenization
import helper_chinese.data_processing as dp

TRT_LOGGER = trt.Logger(trt.Logger.INFO)

def parse_args():
    """
    Parse command line arguments
    """
    parser = argparse.ArgumentParser(description=__doc__)
    parser.add_argument('-e', '--engine',default='/workspace/rt_demo/demo/BERT/output/pytorch_model_v16.engine')
    parser.add_argument("-b", "--batch-size", default=1, type=int)

    parser.add_argument('-p', '--passage', nargs='*',help='Text for paragraph/passage for BERT QA',default='')
    parser.add_argument('-v', '--vocab-file',default='/workspace/rt_demo/demo/BERT/vocab.txt')
    parser.add_argument('-s', '--sequence-length',default=128, type=int)
    parser.add_argument('--max-query-length',default=128, type=int)
    parser.add_argument('--n-best-size', default=32, type=int)
    args, _ = parser.parse_known_args()
    return args

if __name__ == '__main__':
    args = parse_args()

    tokenizer = tokenization.FullTokenizer(vocab_file=args.vocab_file, do_lower_case=True)
    # When splitting up a long document into chunks, how much stride to take between chunks.
    doc_stride = 128
    # The maximum total input sequence length after WordPiece tokenization.
    # Sequences longer than this will be truncated, and sequences shorter
    max_seq_length = args.sequence_length

    def question_features(text):
        return dp.convert_examples_to_features(text, None, tokenizer, max_seq_length=128)

    # Import necessary plugins for BERT TensorRT
    handle = ctypes.CDLL("libnvinfer_plugin.so", mode=ctypes.RTLD_GLOBAL)
    if not handle:
        raise RuntimeError("Could not load plugin library. Is `libnvinfer_plugin.so` on your LD_LIBRARY_PATH?")

    # The first context created will use the 0th profile. A new context must be created
    # for each additional profile needed. Here, we only use batch size 1, thus we only need the first profile.
    with open(args.engine, 'rb') as f, trt.Runtime(TRT_LOGGER) as runtime, \
        runtime.deserialize_cuda_engine(f.read()) as engine, engine.create_execution_context() as context:

        # select engine profile
        selected_profile = -1
        num_binding_per_profile = engine.num_bindings // engine.num_optimization_profiles
        for idx in range(engine.num_optimization_profiles):
            profile_shape = engine.get_profile_shape(profile_index = idx, binding = idx * num_binding_per_profile)
            if profile_shape[0][1] <= args.batch_size and profile_shape[2][1] >= args.batch_size and profile_shape[0][0] <= max_seq_length and profile_shape[2][0] >= max_seq_length:
                selected_profile = idx
                break
        if selected_profile == -1:
            raise RuntimeError("Could not find any profile that can run batch size {}.".format(args.batch_size))

        context.active_optimization_profile = selected_profile
        binding_idx_offset = selected_profile * num_binding_per_profile

        # Specify input shapes. These must be within the min/max bounds of the active profile 
        # Note that input shapes can be specified on a per-inference basis, but in this case, we only have a single shape.
        input_shape = (max_seq_length, args.batch_size)
        input_nbytes = trt.volume(input_shape) * trt.int32.itemsize
        for binding in range(3):
            context.set_binding_shape(binding_idx_offset + binding, input_shape)
        assert context.all_binding_shapes_specified

        # Create a stream in which to copy inputs/outputs and run inference.
        stream = cuda.Stream()

        # Allocate device memory for inputs.
        d_inputs = [cuda.mem_alloc(input_nbytes) for binding in range(3)]

        # Allocate output buffer by querying the size from the context. This may be different for different input shapes.
        h_output = cuda.pagelocked_empty(tuple(context.get_binding_shape(binding_idx_offset + 3)), dtype=np.float32)
        d_output = cuda.mem_alloc(h_output.nbytes)

        def inference(feature):
            global h_output

            eval_time_elapsed = 0
            # for feature_index, feature in enumerate(features):
            input_ids_batch = np.dstack([feature["input_ids"]] * args.batch_size).squeeze()
            segment_ids_batch = np.dstack([feature["segment_ids"]] * args.batch_size).squeeze()
            input_mask_batch = np.dstack([feature["input_mask"]] * args.batch_size).squeeze()

            input_ids = cuda.register_host_memory(np.ascontiguousarray(input_ids_batch.ravel()))
            segment_ids = cuda.register_host_memory(np.ascontiguousarray(segment_ids_batch.ravel()))
            input_mask = cuda.register_host_memory(np.ascontiguousarray(input_mask_batch.ravel()))

            eval_start_time = time.time()
            cuda.memcpy_htod_async(d_inputs[0], input_ids, stream)
            cuda.memcpy_htod_async(d_inputs[1], segment_ids, stream)
            cuda.memcpy_htod_async(d_inputs[2], input_mask, stream)

            # Run inference
            context.execute_async_v2(bindings=[0 for i in range(binding_idx_offset)] + [int(d_inp) for d_inp in d_inputs] + [int(d_output)], stream_handle=stream.handle)
            # Synchronize the stream
            stream.synchronize()
            eval_time_elapsed += (time.time() - eval_start_time)

            # Transfer predictions back from GPU
            cuda.memcpy_dtoh_async(h_output, d_output, stream)
            stream.synchronize()

            return h_output

        time_a = time.time()
        features = question_features('今天天气不错')
        time_b = time.time()
        a  = inference(features)
        time_c = time.time()
        print('inference',  time_c-time_b)

And increasing the workspace memory does not lead to better performance.

I haven't tried trtexec.
I use builder.py to convert the torch model to an engine. The full log is in the log.txt file mentioned above.

About varseqlen: it doesn't seem to be an error. I also get the message to increase memory when I use builder_varseqlen.py, and inference only costs 3 ms.

pogevip commented 3 years ago

Could this failure to find the fastest fusion tactic be caused by a problem in my environment configuration? The GPU is an RTX 6000, I am using nvcr.io/nvidia/tensorrt:20.10-py3, and I have not changed any environment variables. Driver Version: 455.23.05, CUDA Version: 11.1.

ttyio commented 3 years ago

Hello @pogevip, is the attached code using a fixed sequence length? If so, I would expect a perf gap compared to var-seqlen. BTW, besides trtexec, you can also use Nsight Systems to capture the perf data.
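
Also, when comparing fp32 vs fp16 from Python, averaging over many runs after a warmup gives more stable numbers than timing a single call; a minimal sketch on top of the inference() function and features from your script above:

import time

for _ in range(10):  # warmup runs, excluded from timing
    inference(features)

n_runs = 100
start = time.time()
for _ in range(n_runs):
    inference(features)
print("avg latency: %.3f ms" % ((time.time() - start) / n_runs * 1000))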