I think I am having the same problem converting a simple MNIST CNN model. Here is my code to reproduce it (you need to install `tensorflow` and `tf2onnx` to run this):
```python
from multiprocessing import Pool

import numpy
import tensorflow as tf
from tensorrt import Builder, BuilderFlag, IInt8EntropyCalibrator2, Logger, NetworkDefinitionCreationFlag, OnnxParser

(_TRAIN_IMAGES, _TRAIN_LABELS), _ = tf.keras.datasets.mnist.load_data()
_TRAIN_IMAGES = numpy.expand_dims(a=_TRAIN_IMAGES, axis=-1).astype(numpy.float32)
_TRAIN_LABELS = tf.keras.utils.to_categorical(y=_TRAIN_LABELS, num_classes=10)


class _Calibrator(IInt8EntropyCalibrator2):
    def __init__(self):
        super().__init__()

        self._batch_size = 1
        self._cache = None

    def get_batch(self, names, p_str=None):
        raise NotImplementedError

    def get_batch_size(self):
        return self._batch_size

    def read_calibration_cache(self):
        return self._cache

    def write_calibration_cache(self, cache):
        self._cache = cache


def _assert(value):
    if not value:
        raise AssertionError


def _get_frozen_graph_model():
    with tf.compat.v1.Session(graph=tf.Graph()) as session:
        model = tf.keras.Sequential(layers=[
            tf.keras.layers.Conv2D(filters=32, kernel_size=[3, 3], activation='relu', input_shape=[28, 28, 1]),
            tf.keras.layers.Conv2D(filters=64, kernel_size=[3, 3], activation='relu'),
            tf.keras.layers.MaxPooling2D(pool_size=[2, 2]),
            tf.keras.layers.Flatten(),
            tf.keras.layers.Dense(units=128, activation='relu'),
            tf.keras.layers.Dense(units=10, activation='softmax'),
        ])

        model.compile(optimizer=tf.keras.optimizers.SGD(), loss='binary_crossentropy', metrics=['accuracy'])
        model.fit(x=_TRAIN_IMAGES, y=_TRAIN_LABELS)

        return (tf.compat.v1.graph_util.convert_variables_to_constants(sess=session,
                                                                       input_graph_def=session.graph_def,
                                                                       output_node_names=[model.output.op.name]),
                model.input.name,
                model.output.name)


def _tf_to_onnx(graph_def, input_name, output_name):
    from onnx import defs
    from tf2onnx import tfonnx

    with tf.Graph().as_default() as graph:
        tf.import_graph_def(graph_def=graph_def, name='')

        onnx_model = tfonnx.process_tf_graph(tf_graph=graph,
                                             opset=defs.onnx_opset_version(),
                                             input_names=[input_name],
                                             output_names=[output_name])

        return onnx_model.make_model('').SerializeToString()


def _onnx_to_tensorrt(onnx_model):
    batch_size = 1

    with Logger() as logger, \
            Builder(logger) as builder, \
            builder.create_network(1 << int(NetworkDefinitionCreationFlag.EXPLICIT_BATCH)) as network, \
            OnnxParser(network, logger) as onnx_parser:
        _assert(onnx_parser.parse(onnx_model))

        builder.max_batch_size = batch_size
        builder_config = builder.create_builder_config()
        optimization_profile = builder.create_optimization_profile()

        for i in range(network.num_inputs):
            input_tensor = network.get_input(i)
            shape = (batch_size,) + input_tensor.shape[1:]
            optimization_profile.set_shape(input=input_tensor.name, min=shape, opt=shape, max=shape)

        builder_config.add_optimization_profile(optimization_profile)
        builder_config.set_flag(BuilderFlag.INT8)
        builder_config.int8_calibrator = _Calibrator()

        cuda_engine = builder.build_engine(network, builder_config)
        _assert(cuda_engine)

        return cuda_engine


def _get_onnx_model():
    graph_def, input_name, output_name = _get_frozen_graph_model()

    return _tf_to_onnx(graph_def=graph_def, input_name=input_name, output_name=output_name)


def main():
    with Pool(processes=1) as pool:
        # Run in another process to make sure GPU memory used by TensorFlow gets freed.
        onnx_model = pool.apply(_get_onnx_model)

    _onnx_to_tensorrt(onnx_model=onnx_model)


if __name__ == '__main__':
    main()
```
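Note that `get_batch` in the calibrator above is just a stub that raises `NotImplementedError`. A working implementation would have to copy a calibration batch into device memory and return a list of device pointers; a rough sketch of that (using pycuda and random data, purely illustrative and not part of the repro above) would be:

```python
# Purely illustrative sketch: random data via pycuda, names are made up.
import numpy as np
import pycuda.autoinit  # noqa: F401  (initializes a CUDA context)
import pycuda.driver as cuda

_batch = np.random.random((1, 28, 28, 1)).astype(np.float32)
_device_input = cuda.mem_alloc(_batch.nbytes)


def get_batch(names, p_str=None):
    # Copy the calibration batch to the GPU and return its device pointer as an int.
    cuda.memcpy_htod(_device_input, np.ascontiguousarray(_batch))
    return [int(_device_input)]
```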
Hi @wyp19960713 @EFanZh ,
TensorRT 7.0 had known issues with INT8 calibration on models with dynamic shape. Please upgrade to TensorRT 7.1; the issue should be fixed.
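If you're not sure which TensorRT build your Python environment is actually picking up, a quick check is:

```python
import tensorrt as trt

print(trt.__version__)  # should report 7.1.x or newer
```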
@rmccorm4
Actually, I am using TensorRT 7.1.3.4, and the problem still exists.
@EFanZh ~~if your model has dynamic shape (-1/None in any dimension), have you defined an optimization profile?~~
~~I believe it's required for the INT8 calibration, and it will use the kOPT shape for calibration: https://docs.nvidia.com/deeplearning/tensorrt/developer-guide/index.html#int8-calib-dynamic-shapes~~
Edit: sorry I see the optimization profile in your code snippet now. I'll try to take a look tomorrow.
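For anyone else hitting this with dynamic shapes, here is a minimal sketch of pinning a fixed shape for calibration. It assumes the TensorRT 7.1 Python API (`set_calibration_profile` may not exist on older releases) and reuses the input name from the model in this thread:

```python
import tensorrt as trt

TRT_LOGGER = trt.Logger()
builder = trt.Builder(TRT_LOGGER)
config = builder.create_builder_config()

# Use identical min/opt/max so the calibrator always sees a fixed shape.
profile = builder.create_optimization_profile()
profile.set_shape("conv2d_input:0", (1, 28, 28, 1), (1, 28, 28, 1), (1, 28, 28, 1))
config.add_optimization_profile(profile)

# TensorRT 7.1+ can be told explicitly which profile to use during INT8 calibration.
config.set_calibration_profile(profile)
config.set_flag(trt.BuilderFlag.INT8)
```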
Hello @EFanZh, I have asked NVIDIA about the same issue, and they suggested increasing the workspace size. But when I used an RTX 2080 GPU (10989 MB of GPU memory) and set the workspace size to the maximum value, another error occurred:

```
[TensorRT] VERBOSE: Calculating Maxima
[TensorRT] ERROR: ../rtSafe/safeRuntime.cpp (25) - Cuda Error in allocate: 11 (invalid argument)
[TensorRT] ERROR: ../rtSafe/safeRuntime.cpp (25) - Cuda Error in allocate: 11 (invalid argument)
```

Is the reason for this error that the workspace size still isn't big enough? Can you help me solve the problem? Thank you very much! The following is my complete code; I don't know whether it is wrong:
```python
import os

import pycuda.autoinit
import pycuda.driver as cuda
import tensorrt as trt

TRT_LOGGER = trt.Logger(trt.Logger.VERBOSE)


class MNISTEntropyCalibrator(trt.IInt8EntropyCalibrator):
    def __init__(self, cache_file, batch_size=1):
        # Whenever you specify a custom constructor for a TensorRT class,
        # you MUST call the constructor of the parent explicitly.
        trt.IInt8EntropyCalibrator.__init__(self)

        self.cache_file = cache_file

        # Every time get_batch is called, the next batch of size batch_size will be copied to the device and returned.
        self.data = load_data(data_list)  # load_data / data_list are defined elsewhere in this script
        self.batch_size = batch_size
        self.current_index = 0

        # Allocate enough memory for a whole batch.
        print(self.data[0].nbytes * self.batch_size)
        self.device_input = cuda.mem_alloc(self.data[0].nbytes * self.batch_size)
        # self.device_input = cuda.mem_alloc(2 << 30)
        print(self.device_input)

    # TensorRT passes along the names of the engine bindings to the get_batch function.
    # You don't necessarily have to use them, but they can be useful to understand the order of
    # the inputs. The bindings list is expected to have the same ordering as 'names'.
    def get_batch(self, names):
        if self.current_index + self.batch_size > self.data.shape[0]:
            return None

        current_batch = int(self.current_index / self.batch_size)
        if current_batch % 10 == 0:
            print("Calibrating batch {:}, containing {:} images".format(current_batch, self.batch_size))

        batch = self.data[self.current_index:self.current_index + self.batch_size].ravel()
        cuda.memcpy_htod(self.device_input, batch)
        self.current_index += self.batch_size
        return [self.device_input]

    def get_batch_size(self):
        return self.batch_size

    def read_calibration_cache(self):
        # If there is a cache, use it instead of calibrating again. Otherwise, implicitly return None.
        if os.path.exists(self.cache_file):
            with open(self.cache_file, "rb") as f:
                return f.read()

    def write_calibration_cache(self, cache):
        with open(self.cache_file, "wb") as f:
            f.write(cache)


EXPLICIT_BATCH = 1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH)

# Building engine
with trt.Builder(TRT_LOGGER) as builder, builder.create_network(EXPLICIT_BATCH) as network, builder.create_builder_config() as config, \
        trt.OnnxParser(network, TRT_LOGGER) as parser:
    builder.max_batch_size = 1
    builder.max_workspace_size = 1 << 33
    builder.int8_mode = True

    calibration_cache = "./mnist_calibration.cache"
    calib = MNISTEntropyCalibrator(cache_file=calibration_cache, batch_size=1)
    config_flags = 1 << int(trt.BuilderFlag.INT8)
    config.flags = config_flags
    config.int8_calibrator = calib

    with open("/home/dm/ATP-Audio-classification-training-pipeline/voice_recognition/checkpoints/mobilenetV2-gvlad28/mobilenetV2.onnx", 'rb') as model:
        if not parser.parse(model.read()):
            for error in range(parser.num_errors):
                print(parser.get_error(error))

    last_layer = network.get_layer(network.num_layers - 1)
    if not last_layer.get_output(0):
        network.mark_output(last_layer.get_output(0))

    print("network layers", network.num_layers)

    inputs = [network.get_input(i) for i in range(network.num_inputs)]
    outputs = [network.get_output(i) for i in range(network.num_outputs)]
    for inp in inputs:
        print(inp.shape[0])
    for oup in outputs:
        print(oup.shape[0])

    profile_intput = builder.create_optimization_profile()
    profile_intput.set_shape("input", (1, 257, 200, 1), (1, 257, 200, 1), (1, 257, 200, 1))
    config.add_optimization_profile(profile_intput)
    config.max_workspace_size = 1 << 33

    engine = builder.build_engine(network, config)
    with open("/home/dm/ATP-Audio-classification-training-pipeline/voice_recognition/checkpoints/mobilenetV2-gvlad28/mobilenetV2_int8.trt", "wb") as f:
        f.write(engine.serialize())
```
I don't think this is because of workspace size. My model is a very simple network; it shouldn't cost much GPU memory. And yes, I have tried setting `max_workspace_size`, and the problem has not gone away.
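For what it's worth, in the snippet above both `builder.max_workspace_size` and `config.max_workspace_size` are set to `1 << 33` (8 GiB), which is a large fraction of the total memory of the GPUs mentioned in this thread. I believe that when `build_engine(network, config)` is used it is the config value that is honored, so a smaller explicit value should be enough for a model like this, for example:

```python
# Sketch: a 1 GiB workspace should be plenty for a small CNN; builder.max_workspace_size
# belongs to the older build_cuda_engine() path and is ignored here (my understanding).
config.max_workspace_size = 1 << 30
```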
I met the same issue... when I run INT8 ONNX models (ResNet50 and MobileNet), it prints:

```
[08/14/2020-05:01:56] [V] [TRT] Engine generation completed in 4.22749 seconds.
[08/14/2020-05:01:56] [V] [TRT] Calculating Maxima
[08/14/2020-05:01:56] [E] [TRT] ../rtSafe/safeRuntime.cpp (25) - Cuda Error in allocate: 2 (out of memory)
[08/14/2020-05:01:56] [E] [TRT] ../rtSafe/safeRuntime.cpp (25) - Cuda Error in allocate: 2 (out of memory)
```
Hi @rmccorm4, are there any updates on this one?
Hi @EFanZh ,
I'm not sure what errors you're experiencing, but I had to edit your code a lot:

- `get_batch` needs to be implemented.
- `RuntimeError: Unable to cast Python instance to C++ type (compile in debug mode for details)`, which I believe also comes from `get_batch` not being implemented correctly.

I made a sample calibrator class from yours here, and building the engine from your model works fine for me:
```python
# Calibrator.py
import os
import logging

import numpy as np
import tensorrt as trt
import pycuda.driver as cuda
import pycuda.autoinit

logging.basicConfig(level=logging.DEBUG,
                    format="%(asctime)s - %(name)s - %(levelname)s - %(message)s",
                    datefmt="%Y-%m-%d %H:%M:%S")
logger = logging.getLogger(__name__)


class _Calibrator(trt.IInt8EntropyCalibrator2):
    def __init__(self, opt_shape=(1, 28, 28, 1)):
        super().__init__()
        self._batch_size = opt_shape[0]

        num_samples = 1000
        self.batches = (np.random.random(opt_shape[1:]).astype(np.float32) for i in range(num_samples))
        self.device_input = cuda.mem_alloc(np.zeros(opt_shape, dtype=np.float32).nbytes)

        self.cache_file = "calibration.cache"

    def get_batch(self, names, p_str=None):
        try:
            batch = next(self.batches)
            cuda.memcpy_htod(self.device_input, batch)
            return [int(self.device_input)]
        except StopIteration:
            return None

    def get_batch_size(self):
        return self._batch_size

    def read_calibration_cache(self):
        # If there is a cache, use it instead of calibrating again. Otherwise, implicitly return None.
        if os.path.exists(self.cache_file):
            with open(self.cache_file, "rb") as f:
                logger.info("Using calibration cache to save time: {:}".format(self.cache_file))
                return f.read()

    def write_calibration_cache(self, cache):
        with open(self.cache_file, "wb") as f:
            logger.info("Caching calibration data for future use: {:}".format(self.cache_file))
            f.write(cache)
```
And I was able to do INT8 calibration using the above calibrator class on your model with dynamic shape:
```
root@35e49d2833e6:/mnt/tensorrt-utils/int8/calibration# python3 onnx_to_tensorrt.py --explicit-batch --onnx=../../../tf_gpu.onnx --int8
2020-09-04 04:00:14 - __main__ - INFO - TRT_LOGGER Verbosity: Severity.ERROR
2020-09-04 04:00:27 - __main__ - INFO - Setting BuilderFlag.INT8
2020-09-04 04:00:27 - __main__ - DEBUG - === Network Description ===
2020-09-04 04:00:27 - __main__ - DEBUG - Input 0 | Name: conv2d_input:0 | Shape: (-1, 28, 28, 1)
2020-09-04 04:00:27 - __main__ - DEBUG - Output 0 | Name: dense_1/Softmax:0 | Shape: (-1, -1)
2020-09-04 04:00:27 - __main__ - DEBUG - === Optimization Profiles ===
2020-09-04 04:00:27 - __main__ - DEBUG - conv2d_input:0 - OptProfile 0 - Min (1, 28, 28, 1) Opt (1, 28, 28, 1) Max (1, 28, 28, 1)
2020-09-04 04:00:27 - __main__ - DEBUG - conv2d_input:0 - OptProfile 1 - Min (8, 28, 28, 1) Opt (8, 28, 28, 1) Max (8, 28, 28, 1)
2020-09-04 04:00:27 - __main__ - DEBUG - conv2d_input:0 - OptProfile 2 - Min (16, 28, 28, 1) Opt (16, 28, 28, 1) Max (16, 28, 28, 1)
2020-09-04 04:00:27 - __main__ - DEBUG - conv2d_input:0 - OptProfile 3 - Min (32, 28, 28, 1) Opt (32, 28, 28, 1) Max (32, 28, 28, 1)
2020-09-04 04:00:27 - __main__ - DEBUG - conv2d_input:0 - OptProfile 4 - Min (64, 28, 28, 1) Opt (64, 28, 28, 1) Max (64, 28, 28, 1)
2020-09-04 04:00:27 - __main__ - INFO - Building Engine...
2020-09-04 04:00:32 - Calibrator - INFO - Caching calibration data for future use: calibration.cache
2020-09-04 04:00:38 - __main__ - INFO - Serializing engine to file: model.engine
```
The above script is from here, and I edited it to use the class above as the calibrator instead of `ImagenetCalibrator`:
```python
# onnx_to_tensorrt.py
# ...
if args.int8:
    from Calibrator import _Calibrator
    config.int8_calibrator = _Calibrator()
```
Hopefully this helps as a reference for your issue as well, @wyp19960713.
@rmccorm4 Thank you for your response. I have run my script again, and the out-of-memory error is gone. Maybe some other program was occupying my GPU memory at the time, which caused the error.
No problem. I'm going to close this issue for now. Feel free to open a new issue if the solutions above don't work for you with the latest TensorRT version.
Description
```
[TensorRT] VERBOSE: Engine generation completed in 3.27319 seconds.
[TensorRT] VERBOSE: Calculating Maxima
[TensorRT] ERROR: ../rtSafe/safeRuntime.cpp (25) - Cuda Error in allocate: 2 (out of memory)
[TensorRT] ERROR: ../rtSafe/safeRuntime.cpp (25) - Cuda Error in allocate: 2 (out of memory)
```

When the calibrator is activated, these errors occur. My calibrator code is the MNISTEntropyCalibrator snippet shown earlier in this thread.
Environment
TensorRT Version: 7.0.0.11
GPU Type: RTX 2070
Nvidia Driver Version: 440.82
CUDA Version: 10.0
CUDNN Version: 7.6.4
Operating System + Version: Ubuntu 16.04
Python Version (if applicable): 3.7.6
TensorFlow Version (if applicable): 1.14
PyTorch Version (if applicable):
Baremetal or Container (if container which image + tag):
Relevant Files
Steps To Reproduce