NVIDIA / TensorRT

NVIDIA® TensorRT™ is an SDK for high-performance deep learning inference on NVIDIA GPUs. This repository contains the open source components of TensorRT.
https://developer.nvidia.com/tensorrt
Apache License 2.0

Poor inference results of TensorRT 8.6.3 when running INT8-calibration on GPU RTX3090 #3708

Closed: bernardrb closed this issue 4 months ago

bernardrb commented 5 months ago

Description

I tried to run EfficientViT-SAM on an RTX 3090, but quantization to 8-bit gave severely distorted results. I am unsure whether the issue lies in my calibration code or is inherent to the quantization itself. I've adapted the ImageBatcher to work with my model.

The images below were produced after calibrating on 10,000 images from the Meta SAM dataset.

INT8 result: (attached image 2024-03-08_16-15-30)

FP16 result: (attached image 2024-03-08_16-14-04)

The calibrator and engine builder below are adapted from samples/EfficientDet:

class EngineCalibrator(trt.IInt8EntropyCalibrator2):
    def __init__(self, cache_file):
        """
        :param cache_file: The location of the cache file.
        """
        super().__init__()
        self.cache_file = cache_file
        self.image_batcher = None
        self.batch_allocation = None
        self.batch_generator = None

    def set_image_batcher(self, image_batcher: ImageBatcher):
        """
        Define the image batcher to use, if any. If using only the cache file, an image batcher doesn't need
        to be defined.
        :param image_batcher: The ImageBatcher object
        """
        self.image_batcher = image_batcher
        size = int(np.dtype(self.image_batcher.dtype).itemsize * np.prod(self.image_batcher.shape))
        self.batch_allocation = common.cuda_call(cudart.cudaMalloc(size))
        self.batch_generator = self.image_batcher.get_batch()

    def get_batch_size(self):
        """
        Overrides from trt.IInt8EntropyCalibrator2.
        Get the batch size to use for calibration.
        :return: Batch size.
        """
        if self.image_batcher:
            return self.image_batcher.batch_size
        return 1

    def get_batch(self, names):
        """
        Overrides from trt.IInt8EntropyCalibrator2.
        Get the next batch to use for calibration, as a list of device memory pointers.
        :param names: The names of the inputs, if useful to define the order of inputs.
        :return: A list of int-casted memory pointers.
        """
        if not self.image_batcher:
            return None
        try:
            batch, _, _ = next(self.batch_generator)
            log.info("Calibrating image {} / {}".format(self.image_batcher.image_index, self.image_batcher.num_images))
            common.memcpy_host_to_device(self.batch_allocation, np.ascontiguousarray(batch))
            return [int(self.batch_allocation)]
        except StopIteration:
            log.info("Finished calibration batches")
            return None

    def read_calibration_cache(self):
        """
        Overrides from trt.IInt8EntropyCalibrator2.
        Read the calibration cache file stored on disk, if it exists.
        :return: The contents of the cache file, if any.
        """
        if self.cache_file is not None and os.path.exists(self.cache_file):
            with open(self.cache_file, "rb") as f:
                log.info("Using calibration cache file: {}".format(self.cache_file))
                return f.read()

    def write_calibration_cache(self, cache):
        """
        Overrides from trt.IInt8EntropyCalibrator2.
        Store the calibration cache to a file on disk.
        :param cache: The contents of the calibration cache to store.
        """
        if self.cache_file is None:
            return
        with open(self.cache_file, "wb") as f:
            log.info("Writing calibration cache data to: {}".format(self.cache_file))
            f.write(cache)

class EngineBuilder:
    """
    Parses an ONNX graph and builds a TensorRT engine from it.
    """

    def __init__(self, verbose=False, workspace=8):
        """
        :param verbose: If enabled, a higher verbosity level will be set on the TensorRT logger.
        :param workspace: Max memory workspace to allow, in GB.
        """
        self.trt_logger = trt.Logger(trt.Logger.INFO)
        if verbose:
            self.trt_logger.min_severity = trt.Logger.Severity.VERBOSE

        trt.init_libnvinfer_plugins(self.trt_logger, namespace="")

        self.builder = trt.Builder(self.trt_logger)
        self.config = self.builder.create_builder_config()
        self.config.max_workspace_size = workspace * (2 ** 30)

        self.network = None
        self.parser = None

    def create_network(self, onnx_path, batch_size, dynamic_batch_size=None):
        """
        Parse the ONNX graph and create the corresponding TensorRT network definition.
        :param onnx_path: The path to the ONNX graph to load.
        :param batch_size: Static batch size to build the engine with.
        :param dynamic_batch_size: Dynamic batch size to build the engine with; if given,
        batch_size is ignored. Pass as a comma-separated string or a list of three ints as MIN,OPT,MAX.
        """
        network_flags = (1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))

        self.network = self.builder.create_network(network_flags)
        self.parser = trt.OnnxParser(self.network, self.trt_logger)

        onnx_path = os.path.realpath(onnx_path)
        with open(onnx_path, "rb") as f:
            if not self.parser.parse(f.read()):
                log.error("Failed to load ONNX file: {}".format(onnx_path))
                for error in range(self.parser.num_errors):
                    log.error(self.parser.get_error(error))
                sys.exit(1)

        log.info("Network Description")

        inputs = [self.network.get_input(i) for i in range(self.network.num_inputs)]
        profile = self.builder.create_optimization_profile()
        dynamic_inputs = False
        for input in inputs:
            log.info("Input '{}' with shape {} and dtype {}".format(input.name, input.shape, input.dtype))
            if input.shape[0] == -1:
                dynamic_inputs = True
                if dynamic_batch_size:
                    if type(dynamic_batch_size) is str:
                        dynamic_batch_size = [int(v) for v in dynamic_batch_size.split(",")]
                    assert len(dynamic_batch_size) == 3
                    min_shape = [dynamic_batch_size[0]] + list(input.shape[1:])
                    opt_shape = [dynamic_batch_size[1]] + list(input.shape[1:])
                    max_shape = [dynamic_batch_size[2]] + list(input.shape[1:])
                    profile.set_shape(input.name, min_shape, opt_shape, max_shape)
                    log.info("Input '{}' Optimization Profile with shape MIN {} / OPT {} / MAX {}".format(
                        input.name, min_shape, opt_shape, max_shape))
                else:
                    shape = [batch_size] + list(input.shape[1:])
                    profile.set_shape(input.name, shape, shape, shape)
                    log.info("Input '{}' Optimization Profile with shape {}".format(input.name, shape))
        if dynamic_inputs:
            self.config.add_optimization_profile(profile)

        outputs = [self.network.get_output(i) for i in range(self.network.num_outputs)]
        for output in outputs:
            log.info("Output '{}' with shape {} and dtype {}".format(output.name, output.shape, output.dtype))

    def set_mixed_precision(self):
        """
        Experimental precision mode.
        Enable mixed-precision mode. When set, the layers defined here will be forced to FP16 to maximize
        INT8 inference accuracy, while having minimal impact on latency.
        """
        self.config.set_flag(trt.BuilderFlag.STRICT_TYPES)

        # All convolution operations in the first four blocks of the graph are pinned to FP16.
        # These layers have been manually chosen as they give a good middle-point between int8 and fp16
        # accuracy in COCO, while maintaining almost the same latency as a normal int8 engine.
        # To experiment with other datasets, or a different balance between accuracy/latency, you may
        # add or remove blocks.
        for i in range(self.network.num_layers):
            layer = self.network.get_layer(i)
            if layer.type == trt.LayerType.CONVOLUTION and any([
                    # AutoML Layer Names:
                    "/stem/" in layer.name,
                    "/blocks_0/" in layer.name,
                    "/blocks_1/" in layer.name,
                    "/blocks_2/" in layer.name,
                    # TFOD Layer Names:
                    "/stem_conv2d/" in layer.name,
                    "/stack_0/block_0/" in layer.name,
                    "/stack_1/block_0/" in layer.name,
                    "/stack_1/block_1/" in layer.name,
                ]):
                self.network.get_layer(i).precision = trt.DataType.HALF
                log.info("Mixed-Precision Layer {} set to HALF STRICT data type".format(layer.name))

    def create_engine(self, engine_path, precision, calib_input=None, calib_cache=None, calib_num_images=5000,
                      calib_batch_size=8):
        """
        Build the TensorRT engine and serialize it to disk.
        :param engine_path: The path where to serialize the engine to.
        :param precision: The datatype to use for the engine, either 'fp32', 'fp16', 'int8', or 'mixed'.
        :param calib_input: The path to a directory holding the calibration images.
        :param calib_cache: The path where to write the calibration cache to, or if it already exists, load it from.
        :param calib_num_images: The maximum number of images to use for calibration.
        :param calib_batch_size: The batch size to use for the calibration process.
        """
        engine_path = os.path.realpath(engine_path)
        engine_dir = os.path.dirname(engine_path)
        os.makedirs(engine_dir, exist_ok=True)
        log.info("Building {} Engine in {}".format(precision, engine_path))

        inputs = [self.network.get_input(i) for i in range(self.network.num_inputs)]

        if precision in ["fp16", "int8", "mixed"]:
            if not self.builder.platform_has_fast_fp16:
                log.warning("FP16 is not supported natively on this platform/device")
            self.config.set_flag(trt.BuilderFlag.FP16)
        if precision in ["int8", "mixed"]:
            if not self.builder.platform_has_fast_int8:
                log.warning("INT8 is not supported natively on this platform/device")
            self.config.set_flag(trt.BuilderFlag.INT8)
            self.config.int8_calibrator = EngineCalibrator(calib_cache)
            if calib_cache is None or not os.path.exists(calib_cache):
                calib_shape = [calib_batch_size] + list(inputs[0].shape[1:])
                calib_dtype = trt.nptype(inputs[0].dtype)
                self.config.int8_calibrator.set_image_batcher(
                    ImageBatcher(calib_input, calib_shape, calib_dtype, max_num_images=calib_num_images,
                                 exact_batches=True, shuffle_files=True))

        engine_bytes = None
        try:
            engine_bytes = self.builder.build_serialized_network(self.network, self.config)
        except AttributeError:
            engine = self.builder.build_engine(self.network, self.config)
            engine_bytes = engine.serialize()
            del engine
        assert engine_bytes
        with open(engine_path, "wb") as f:
            log.info("Serializing engine to file: {:}".format(engine_path))
            f.write(engine_bytes)

class ImageBatcher:
    def __init__(self, input, shape, dtype, max_num_images=None, exact_batches=False,
                 preprocessor="EfficientDet", shuffle_files=False):
        input = os.path.realpath(input)
        self.images = []

        extensions = [".jpg", ".jpeg", ".png", ".bmp"]

        def is_image(path):
            return (
                os.path.isfile(path) and os.path.splitext(path)[1].lower() in extensions
            )

        if os.path.isdir(input):
            self.images = [
                os.path.join(input, f)
                for f in os.listdir(input)
                if is_image(os.path.join(input, f))
            ]
            self.images.sort()
            if shuffle_files:
                random.seed(47)
                random.shuffle(self.images)
        elif os.path.isfile(input):
            if is_image(input):
                self.images.append(input)
        self.num_images = len(self.images)
        if self.num_images < 1:
            print("No valid {} images found in {}".format("/".join(extensions), input))
            sys.exit(1)

        # Handle Tensor Shape
        self.dtype = dtype
        self.shape = shape
        assert len(self.shape) == 4
        self.batch_size = shape[0]
        assert self.batch_size > 0
        self.format = "NHWC"
        self.height = self.shape[1]
        self.width = self.shape[2]
        assert all([self.format, self.width > 0, self.height > 0])

        # Adapt the number of images as needed
        if max_num_images and 0 < max_num_images < len(self.images):
            self.num_images = max_num_images
        if exact_batches:
            self.num_images = self.batch_size * (self.num_images // self.batch_size)
        if self.num_images < 1:
            print("Not enough images to create batches")
            sys.exit(1)
        self.images = self.images[0 : self.num_images]

        # Subdivide the list of images into batches
        self.num_batches = 1 + int((self.num_images - 1) / self.batch_size)
        self.batches = []
        for i in range(self.num_batches):
            start = i * self.batch_size
            end = min(start + self.batch_size, self.num_images)
            self.batches.append(self.images[start:end])

        # Indices
        self.image_index = 0
        self.batch_index = 0

        self.preprocessor = preprocessor

    def preprocess_image(self, image_path):
        img_size = 512

        x = cv2.cvtColor(cv2.imread(image_path), cv2.COLOR_BGR2RGB)

        pixel_mean = [123.675 / 255, 116.28 / 255, 103.53 / 255]
        pixel_std = [58.395 / 255, 57.12 / 255, 57.375 / 255]

        x = torch.tensor(x)
        resize_transform = SamResize(img_size)
        x = resize_transform(x).float() / 255
        x = transforms.Normalize(mean=pixel_mean, std=pixel_std)(x)

        h, w = x.shape[-2:]
        th, tw = img_size, img_size
        assert th >= h and tw >= w
        x = F.pad(x, (0, tw - w, 0, th - h), value=0).unsqueeze(0).numpy()

        return x, None

    def get_batch(self):
        """
        Retrieve the batches. This is a generator object, so you can use it within a loop as:
        for batch, images, scales in batcher.get_batch():
           ...
        Or outside of a loop with the next() function.
        :return: A generator yielding three items per iteration: a numpy array holding a batch of images, the list of
        paths to the images loaded within this batch, and the list of resize scales for each image in the batch.
        """
        for i, batch_images in enumerate(self.batches):
            batch_data = np.zeros(self.shape, dtype=self.dtype)

            batch_scales = [None] * len(batch_images)
            for i, image in enumerate(batch_images):
                self.image_index += 1
                batch_data[i], batch_scales[i] = self.preprocess_image(image)
            self.batch_index += 1
            yield batch_data, batch_images, batch_scales

class SamResize:
    def __init__(self, size: int) -> None:
        self.size = size

    def __call__(self, image: torch.Tensor) -> torch.Tensor:
        h, w, _ = image.shape
        long_side = max(h, w)
        if long_side != self.size:
            return self.apply_image(image)
        else:
            return image.permute(2, 0, 1)

    def apply_image(self, image: torch.Tensor) -> torch.Tensor:
        """
        Expects a torch tensor with shape HxWxC in float format.
        """

        target_size = self.get_preprocess_shape(
            image.shape[0], image.shape[1], self.size
        )
        return resize(image.permute(2, 0, 1), target_size)

    @staticmethod
    def get_preprocess_shape(
        oldh: int, oldw: int, long_side_length: int
    ) -> tuple[int, int]:
        """
        Compute the output size given input size and target long side length.
        """
        scale = long_side_length * 1.0 / max(oldh, oldw)
        newh, neww = oldh * scale, oldw * scale
        neww = int(neww + 0.5)
        newh = int(newh + 0.5)
        return (newh, neww)

    def __repr__(self) -> str:
        return f"{type(self).__name__}(size={self.size})"

Environment

TensorRT Version: 8.6.3

NVIDIA GPU: RTX 3090

NVIDIA Driver Version: 525.147.05

CUDA Version: 12.0

CUDNN Version: 9.0.0

Operating System: Ubuntu 22.04

Python Version (if applicable): 3.10.12

PyTorch Version (if applicable): 2.2.1

Baremetal or Container (if so, version): nvcr.io/nvidia/tensorrt:24.02-py3

Relevant Files

Model link: https://github.com/mit-han-lab/efficientvit/blob/master/applications/sam.md

All files: https://drive.google.com/drive/folders/16Qe72Kf1SmXobz9X1YKuDB8pDGQVAheK?usp=sharing

Includes minimal setup, source code, logs, and results.

Steps To Reproduce

Commands or scripts:

For convenience:
docker compose up build_engine
docker compose up inference

Or directly:
scripts/quantize.sh to build the engine
scripts/inference.sh to run inference

Have you tried the latest release?: Yes.

zerollzeng commented 5 months ago

For transformer-based models, PTQ cannot provide good accuracy; you can try QAT.
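
For reference, a rough sketch (not from this thread) of what QAT with NVIDIA's pytorch-quantization toolkit could look like for this encoder; the model constructor, calibration loader, and fine-tuning step are placeholders. The flow is: insert fake-quantization nodes, calibrate their ranges, fine-tune, then export an ONNX graph with explicit Q/DQ nodes.

import torch
from pytorch_quantization import quant_modules
from pytorch_quantization import nn as quant_nn

quant_modules.initialize()                       # swap nn.Conv2d/nn.Linear for quantized variants
model = build_efficientvit_sam_encoder().cuda()  # placeholder constructor
model.eval()

# 1) Calibrate the fake-quant ranges on a few hundred images.
for m in model.modules():
    if isinstance(m, quant_nn.TensorQuantizer):
        m.disable_quant()
        m.enable_calib()
with torch.no_grad():
    for images in calib_loader:                  # placeholder data loader
        model(images.cuda())
for m in model.modules():
    if isinstance(m, quant_nn.TensorQuantizer):
        m.load_calib_amax()
        m.enable_quant()
        m.disable_calib()

# 2) Fine-tune the model for a short schedule with the usual training loop (the "T" in QAT).

# 3) Export with explicit Q/DQ nodes; TensorRT builds the INT8 engine from the
#    embedded scales, so no IInt8Calibrator is needed at build time.
quant_nn.TensorQuantizer.use_fb_fake_quant = True
dummy = torch.randn(1, 3, 512, 512, device="cuda")
torch.onnx.export(model, dummy, "sam_encoder_qat.onnx", opset_version=13)

The exported graph can then go through the same EngineBuilder as above, with the calibrator section skipped.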

bernardrb commented 5 months ago

For transformer-based models, PTQ cannot provide good accuracy; you can try QAT.

Can you elaborate on this?

Considering that EfficientViT avoids softmax attention by using ReLU-based linear attention and convolutional layers, I would have thought it would not experience the same accuracy hit. But I am just theorizing, and I guess the results speak for themselves.

Are you suggesting using the PyTorch integration for QAT? In your opinion, could a conservative mixed-precision approach work without the need for training?

What are the obstacles to trying the PTQ methods in TensorRT-LLM (e.g. SmoothQuant) on a vision model?
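
One training-free variant of the mixed-precision idea (a sketch only, with placeholder layer-name filters) is to pin layers that look quantization-sensitive to FP16 while leaving the rest eligible for INT8; in TensorRT 8.6 the relevant flag is OBEY_PRECISION_CONSTRAINTS, the successor to the STRICT_TYPES flag used in set_mixed_precision above.

import tensorrt as trt

def pin_sensitive_layers_to_fp16(network, config, name_filters=("norm", "attn")):
    """Force layers whose names match a filter to FP16; other layers may still run INT8."""
    config.set_flag(trt.BuilderFlag.OBEY_PRECISION_CONSTRAINTS)
    for i in range(network.num_layers):
        layer = network.get_layer(i)
        if layer.type not in (trt.LayerType.CONVOLUTION, trt.LayerType.MATRIX_MULTIPLY):
            continue
        if any(f in layer.name.lower() for f in name_filters):  # placeholder name filters
            layer.precision = trt.DataType.HALF
            layer.set_output_type(0, trt.DataType.HALF)

This would be called on self.network and self.config before create_engine(); whether it recovers accuracy for EfficientViT-SAM is exactly the open question in this thread.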

ApolloRay commented 5 months ago

For transformer-based models, PTQ cannot provide good accuracy; you can try QAT.

Hello, do you know anything about quantizing diffusion models? On an A10 card, when quantizing the sdxl-turbo model, the UNet's inference takes longer than without quantization. (Only nn.Linear was quantized; quantizing both the convolutional and linear layers does not fit in 22 GB of GPU memory.)

zerollzeng commented 5 months ago

@bernardrb please refer to https://docs.nvidia.com/deeplearning/tensorrt/developer-guide/index.html#explicit-implicit-quantization
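
For context, the linked page distinguishes implicit quantization (the calibrator flow used in this issue) from explicit quantization, where the ONNX graph already carries Q/DQ nodes and TensorRT takes the INT8 scales from the graph itself. A minimal sketch of building from such a graph, assuming a hypothetical model_qdq.onnx exported with Q/DQ nodes:

import tensorrt as trt

logger = trt.Logger(trt.Logger.INFO)
builder = trt.Builder(logger)
network = builder.create_network(1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
parser = trt.OnnxParser(network, logger)

with open("model_qdq.onnx", "rb") as f:  # hypothetical ONNX containing Q/DQ nodes
    if not parser.parse(f.read()):
        raise RuntimeError("Failed to parse ONNX")

config = builder.create_builder_config()
config.set_flag(trt.BuilderFlag.INT8)    # scales come from the Q/DQ nodes; no calibrator is set
config.set_flag(trt.BuilderFlag.FP16)    # allow FP16 for layers left unquantized

engine_bytes = builder.build_serialized_network(network, config)
with open("model_qdq.engine", "wb") as f:
    f.write(engine_bytes)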

ttyio commented 4 months ago

@bernardrb @ApolloRay , we have a SD INT8 sample in https://github.com/NVIDIA/TensorRT/tree/release/10.0/demo/Diffusion, not sure if this helps.