NVIDIA / TensorRT

NVIDIA® TensorRT™ is an SDK for high-performance deep learning inference on NVIDIA GPUs. This repository contains the open source components of TensorRT.
https://developer.nvidia.com/tensorrt
Apache License 2.0

How to deploy an ONNX model with int8 calibration? #557

Closed le8888e closed 4 years ago

le8888e commented 4 years ago

Hello, I'm trying to do int8 calibration on an ONNX model with the C++ API. I see there are INT8 samples for Caffe models and for ONNX MNIST, but how do I quantize an ONNX model? Are there any samples or guides to follow? Thank you.

rmccorm4 commented 4 years ago

Hi,

For INT8 calibration, you'll need to provide your own calibration data and implement an Int8 Calibrator. There's a decent example of some of those things here: https://github.com/rmccorm4/tensorrt-utils/tree/master/classification/imagenet

You can also try quantization-aware training (QAT) when training in the original framework, such as TF, and export this to ONNX with a tool like tf2onnx. I believe there is some support for these FakeQuant* nodes in both tf2onnx and TensorRT: https://docs.nvidia.com/deeplearning/tensorrt/developer-guide/index.html#qat-tf

le8888e commented 4 years ago

@rmccorm4 Hi, thank you for your reply. It helps, and my work is going well now. The int8 version of PSENet is only 30 ms faster than the FP32 version on a V100, which is less of a speedup than I expected. I wish you could give more detailed instructions on int8 calibration. The BatchStream class you offer only supports images in .batch or .ppm format, which is definitely not user-friendly. Thank you for your help again : )

rmccorm4 commented 4 years ago

The link I referenced actually expects .jpeg images out of the box, which is the ImageNet dataset's format: https://www.github.com/rmccorm4/tensorrt-utils/tree/master/classification%2Fimagenet%2FImagenetCalibrator.py

But all it does is read the images into numpy arrays and normalize them. You can represent any kind of data as numpy arrays/matrices/etc. in the same way; you'll just have to tweak the code a little bit.

le8888e commented 4 years ago

@rmccorm4 Yeaaah, but I'm working with the C++ API : ) What I'm trying to say is that the developer guide and samples don't cover certain cases. For example, I'm trying to do int8 calibration on an ONNX model with the C++ API. I can't figure out how to feed in a .jpg image stream, or whether I should build the int8 engine in onnx2TRTmodel() or in loadTRTmodel() to read the calibrationTable as described in your documentation. There are many things we need to figure out ourselves. It would be better if there were more detailed instructions, but you guys still do a great job. Thank you!

rmccorm4 commented 4 years ago

Hi @le8888e ,

Since calibration is typically done offline, my personal recommendation is that using Python will be faster and easier, as there are many tools and libraries to load and normalize data (numpy, tensorflow, pytorch, pycuda, etc.).

You can calibrate using the python API, save the calibration cache to a file, and then load the calibration cache later in C++ if you wish. You can also do the calibration with the C++ API, but I just think it's a bit more complicated to handle the data, and typically requires setting up OpenCV and other libraries for the average use case.

I don't have much experience with the C++ API outside of inference (load engine, create context, run inference on inputs). Everything before the inference stage (parse model, create network, set builder flags, build engine, save engine to file, etc.) can typically be done offline, and therefore with the Python API (or even trtexec); you can then serialize your engine to a file and load it at runtime with the C++ API.
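
For illustration, a minimal sketch of that runtime-only C++ side (the Logger class and loadEngine helper below are just placeholders, assuming an engine file that was built and serialized offline):

#include <fstream>
#include <iostream>
#include <iterator>
#include <string>
#include <vector>

#include "NvInfer.h"

// Minimal logger required by the TensorRT runtime.
class Logger : public nvinfer1::ILogger
{
    void log(Severity severity, const char* msg) noexcept override
    {
        if (severity <= Severity::kWARNING)
            std::cout << msg << std::endl;
    }
};

// Load an engine that was built and serialized offline (Python API or trtexec).
nvinfer1::ICudaEngine* loadEngine(const std::string& enginePath, nvinfer1::IRuntime*& runtime, Logger& logger)
{
    std::ifstream file(enginePath, std::ios::binary);
    if (!file)
        return nullptr;
    std::vector<char> blob((std::istreambuf_iterator<char>(file)), std::istreambuf_iterator<char>());

    // The INT8 scales chosen during calibration are already baked into the serialized engine,
    // so no calibrator is needed at runtime.
    runtime = nvinfer1::createInferRuntime(logger);
    return runtime->deserializeCudaEngine(blob.data(), blob.size());
}

Note that a serialized engine is tied to the GPU and TensorRT version it was built with, so the offline build has to target the deployment platform.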

le8888e commented 4 years ago

Hi @rmccorm4 ,

So in the engine-building phase, we do config->setFlag(BuilderFlag::kINT8); and buildEngineWithConfig(network, config); and then save the engine to a file.

In the inference phase, we load the engine from local disk. If I follow these steps, will inference actually run in int8 mode? Because in my experiment int8 runs only slightly faster than fp32, I'm wondering if some steps are missing.
Here is my main code, would you mind checking it out? : ) tensorRT.txt Also, in which phase does TensorRT read the calibrationTable from local disk? When a calibrator is created, readCalibrationCache() will be called. But in the inference phase, creating the engine from a file does not create a calibrator, so is the calibration data saved in the engine file?

Thank you.

CallmeZhangChenchen commented 4 years ago

I have a question,

The official example says: "To run the AlexNet network on DLA using trtexec in INT8 mode, issue: ./trtexec --deploy=data/AlexNet/AlexNet_N2.prototxt --output=prob --useDLACore=1 --int8 --allowGPUFallback". Can the model be converted directly to int8 this way? And if I add --saveEngine=AlexNet.trt, does that mean AlexNet.trt is already a quantized model?

Why would I need my own dataset and calibration table, or is the official --int8 flag only meant for testing?

llk2why commented 4 years ago

Hi @rmccorm4 , I've generated an offline calibration table with your Python scripts, and now a question comes up: how can I load the calibration table with the C++ API without implementing a full Int8 Calibrator, since the cache already exists? My code has to run on different platforms, so I can't just export offline engines with trtexec. Looking forward to your reply, thank you.

rmccorm4 commented 4 years ago

Hi @llk2why ,

(For others) If your use case can generate engines offline, you can just read in the calibration cache using trtexec:

trtexec --fp16 --int8 --calib=<calibration_cache_file> --onnx=model.onnx

My code has to run on different platforms, so I cannot just export offline engines with trtexec

You can implement a very simple/minimal calibrator, where I believe the only methods you actually need to implement are readCalibrationCache and writeCalibrationCache.

For every other method, I believe you can just give a dummy implementation, to make it clear that it expects a pre-calibrated cache file:

{
    throw std::runtime_error{"Not Implemented"};
}

Some extra notes:

  1. You may also be able to just use/call the trtexec source code in your application: https://github.com/NVIDIA/TensorRT/blob/master/samples/opensource/trtexec/trtexec.cpp
  2. Sample calibrator implementation here: https://github.com/NVIDIA/TensorRT/blob/master/samples/common/EntropyCalibrator.h. The type of calibrator used/implemented here should not matter, as the scales are already fixed in the calibration cache file and are simply read in by your calibrator implementation to set the dynamic ranges of each tensor to (-scale, +scale).
  3. Calibrator implementation used by trtexec here: https://github.com/NVIDIA/TensorRT/blob/master/samples/common/sampleEngines.cpp#L157-L252

You should be able to take out the calibrator parts from some of these links above and use your calibration cache file.
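
For illustration, a rough sketch of such a cache-only calibrator (the class name and file handling are just placeholders, not the exact code from the samples above):

#include <cstddef>
#include <fstream>
#include <iterator>
#include <string>
#include <vector>

#include "NvInfer.h"

// Calibrator that only replays an existing calibration cache; it never sees real data.
class CacheOnlyCalibrator : public nvinfer1::IInt8EntropyCalibrator2
{
public:
    explicit CacheOnlyCalibrator(const std::string& cachePath) : mCachePath(cachePath) {}

    // Still gets queried by the builder, so return something sane.
    int getBatchSize() const noexcept override { return 1; }

    // Returning false tells TensorRT there is no calibration data to fetch.
    bool getBatch(void** bindings, const char** names, int nbBindings) noexcept override { return false; }

    // Hand the previously generated cache back to TensorRT.
    const void* readCalibrationCache(size_t& length) noexcept override
    {
        std::ifstream file(mCachePath, std::ios::binary);
        mCache.assign(std::istreambuf_iterator<char>(file), std::istreambuf_iterator<char>());
        length = mCache.size();
        return mCache.empty() ? nullptr : mCache.data();
    }

    // Nothing new to write, since calibration already happened offline.
    void writeCalibrationCache(const void* cache, size_t length) noexcept override {}

private:
    std::string mCachePath;
    std::vector<char> mCache;
};

With a valid cache file present, TensorRT reads the per-tensor scales from readCalibrationCache() and should never need getBatch(); the calibrator is passed via config->setInt8Calibrator() alongside config->setFlag(BuilderFlag::kINT8).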

llk2why commented 4 years ago

@rmccorm4 Thank you, it seems to make sense, I will give it a try right now.

llk2why commented 4 years ago

@rmccorm4 It works, but apart from this:

You can implement a very simple/minimal calibrator, where I believe the only methods you actually need to implement are readCalibrationCache and writeCalibrationCache.

getBatchSize got called as well, so I just implemented it with return 1;. I'm not sure whether the batch_size has any side effects.

rmccorm4 commented 4 years ago

Thanks for the update @llk2why. Going to close this as the description and resolution seemed to work for you.

cocoyen1995 commented 4 years ago

Hi @le8888e ,

I'm trying to convert an ONNX model (UNet in my case) to an INT8 engine with C++, as you did before. I've searched for a while but didn't find any example of creating the calibration file with C++ (only found many in Python). Since I'm on a Windows computer and can't install the TensorRT Python package, that approach doesn't work for me as I'd hoped... I'd like to know how you read the images to create the calibration file and completed the INT8 engine conversion in C++. Could you please share how to do so in detail? Thanks in advance for any help or advice!

le8888e commented 4 years ago

Hi @cocoyen1995 ,

First you need to implement a class like int8EntroyCalibrator, as in this file: tensorRT.txt

Then, in the step that converts the ONNX model to a TRT engine, declare an instance of int8EntroyCalibrator, e.g. calibrator = new int8EntroyCalibrator(maxBatchSize, calibration_images, calibration_table_save_path);

Then pass calibrator to config->setInt8Calibrator(calibrator);

config is declared by auto config = SampleUniquePtr<nvinfer1::IBuilderConfig>(builder->createBuilderConfig());

Remember that you have to do exactly the same image preprocessing for calibration as for inference. You can refer to the prepareImage function in the file I uploaded.

For more details, you can refer to TensorRT's official INT8 example code.
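
Condensed, the INT8-specific wiring looks roughly like this (int8EntroyCalibrator and its constructor arguments follow the attached tensorRT.txt; the surrounding builder/network setup is assumed to already exist):

// INT8 portion of the build, assuming builder, network and parser are already created
// and int8EntroyCalibrator is the class from the attached tensorRT.txt.
nvinfer1::IBuilderConfig* config = builder->createBuilderConfig();
config->setFlag(nvinfer1::BuilderFlag::kINT8);

// The calibrator reads the image list, preprocesses each image (see prepareImage),
// and writes/reads the calibration table at the given path.
auto* calibrator = new int8EntroyCalibrator(maxBatchSize, calibration_images, calibration_table_save_path);
config->setInt8Calibrator(calibrator);

nvinfer1::ICudaEngine* engine = builder->buildEngineWithConfig(*network, *config);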

Hope this helps. Feel free to write in Chinese if you like, since my English is not that good and may be confusing lol

cocoyen1995 commented 4 years ago

Hi, @le8888e ,

Thanks for your quick reply! I'll try that out and let you know the result ^^ Have a nice weekend! If I run into other questions along the way I'll ask again; thanks in advance!

cocoyen1995 commented 4 years ago

Hi @le8888e ,

Sorry to bother you again. I just took a closer look at the code you provided and have a question about the prepareImage() function: my model was trained with HWC ordering, and the inference code I've written so far also uses HWC, so do I need to convert to CHW? Also, if my model's input was originally normalized with 1/255.0, do I still need to change that to 1/127.5? (Or, if it's convenient, could I add you on WeChat to discuss further? My id is cocococoyenyen. This discussion seems to be getting into details, and I'm not sure it's appropriate to ask here.)

250zhanghu commented 3 years ago

Hi @le8888e ,

I have the same issue: should I convert the inputIOFormats from CHW to HWC, and how do I do that? I would really appreciate any help you can give.

cocoyen1995 commented 3 years ago

Hi @250zhanghu ,

My model was trained with HWC ordering, so I just run inference in that format. If your model was trained with CHW, then you have to run prepareImage() to convert the input data to CHW format.
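
For reference, a common way to repack OpenCV's interleaved HWC layout into planar CHW (just a generic sketch, not the prepareImage() from the attached file):

#include <opencv2/opencv.hpp>
#include <vector>

// Repack an interleaved HWC float image (OpenCV's native layout) into planar CHW order.
// img must already be CV_32F (e.g. after convertTo).
std::vector<float> hwcToChw(const cv::Mat& img)
{
    const int channels = img.channels();
    std::vector<float> chw(static_cast<size_t>(channels) * img.rows * img.cols);
    std::vector<cv::Mat> planes;
    for (int c = 0; c < channels; ++c)
    {
        // Each plane is a Mat header over a slice of the output buffer,
        // so cv::split writes channel c directly into its CHW position.
        planes.emplace_back(img.rows, img.cols, CV_32FC1,
                            chw.data() + static_cast<size_t>(c) * img.rows * img.cols);
    }
    cv::split(img, planes.data());
    return chw;
}

Whether you need this depends on the layout your network actually expects; if both the model and the engine input are HWC, you can skip it.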

By the way, during the trial-and-error calibration process that @le8888e discussed with me, there's one value that can change the calibration result: the beta value of OpenCV's convertTo() function.

bool int8EntroyCalibrator::getBatch(void **bindings, const char **names, int nbBindings) 
{
    if (imageIndex + batchSize > int(imgPaths.size()))
        return false;
    std::cout << "*** calibrate with convertTO beta = " << beta << " ***" << std::endl;
    float* ptr = batchData;
    for (size_t j = imageIndex; j < imageIndex + batchSize; ++j)
    {
        std::cout << "loading image " << imgPaths[j] << "  " << (j + 1)*100. / imgPaths.size() << "%" << std::endl;

        cv::Mat img = cv::imread(imgPaths[j]); 
        img.convertTo(img, CV_32FC(MD_size[2]), 1 / 255.0, beta);  ///// here
        memcpy(ptr, img.ptr<float>(0), MD_size[0]* MD_size[1] * sizeof(float));
    }
    imageIndex += batchSize;
    CHECK(cudaMemcpy(deviceInput, batchData, inputCount * sizeof(float), cudaMemcpyHostToDevice));
    bindings[0] = deviceInput;
    return true;
}

It seems that the default value of beta is 0, but in some of my cases, I have to set it to -0.4 (darker) or 2.5 (brighter).
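
For reference, convertTo() computes dst = src * alpha + beta for every element (with saturation), so alpha rescales the pixel values and beta shifts them afterwards. A tiny illustration with the values discussed above (the function and file names are just placeholders):

#include <opencv2/opencv.hpp>

// convertTo computes dst(x) = saturate_cast<dst_type>(src(x) * alpha + beta) per element.
void demoBeta()
{
    cv::Mat img = cv::imread("sample.jpg");               // 8-bit BGR, values in [0, 255]
    cv::Mat scaled, darker, brighter;
    img.convertTo(scaled,   CV_32FC3, 1 / 255.0,  0.0);   // default beta: values stay in [0, 1]
    img.convertTo(darker,   CV_32FC3, 1 / 255.0, -0.4);   // beta = -0.4: values in [-0.4, 0.6] (darker)
    img.convertTo(brighter, CV_32FC3, 1 / 255.0,  2.5);   // beta =  2.5: values in [2.5, 3.5] (brighter)
}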

Hope this answers your question, good luck! 💪

250zhanghu commented 3 years ago

@cocoyen1995
Thanks very much! I see you can speak Chinese, that's great! Hello, I trained a classification network with TensorFlow, using HWC input ordering during training, and then converted the model to ONNX with tf2onnx. I used trtexec from TensorRT's bin directory to convert the ONNX model to a TRT engine and set up the dynamic dimensions. After the conversion I noticed that the default value of --inputIOFormats is CHW, yet the conversion didn't report any errors. I don't understand why that is.

cocoyen1995 commented 3 years ago

Hi @250zhanghu ,

Sorry, I've never converted a model with trtexec, so I'm not really sure what that parameter means or how its conversion flow works. If the conversion didn't report any errors, does inference also use CHW? I wonder if it might have converted the model's channel order for you automatically.

Here is how I convert the model (onnx -> trt):

int ONNX2TRT(json Jconfig, char* onnxFileName, char* trtFileName, int batchSize, int run_fp16)
{
    if (_access(onnxFileName, 02) != 0)
    {
        return -1;
    }

    // create the builder
    IBuilder* builder = createInferBuilder(sample::gLogger.getTRTLogger());
    if (builder == nullptr)
    {
        return -2;
    }

    // Now We Have BatchSize Here
    nvinfer1::INetworkDefinition* network = builder->createNetworkV2(batchSize);

    nvinfer1::IBuilderConfig* config = builder->createBuilderConfig();

    auto parser = nvonnxparser::createParser(*network, sample::gLogger.getTRTLogger());

    if (!parser->parseFromFile(onnxFileName, static_cast<int>(sample::gLogger.getReportableSeverity())))
    {
        return -3;
    }
    builder->setMaxBatchSize(batchSize);
    builder->setMaxWorkspaceSize(2_GiB);
    config->setMaxWorkspaceSize(2_GiB);
    if (run_fp16)
    {
        std::cout << "***USING FP16***\n";
        config->setFlag(BuilderFlag::kFP16);
    }
    else
    {
        std::cout << "***USING INT8***\n";
        config->setFlag(BuilderFlag::kINT8);

        // provided by @le8888e at https://github.com/NVIDIA/TensorRT/issues/557
        std::string calibration_imgs_list = Jconfig["cali_image_path"].get<std::string>();       //file listing the calibration images' paths, one sample per line
        //std::string calibration_table_save_path = Jconfig["cali_save_path"].get<std::string>();  //path to save calibration table
        std::string calibration_table_save_path = "./secret_path/cache_data.cache";  //path to save calibration table
        std::vector<int> MD_size = Jconfig["model_input_size"];
        float beta = Jconfig["int8_beta"].get<float>();
        std::cout << "beta:" << beta << std::endl;

        int8EntroyCalibrator *calibrator = nullptr;
        calibrator = new int8EntroyCalibrator(1, calibration_imgs_list, calibration_table_save_path, MD_size, beta);
        config->setInt8Calibrator(calibrator);

    }

    samplesCommon::enableDLA(builder, config, -1);

    ICudaEngine* engine = builder->buildEngineWithConfig(*network, *config);
    if (!engine)
    {
        return -4;
    }

    // we can destroy the parser
    parser->destroy();

    // serialize the engine, then close everything down
    IHostMemory* trtModelStream = engine->serialize();

    engine->destroy();
    network->destroy();
    builder->destroy();

    if (!trtModelStream)
    {
        return -5;
    }

    ofstream ofs(trtFileName, std::ios::out | std::ios::binary);
    ofs.write((char*)(trtModelStream->data()), trtModelStream->size());
    ofs.close();
    trtModelStream->destroy();

    return 0;
}

With this conversion, both the .onnx and the .trt keep the HWC channel order. (Sorry I can't really answer your question 😢)

250zhanghu commented 3 years ago

@cocoyen1995 You're too kind, thank you very much. I'll try your method~ Thanks!

htran170642 commented 3 years ago

@cocoyen1995 Could you please share the code file for converting an ONNX model with int8 via my gmail: hieptientran196@gmail.com ? Thank you so much.