WongKinYiu / yolor

implementation of paper - You Only Learn One Representation: Unified Network for Multiple Tasks (https://arxiv.org/abs/2105.04206)
GNU General Public License v3.0

Unable to convert yolor_p6 to TensorRT #26

Open borijang opened 3 years ago

borijang commented 3 years ago

Thanks for this repository! I managed to convert a trained model to ONNX using convert_onnx.py, but I can't manage to convert it to TensorRT for inference on a Jetson Xavier NX.

I have included the TensorRT (v7.1.3) build output below:

----------------------------------------------------------------
Input filename:   best.onnx
ONNX IR version:  0.0.6
Opset version:    11
Producer name:    pytorch
Producer version: 1.8
Domain:           
Model version:    0
Doc string:       
----------------------------------------------------------------
[06/10/2021-17:40:43] [W] [TRT] onnx2trt_utils.cpp:220: Your ONNX model has been generated with INT64 weights, while TensorRT does not natively support INT64. Attempting to cast down to INT32.
[06/10/2021-17:40:43] [W] [TRT] onnx2trt_utils.cpp:246: One or more weights outside the range of INT32 was clamped
[06/10/2021-17:40:43] [W] [TRT] onnx2trt_utils.cpp:246: One or more weights outside the range of INT32 was clamped
[06/10/2021-17:40:43] [W] [TRT] onnx2trt_utils.cpp:246: One or more weights outside the range of INT32 was clamped
[06/10/2021-17:40:43] [W] [TRT] onnx2trt_utils.cpp:246: One or more weights outside the range of INT32 was clamped
[06/10/2021-17:40:43] [W] [TRT] onnx2trt_utils.cpp:246: One or more weights outside the range of INT32 was clamped
[06/10/2021-17:40:43] [W] [TRT] onnx2trt_utils.cpp:246: One or more weights outside the range of INT32 was clamped
[06/10/2021-17:40:43] [W] [TRT] onnx2trt_utils.cpp:246: One or more weights outside the range of INT32 was clamped
[06/10/2021-17:40:43] [W] [TRT] onnx2trt_utils.cpp:246: One or more weights outside the range of INT32 was clamped
[06/10/2021-17:40:43] [W] [TRT] onnx2trt_utils.cpp:246: One or more weights outside the range of INT32 was clamped
[06/10/2021-17:40:43] [I] [TRT] ModelImporter.cpp:135: No importer registered for op: ScatterND. Attempting to import as plugin.
[06/10/2021-17:40:43] [I] [TRT] builtin_op_importers.cpp:3659: Searching for plugin: ScatterND, plugin_version: 1, plugin_namespace: 
[06/10/2021-17:40:43] [E] [TRT] INVALID_ARGUMENT: getPluginCreator could not find plugin ScatterND version 1
ERROR: builtin_op_importers.cpp:3661 In function importFallbackPluginImporter:
[8] Assertion failed: creator && "Plugin not found, are the plugin name, version, and namespace correct?"
[06/10/2021-17:40:43] [E] Failed to parse onnx file
[06/10/2021-17:40:43] [E] Parsing model failed
[06/10/2021-17:40:43] [E] Engine creation failed
[06/10/2021-17:40:43] [E] Engine set up failed
&&&& FAILED TensorRT.trtexec # trtexec --onnx=best.onnx

Any ideas on how to solve the ScatterND issue? It seems to come from a broadcasting/assignment operation unsupported by TRT. Maybe using a different opset version than 11, or rewriting all the lines that use ellipsis indexing?

LukeAI commented 3 years ago

@borijang where is convert_onnx.py? I can't find it!

LukeAI commented 3 years ago

It looks like this is a known issue: https://github.com/NVIDIA/TensorRT/issues/805

It's unclear whether it will be implemented at some point or not.

It looks like you can get around this by rewriting the part of the model that uses subscript assignment, as in the sketch below. https://paulbridger.com/posts/tensorrt-object-detection-quantized/
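
A minimal sketch of the kind of rewrite that post describes (the decode transform here is only illustrative, not the actual yolor code): in-place slice assignment exports to ONNX as ScatterND, while concatenating freshly computed slices avoids it.

import torch

# Exports as ScatterND: in-place ellipsis/slice assignment.
def decode_inplace(y):
    y[..., 0:2] = y[..., 0:2] * 2.0 - 0.5   # xy
    y[..., 2:4] = (y[..., 2:4] * 2.0) ** 2  # wh
    return y

# ScatterND-free rewrite: compute the slices, then torch.cat them back together.
def decode_cat(y):
    xy = y[..., 0:2] * 2.0 - 0.5
    wh = (y[..., 2:4] * 2.0) ** 2
    return torch.cat((xy, wh, y[..., 4:]), dim=-1)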

borijang commented 3 years ago

Sorry, I thought that script was from this repo, but I must have reused it from somewhere else.

I am aware of the TRT issue, but I am not sure where the problem arises in the yolor code.

LukeAI commented 3 years ago

would you share the script here?

borijang commented 3 years ago

Sure, here you go:

import argparse

import torch
import yaml

from models.models import Darknet, load_darknet_weights
from utils.torch_utils import select_device


class Params:
    def __init__(self, project_file):
        self.params = yaml.safe_load(open(project_file).read())

    def __getattr__(self, item):
        return self.params.get(item, None)


def parse_arguments():
    parser = argparse.ArgumentParser()
    parser.add_argument('--input', type=str, default='models/yolor_p6.pt',
                        help="Path to the input PyTorch checkpoint (.pt)")
    parser.add_argument('--output', type=str, default='models/yolor_p6.onnx',
                        help="Desired path of the converted ONNX model")
    parser.add_argument('--config', type=str, default='config/coco.yaml', help="Path of the config file")
    parser.add_argument('--model', type=str, default='config/yolor_p6.cfg', help="Path of the model configuration")
    parser.add_argument('--width', type=int, default=1280, help="Input width of the exported model (in pixels)")
    parser.add_argument('--height', type=int, default=1280, help="Input height of the exported model (in pixels)")
    parser.add_argument('--batch-size', type=int, default=1, help="Batch size of the exported model (default=1)")
    return parser.parse_args()


if __name__ == '__main__':
    args = parse_arguments()
    params = Params(args.config)

    print(params.params)
    print(len(params.params))

    device = select_device("cpu", batch_size=args.batch_size)

    # Build the model from the .cfg and load the trained weights.
    model = Darknet(args.model).to(device)
    try:
        ckpt = torch.load(args.input, map_location=device)  # load checkpoint
        # Keep only entries whose shapes still match the model.
        ckpt['model'] = {k: v for k, v in ckpt['model'].items() if model.state_dict()[k].numel() == v.numel()}
        model.load_state_dict(ckpt['model'], strict=False)
    except Exception:
        # Not a PyTorch checkpoint; assume darknet-format weights.
        load_darknet_weights(model, args.input)

    # NCHW dummy input: (batch, channels, height, width).
    dummy_input = torch.randn((args.batch_size, 3, args.height, args.width), dtype=torch.float32).to(device)
    print("Exporting the model using onnx:")
    torch.onnx.export(model, dummy_input,
                      args.output,
                      verbose=False,
                      input_names=['data'],
                      opset_version=11)
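
For reference, a typical invocation with the defaults above (convert_onnx.py being whatever name you saved the script under):

python convert_onnx.py --input models/yolor_p6.pt --output models/yolor_p6.onnx --model config/yolor_p6.cfg --config config/coco.yaml
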
LukeAI commented 3 years ago

@WongKinYiu do you know if there are other obstacles to exporting to TensorRT? Say we managed to deal with the ScatterND issue by rewriting the lines with subscript assignment - are there other ops that TensorRT doesn't support?

WongKinYiu commented 3 years ago

I do not know how to use TensorRT, but one of our team members helped to convert and deploy the model for our system.

LukeAI commented 3 years ago

hmm ok - would it be possible for us to get their code? The ability to export to TensorRT would be my #1 feature request!

satheeshkatipomu commented 3 years ago

@WongKinYiu, I am not sure whether it is only me, but converting to ONNX using models/export.py is not working. First I get an import error, and I think the code for loading the checkpoint needs to be fixed.

satheeshkatipomu commented 3 years ago

@borijang, the script you shared above is not working for me to convert the checkpoint to ONNX format. Have you made any changes to the Darknet class? I am getting this error. Can you please help?

RuntimeError: Exporting the operator silu to ONNX opset version 11 is not supported. Please open a bug to request ONNX export support for the missing operator.

JonathanSamelson commented 3 years ago

I'm getting the same error (RuntimeError: Exporting the operator silu to ONNX opset version 11 is not supported) using the script.

Using models/export.py (after commenting out attempt_download), I get:

Traceback (most recent call last):
  File "models/export.py", line 21, in <module>
    model = torch.load(opt.weights, map_location=torch.device('cpu'))['model'].float()
AttributeError: 'collections.OrderedDict' object has no attribute 'float'
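
For what it's worth, that AttributeError means torch.load(...)['model'] returned a bare state_dict (an OrderedDict) rather than a pickled model object, so there is no .float() to call. A minimal sketch of a workaround, assuming the repo's Darknet class and a cfg matching the checkpoint (paths are illustrative):

import torch
from models.models import Darknet

ckpt = torch.load('yolor_p6.pt', map_location='cpu')
# Unwrap {'model': state_dict} checkpoints; otherwise assume a raw state_dict.
state = ckpt['model'] if isinstance(ckpt, dict) and 'model' in ckpt else ckpt
model = Darknet('cfg/yolor_p6.cfg')
model.load_state_dict(state, strict=False)
model.float().eval()
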
borijang commented 3 years ago

@satheeshkatipomu @JonathanSamelson I haven't modified Darknet. It works for me with the Docker image nvcr.io/nvidia/pytorch:21.03-py3. It may be a PyTorch version issue; try upgrading to 1.9.0.

JonathanSamelson commented 3 years ago

@borijang Perfect, it's working now using this docker image. Thanks a lot!

LukeAI commented 3 years ago

@JonathanSamelson what are you doing with your ONNX? have you managed to get tensorrt inference working?

JonathanSamelson commented 3 years ago

@LukeAI Sorry, I'm using ONNX for Python inference, I do not know for tensorrt 😕

TheConstant3 commented 3 years ago

@borijang thank you! yolor_p6 exported to ONNX!

But now I cannot export from ONNX to TensorRT:

Traceback (most recent call last):
  File "onnx_to_trt.py", line 10, in <module>
    engine = backend.prepare(model, device='CUDA:0')
  File "/opt/conda/lib/python3.8/site-packages/onnx_tensorrt-7.2.2.3.0-py3.8.egg/onnx_tensorrt/backend.py", line 236, in prepare
  File "/opt/conda/lib/python3.8/site-packages/onnx_tensorrt-7.2.2.3.0-py3.8.egg/onnx_tensorrt/backend.py", line 68, in __init__
RuntimeError: While parsing node number 642:
/home/jenkins/workspace/OSS/L0_MergeRequest/oss/parsers/onnx/builtin_op_importers.cpp:4135 In function importFallbackPluginImporter:
[8] Assertion failed: creator && "Plugin not found, are the plugin name, version, and namespace correct?"

I also tried converting with torch2trt and got the warning Warning: Encountered known unsupported method torch.Tensor.expand_as, and the result of

x = torch.ones((1, 3, 640, 640)).cuda()
y = model(x)
y_trt = model_trt(x)
torch.max(torch.abs(y - y_trt))

was very large: tensor(784.60699, device='cuda:0', grad_fn=<MaxBackward1>)

As in this issue, I replaced expand_as(x) with expand(x.size()) (illustrated below) and got the error AttributeError: 'Parameter' object has no attribute '_trt'.
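
For illustration only (implicit_param is a hypothetical name, not from the yolor source), the swap looks like this:

# before: torch2trt flags expand_as as an unsupported method
out = implicit_param.expand_as(x) * x
# after: the same broadcast written with expand() and an explicit size
out = implicit_param.expand(x.size()) * x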

@borijang @satheeshkatipomu @JonathanSamelson have you faced the same problems? And how did you solve them?

thanks in advance!

NNDam commented 3 years ago

I've successfully converted Yolor_x (yolor_csp_x_star.pt) from torch to ONNX, and also to TensorRT, with some modifications in models/models.py; the main one is avoiding broadcasting (so the ScatterND plugin is not needed). You can see my modifications here (sorry for my uncleaned code).

LukeAI commented 3 years ago

@NNDam thanks for sharing, I'll give it a test. Do you think it will likely work for the other models as well?

LukeAI commented 3 years ago

@NNDam I trained a quick model on a small private dataset and was unable to get it to run with the existing TensorRT C++ code that I use with scaled-yolo - would you mind sharing your inference code for reference?

LukeAI commented 3 years ago

I notice that getNbBindings() reports five bindings - the first is the right size for the input - what are the other 4?

NNDam commented 3 years ago

@LukeAI there were 3 unused output layers (from the 3 yolo detect layers); remove them with onnx_graphsurgeon, for example:

import onnx
import onnx_graphsurgeon as gs
from onnx import shape_inference

input_model_path = 'yolor_x.onnx'
output_model_path = 'yolor_x_cleaned.onnx'

onnx_module = shape_inference.infer_shapes(onnx.load(input_model_path))

# Drop every graph output except the final 'output' tensor. The removal is
# wrapped in a while loop because removing from the repeated field while
# iterating over it can skip entries.
while len(onnx_module.graph.output) != 1:
    for output in onnx_module.graph.output:
        if output.name != 'output':
            print('--> remove', output.name)
            onnx_module.graph.output.remove(output)

# Prune the now-dangling nodes and fold constants before saving.
graph = gs.import_onnx(onnx_module)
graph.cleanup()
graph.toposort()
graph.fold_constants().cleanup()
onnx.save_model(gs.export_onnx(graph), output_model_path)
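
With those extra outputs removed, an engine built from the cleaned model should expose just two bindings: the 'data' input and the single 'output' tensor.
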
LukeAI commented 3 years ago

ok, thanks! I have done as you advised but am still struggling to get inference working... are you planning to release your inference code?

NNDam commented 3 years ago

Check my inference code here

dimleve commented 2 years ago

Hello all, I have managed to export a custom yolor model (9 classes) to ONNX format using @NNDam's code. My issue is that the output dimensions still correspond to the 85 (80 + 5) COCO channels. Does anyone know what I should do to get the correct export? Thanks all for the very useful information in this thread!

JonathanSamelson commented 2 years ago

Hi @dimleve, this issue I opened might be related to your problem, even though I haven't figured it out yet.

LukeAI commented 2 years ago

@dimleve have you tried setting the number of classes in the yolo layers in the config? (You will also have to set the correct number of filters in the preceding convolutional layer.)

i.e. here and in the other two yolo layers.

filters should be (classes + 5) x 3 = 42 for 9 classes; see the sketch below.

To be honest I'm not 100% sure that's correct about the filters - that's how it was in older YOLOs, but I'm not sure whether this [control_channels] thing disrupts that.
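
For what it's worth, a sketch of that cfg edit for 9 classes (only the changed keys shown; the surrounding layer definitions stay as they are):

[convolutional]
# ...other keys unchanged...
filters=42        # (classes + 5) * 3 = (9 + 5) * 3

[yolo]
# ...other keys unchanged...
classes=9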

dimleve commented 2 years ago

Thanks both @LukeAI and @JonathanSamelson, I will check and come back with my findings. I get the following error: copying a param with shape torch.Size([255]) from checkpoint, the shape in current model is torch.Size([42]). Not sure, but it seems the custom YOLOR model still has the 85-class setting ((80 + 5) x 3 = 255) although I am explicitly setting nc = 9 in the data.yaml configuration; I need to check further.

LukeAI commented 2 years ago

Maybe the filters also have to be set to 42 at this point too? Not sure: https://github.com/WongKinYiu/yolor/blob/2fa3a318f364a4eb58721c90e5a978a78f0da58a/cfg/yolor_csp_x.cfg#L1433

But maybe if you already trained with the cfg file set to filters=255 etc., then that's what your checkpoint has, so you will just need to run with that many outputs? I guess your 9 classes will be represented in the first 9?

Looking at train.py - the model is created using the cfg file, not the yaml.

dimleve commented 2 years ago

@LukeAI It seems you are right: no need to modify anything, and my classes are represented in the first 9 outputs. I will check further and verify, thank you!

htran170642 commented 2 years ago

Have you guys successfully converted and run yolor on a Jetson Xavier? Thanks

kdy136811 commented 2 years ago

Hi all, I've successfully converted my custom yolor model to TensorRT and run it on a Jetson Xavier AGX! To fix the ScatterND issue, just upgrade JetPack to the latest version, 4.6. It contains TensorRT 8.0 (which ships the ScatterND plugin).
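
A quick way to confirm which runtime you ended up with (a minimal sketch; assumes the Python tensorrt bindings bundled with JetPack):

import tensorrt as trt

# JetPack 4.6 bundles TensorRT 8.0, which ships the ScatterND plugin that
# the 7.x parser could not find.
print(trt.__version__)
assert int(trt.__version__.split('.')[0]) >= 8, 'upgrade JetPack / TensorRT to 8.x'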