NVIDIA / TensorRT

NVIDIA® TensorRT™ is an SDK for high-performance deep learning inference on NVIDIA GPUs. This repository contains the open source components of TensorRT.
https://developer.nvidia.com/tensorrt
Apache License 2.0

[ONNXParser] TensorRT Fails to Load ONNX Checkpoints with Separated Weight and Bias Files #4257

Open theanh-ktmt opened 3 days ago

theanh-ktmt commented 3 days ago

Description

When attempting to load ONNX checkpoints that have separated weight and bias files (common for ONNX files larger than 2GB) in TensorRT, the framework searches for these element files (weights and biases) in the current working directory instead of the directory that contains the ONNX checkpoint. This behavior is inconsistent with the ONNX export process, which places all element files into a single directory.
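As a minimal illustration of the path-resolution mismatch (pure Python, no TensorRT required; the file names are taken from this issue), ONNX records each external tensor's `location` as a path relative to the model file, so a loader should join it with the model's directory rather than with the current working directory:

```python
import os

# Hypothetical sketch of the two resolution strategies.
# ONNX stores the external tensor "location" relative to the model file.
model_path = "save/stdit3_simplified.onnx"
location = "stdit3_simplified.onnx.data"   # as recorded in the initializer

# What TensorRT appears to do: resolve against the current working directory.
wrong = os.path.normpath(os.path.join(os.getcwd(), location))

# What onnx.load does: resolve against the model's own directory.
right = os.path.normpath(os.path.join(os.path.dirname(model_path), location))

print(wrong)   # <cwd>/stdit3_simplified.onnx.data -> file not found
print(right)   # save/stdit3_simplified.onnx.data  -> where export put it
```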

Error Message

When trying to load the exported ONNX checkpoints, the following error message is encountered:

[11/20/2024-18:55:41] [TRT] [E] WeightsContext.cpp:178: Failed to open file: stdit3_simplified.onnx.data
[11/20/2024-18:55:41] [TRT] [E] In node -1 with name:  and operator:  (parseGraph): INVALID_GRAPH: Failed to import initializer
In node -1 with name:  and operator:  (parseGraph): INVALID_GRAPH: Failed to import initializer

Proposed Solution

The current workaround is to move all element files into the working directory from which TensorRT is invoked. However, this approach quickly becomes cumbersome and disorganized, especially for large models with hundreds of element files.

# error
current-working-directory
└── save
    ├── stdit3.onnx
    ├── linear.1.bias
    ├── linear.1.weight
    └── ...

# proposed solution
current-working-directory
├── save
│   └── stdit3.onnx
├── linear.1.bias
├── linear.1.weight
└── ...

A more systematic solution might be needed to handle the organization of these files effectively.
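Until then, one stopgap that avoids scattering weight files across the working directory is to temporarily switch the working directory to the checkpoint's folder while parsing. A sketch using only the standard library (the `save/` layout is assumed from this issue):

```python
import contextlib
import os

@contextlib.contextmanager
def working_directory(path):
    """Temporarily chdir to `path`, restoring the previous CWD afterwards."""
    old = os.getcwd()
    os.chdir(path)
    try:
        yield
    finally:
        os.chdir(old)

# Usage sketch: parse from inside "save/" so TensorRT finds the element
# files (e.g. linear.1.weight) sitting next to stdit3.onnx.
# with working_directory("save"):
#     parser.parse(open("stdit3.onnx", "rb").read())
```

This keeps the export layout intact and restores the original working directory even if parsing raises an exception.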

Environment

Relevant Files

https://drive.google.com/drive/folders/1nLdYn8nDPs79ZKNx8TssdQSj4x-p4qfl?usp=sharing

Steps To Reproduce

  1. Download the ONNX checkpoint and its data from the Google Drive link above, then structure the working directory like this:
    current-working-directory
    └── save
        ├── stdit3_simplified.onnx
        └── stdit3_simplified.onnx.data
  2. Try to load with ONNX (success)
    import onnx
    path = "save/stdit3_simplified.onnx"
    model = onnx.load(path)
  3. Try to parse with TensorRT (fails with the error message above):

    import tensorrt as trt

    path = "save/stdit3_simplified.onnx"
    trt_logger = trt.Logger()
    explicit_batch_flag = 1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH)

    with trt.Builder(trt_logger) as builder, \
         builder.create_network(explicit_batch_flag) as network, \
         builder.create_builder_config() as config:
        parser = trt.OnnxParser(network, trt_logger)
        with open(path, 'rb') as model:
            if not parser.parse(model.read()):
                for error in range(parser.num_errors):
                    print(parser.get_error(error))
        print('Completed parsing ONNX model')
  4. Re-structure the working directory like this:
    current-working-directory
    ├── save
    │   └── stdit3_simplified.onnx
    └── stdit3_simplified.onnx.data
  5. Try to parse again with the TensorRT ONNX parser (**success**)