NVlabs / SegFormer

Official PyTorch implementation of SegFormer
https://arxiv.org/abs/2105.15203

Custom SegFormer model with pretrained weights ("nvidia/segformer-b0-finetuned-ade-512-512") exported to ONNX takes longer for inference compared to PyTorch! #149

Open Fazankabir opened 4 months ago

Fazankabir commented 4 months ago

installed packages:

python=3.10.0
onnxruntime-gpu=1.17.0
pytorch=2.2.2
pytorch-cuda=11.8
pytorch-lightning=1.9.3
transformers=4.41.2

I trained a SegFormer model starting from the pretrained weights mentioned above, with an additional classifier layer:

from pytorch_lightning import LightningModule
from torch import nn
from transformers import SegformerForSemanticSegmentation

class SegformerFinetuner(LightningModule):
    def __init__(self, model_name="nvidia/segformer-b0-finetuned-ade-512-512", learning_rate=1e-4):
        super().__init__()
        # SegFormer backbone: 3 input channels, 150 output channels (ADE20K classes)
        self.model = SegformerForSemanticSegmentation.from_pretrained(model_name)
        # extra head that maps the 150-channel logits down to a single-channel output
        self.classifier = nn.Sequential(
            nn.Conv2d(150, 64, kernel_size=3, padding=1),
            nn.Dropout(0.1),
            nn.ReLU(),
            nn.Conv2d(64, 1, kernel_size=1),
        )
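
For context on how the pieces fit together, the forward pass feeds the 150-channel SegFormer logits into the extra head, roughly like the sketch below (my exact training code is omitted here; the upsampling size and mode are assumptions):

    def forward(self, x):
        # SegFormer returns 150-channel logits at 1/4 of the input resolution
        logits = self.model(pixel_values=x).logits
        # upsample to the input size before the extra 1-channel head (bilinear is an assumption)
        logits = nn.functional.interpolate(logits, size=x.shape[-2:], mode="bilinear", align_corners=False)
        return self.classifier(logits)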

After training, I exported it to ONNX with:

import torch

# rebuild the fine-tuned module and load the trained weights
inference_model = SegformerFinetuner()
inference_model.load_state_dict(torch.load(".\\pt_model\\trained_model.pt"))
inference_model.cuda()
inference_model.eval()

# dummy input used for tracing the graph during export
dummy_input = torch.randn(1, 3, 384, 384, requires_grad=True).cuda()

# mark batch, height and width as dynamic so the exported model accepts variable sizes
dynamic_axes = {
    "input": {0: "batch", 2: "height", 3: "width"},
    "output": {0: "batch", 2: "height", 3: "width"},
}

# export the model to ONNX
torch.onnx.export(inference_model, dummy_input, f"model_{model_id}.onnx",
                  input_names=["input"], output_names=["output"],
                  do_constant_folding=True,
                  opset_version=16, dynamic_axes=dynamic_axes, verbose=False,
                  export_params=True)
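
Before benchmarking, the exported graph can at least be sanity-checked (a small sketch, not part of my original script):

import onnx

# structural validation of the exported graph
onnx_model = onnx.load(f"model_{model_id}.onnx")
onnx.checker.check_model(onnx_model)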

import time

import onnxruntime as onnxrt
from onnxruntime import GraphOptimizationLevel, SessionOptions

# create the inference session with graph optimizations enabled, on the CUDA execution provider
options = SessionOptions()
options.graph_optimization_level = GraphOptimizationLevel.ORT_ENABLE_ALL
onnx_session = onnxrt.InferenceSession(f"model_{model_id}.onnx", options, providers=["CUDAExecutionProvider"])

# build the input feed from the preprocessed image tensor
onnx_inputs = {onnx_session.get_inputs()[0].name: img_torch.unsqueeze(0).cpu().numpy()}

# perform inference and measure the time taken
t_0 = time.perf_counter()
onnx_output = onnx_session.run(None, onnx_inputs)[0]
t_1 = time.perf_counter()

Inference time output:

second inference time with PyTorch on an RTX 4090 GPU, input torch.Size([155, 3, 384, 384]): 0.02 seconds
second inference time with ONNX on an RTX 4090 GPU, input torch.Size([155, 3, 384, 384]): 5.06 seconds
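
For completeness, this is roughly how I check the execution provider and warm the session up before timing (a sketch; the number of warm-up and timed runs is arbitrary):

# confirm ONNX Runtime is really using CUDA and has not fallen back to CPU
print(onnx_session.get_providers())

# warm-up runs so graph optimization and CUDA allocations are not included in the timing
for _ in range(3):
    onnx_session.run(None, onnx_inputs)

# timed runs
n_runs = 10
t_0 = time.perf_counter()
for _ in range(n_runs):
    onnx_session.run(None, onnx_inputs)
t_1 = time.perf_counter()
print(f"average ONNX inference time: {(t_1 - t_0) / n_runs:.3f} s")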

I also tried https://huggingface.co/docs/transformers/v4.19.0/serialization, thinking it might shorten the inference time, but could not get it to work because of a tokenizer issue:

from transformers import AutoTokenizer, AutoModelForSemanticSegmentation

# load the PyTorch weights and tokenizer from the Hub
pt_model = AutoModelForSemanticSegmentation.from_pretrained("nvidia/segformer-b0-finetuned-ade-512-512")
tokenizer = AutoTokenizer.from_pretrained("nvidia/segformer-b0-finetuned-ade-512-512")  # this is where the KeyError below is raised
local_pt_checkpoint = ".\\segformer_trained_checkpoint"

# save the checkpoint locally
pt_model.save_pretrained(local_pt_checkpoint)

# export to ONNX (interpolate the Python variable into the shell command)
!python -m transformers.onnx --model={local_pt_checkpoint} onnx/

error:

KeyError                                  Traceback (most recent call last)
Cell In[38], line 6
---> 6 tokenizer = AutoTokenizer.from_pretrained("nvidia/segformer-b0-finetuned-ade-512-512")

KeyError: <class 'transformers.models.segformer.configuration_segformer.SegformerConfig'>
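
If I read the KeyError correctly, it comes from the AutoTokenizer call: SegFormer is a vision model and ships an image processor rather than a tokenizer, so the preprocessing class would presumably have to be loaded like this instead (a sketch; I have not verified that this unblocks the CLI export):

from transformers import AutoImageProcessor

# SegFormer has no tokenizer; its preprocessing class is an image processor
image_processor = AutoImageProcessor.from_pretrained("nvidia/segformer-b0-finetuned-ade-512-512")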

Is there any way to reduce the ONNX inference time, or is there any flag I'm missing in the ONNX export call? Any help would be appreciated. Thank you :hugs: