paulds8 commented 2 years ago

Issue Type

Feature Request

OS

Ubuntu

OS architecture

aarch64

Programming Language

Python

Framework

ONNX

Model name and Weights/Checkpoints URL

OpenAI CLIP

https://github.com/openai/CLIP

I used the following notebook with opset=14 targeting this PR (https://github.com/openai/CLIP/pull/219) to convert both the image & text "branches" of ViT-B/32: https://colab.research.google.com/github/josephrocca/openai-clip-js/blob/main/Export_CLIP_to_ONNX_tflite_tfjs_tf_saved_model.ipynb#scrollTo=kDmmi0vMI9WY

I then ensure all shapes are properly defined so I can use onnxruntime with the TensorRT backend:

python3 -m onnxruntime.tools.symbolic_shape_infer --input onnx_models/clip-text-vit-32.onnx --output onnx_models/clip-text-vit-32-shaped.onnx
python3 -m onnxruntime.tools.symbolic_shape_infer --input onnx_models/clip-image-vit-32.onnx --output onnx_models/clip-image-vit-32-shaped.onnx

Description

I am working on a real-time AI application in an embedded context (a Jetson TX2 is the target device). I need the latency of every component to be as low as possible.

I stumbled across this repo today and there are definitely some models here that will speed up my dev process considerably.

Thank you!!

I see you haven't yet added CLIP to the zoo. As you have significant experience optimizing models in constrained environments, I was hoping for some assistance.

The first challenge is getting the latency of using a CLIP model as low as possible.

Using the code below, on a Jetson TX2 I am able to get a result in ~0.24s on average for the model described above. I was hoping you'd have some techniques to help significantly reduce this. My ideal goal is to reduce this to roughly 0.05s on a TX2.

CLIP already uses FP16 in many places, which the TX2's GPU can take advantage of. I haven't been able to simplify the graph further or force more of the graph to FP16.

I was hoping you'd be able to help. I am happy to test and help in any way I can!

Code:

import torch
import onnxruntime

class ClipOnnx:
    def __init__(
        self,
        image_path: str = "clip_image.onnx",
        text_path: str = "clip_text.onnx",
        logit_scale: float = 4.6052
    ):
        self.image_path = image_path
        self.image_flag = True
        self.text_path = text_path
        self.text_flag = True
        self.logit_scale = logit_scale
        self.providers = [
            ('TensorrtExecutionProvider', {
                'device_id': 0,
                'trt_max_workspace_size': 4 * 1024 * 1024 * 1024,
                'trt_max_partition_iterations': 10000,
                'trt_fp16_enable': True,
                'trt_engine_cache_path': '.trtcache',
                'trt_engine_cache_enable': True,
                'trt_min_subgraph_size': 3,
                'trt_dla_enable': False,
            }),
            ('CUDAExecutionProvider', {
                'device_id': 0,
                'arena_extend_strategy': 'kSameAsRequested',
                'gpu_mem_limit': 1 * 1024 * 1024 * 1024,
                'cudnn_conv_algo_search': 'HEURISTIC',
                'do_copy_in_default_stream': True,
            })
        ]

    def start_sessions(
        self,
    ):
        print("Starting Image Branch Inference Session...")
        if self.image_flag:
            self.image_session = onnxruntime.InferenceSession(self.image_path,
                                                               providers=self.providers)
        print("Starting Text Branch Inference Session...")
        if self.text_flag:
            self.textual_session = onnxruntime.InferenceSession(self.text_path,
                                                                providers=self.providers)

    def image_run(self, onnx_image):
        onnx_input_image = {self.image_session.get_inputs()[0].name: onnx_image}
        image_output, = self.image_session.run(None, onnx_input_image)
        return image_output

    def textual_run(self, onnx_text):
        onnx_input_text = {self.textual_session.get_inputs()[0].name: onnx_text}
        textual_output, = self.textual_session.run(None, onnx_input_text)
        return textual_output

    def __call__(self, image, text, device: str = "cuda:0"):
        assert self.image_flag and self.text_flag
        image_features = torch.from_numpy(self.image_run(image)).to(device)
        text_features = torch.from_numpy(self.textual_run(text)).to(device)

        # normalized features
        image_features = image_features / image_features.norm(dim=-1, keepdim=True)
        text_features = text_features / text_features.norm(dim=-1, keepdim=True)

        # cosine similarity as logits
        logits_per_image = self.logit_scale * image_features @ text_features.t()
        logits_per_text = logits_per_image.t()

        return logits_per_image, logits_per_text

    def encode_image(self, image):
        return self.image_run(image)

    def encode_text(self, text):
        return self.textual_run(text)
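
One detail worth checking in the wrapper above: `logit_scale` is applied as a plain multiplier, whereas the `forward` method of the CLIP model code exponentiates its learned `logit_scale` parameter before multiplying. If 4.6052 is that learned log-scale (an assumption; it is close to ln 100), the two paths differ by a constant factor. That changes how sharp a softmax over the logits is, but not which text ranks highest. A minimal stdlib-only sketch of the difference, using made-up 3-dimensional feature vectors:

```python
import math

def normalize(v):
    """Scale a vector to unit L2 norm."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def cosine(a, b):
    """Cosine similarity of two vectors."""
    return sum(x * y for x, y in zip(normalize(a), normalize(b)))

LOG_SCALE = 4.6052  # default from the wrapper above; assumed to be a log-scale

# Hypothetical feature vectors, for illustration only.
image = [0.2, 0.9, 0.1]
texts = [[0.1, 1.0, 0.0], [1.0, 0.1, 0.3]]

sims = [cosine(image, t) for t in texts]
logits_plain = [LOG_SCALE * s for s in sims]          # multiplier used as-is
logits_exp = [math.exp(LOG_SCALE) * s for s in sims]  # multiplier exponentiated first

# The ranking is the same either way; only the scale of the logits differs.
best_plain = max(range(len(sims)), key=lambda i: logits_plain[i])
best_exp = max(range(len(sims)), key=lambda i: logits_exp[i])
print(best_plain, best_exp, round(math.exp(LOG_SCALE), 2))
```

If the logits are only used to pick the best match, either form works; if they feed a softmax, the choice of scale matters.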

import torch
from clip import clip
import numpy as np
from clip_onnx import ClipOnnx

npx = 224 # torchscript_model.visual.input_resolution for ViT-B/32
dummy_image = torch.randn(10, 3, npx, npx, dtype=torch.half).numpy()
dummy_texts = clip.tokenize(["quick brown fox", "lorem ipsum"]).numpy().astype(np.int32)

clip_onnx = ClipOnnx(
    image_path = "onnx_models/clip-image-vit-32-shaped.onnx",
    text_path = "onnx_models/clip-text-vit-32-shaped.onnx"
)

clip_onnx.start_sessions()

import time
for i in range(10):
    start = time.time()

    result = clip_onnx(dummy_image, dummy_texts)
    print(i, time.time()-start)
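
The loop above times every call, including the first. With the engine cache enabled, the first call or two may include engine build or load time (an assumption about this setup, not something measured here), so it can help to discard a few warm-up iterations and report the median. A minimal sketch using only the standard library; `benchmark` and its parameters are hypothetical names, and `fn` stands in for a call such as `lambda: clip_onnx(dummy_image, dummy_texts)`:

```python
import statistics
import time

def benchmark(fn, warmup=3, runs=10):
    """Call fn() `warmup` times untimed, then time `runs` further calls.

    Returns (median_seconds, list_of_timed_durations).
    """
    for _ in range(warmup):
        fn()
    durations = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        durations.append(time.perf_counter() - start)
    return statistics.median(durations), durations

# Stand-in workload so the sketch runs without a GPU or model files.
median_s, all_s = benchmark(lambda: sum(range(10_000)), warmup=2, runs=5)
print(f"median latency: {median_s * 1000:.3f} ms over {len(all_s)} runs")
```

If the post-processing runs on the GPU through torch, calling `torch.cuda.synchronize()` inside `fn` before it returns may be needed for the timings to include that work.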

Relevant Log Output

No response

URL or source code for simple inference testing code

No response

paulds8 commented 2 years ago

It's also worth noting that I don't need to use CLIP specifically. If you know of a more efficient alternative that achieves the same outcome, I can certainly consider it instead.

PINTO0309 commented 2 years ago

Can the batch size be fixed at 1?

paulds8 commented 2 years ago

The batch size can be fixed in the context where I need to apply this. Perhaps not necessarily at 1, but at a fixed number, yes. I will attempt to generate an ONNX graph with a few fixed batch sizes and let you know the outcome.

paulds8 commented 2 years ago

It doesn't seem to have made any significant difference in this case. Perhaps I'll get the speedup I'm looking for once pruned models are released.

PINTO0309 commented 2 years ago

I am not really interested in talking about your performance. But I am at a loss to understand why you would ignore the error message. I am not your handyman.

I will delete the file attached here within a few days, as it is taking up space on my Google Drive. Whether I get a reply from you or not.

onnx https://drive.google.com/file/d/18i9rLY1l5rhia-w4Ie4Bu7JO4hU2BFtJ/view?usp=sharing

ArgMax Float64

Workaround for conversion error when applying ArgMax/ArgMin to TensorRT if INT64 or INT32 is specified as the input tensor.
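
The idea, as far as it can be read from the modified `encode_text` in model.py, is to cast the integer token tensor to floating point before taking the arg-max, so that the exported ArgMax node receives a float input. The cast should not change the result, because token ids are small integers that a float64 represents exactly. A stdlib-only illustration (the word token ids are made up; 49406 and 49407 are CLIP's start-of-text and end-of-text ids):

```python
def argmax(values):
    """Index of the first maximum, like argmax along the last axis."""
    return max(range(len(values)), key=lambda i: values[i])

# A toy tokenized row: start token, three word tokens, end token, padding.
tokens = [49406, 3712, 2866, 3240, 49407, 0, 0, 0]

eot_from_ints = argmax(tokens)
eot_from_floats = argmax([float(t) for t in tokens])  # cast first, as in the workaround

print(eot_from_ints, eot_from_floats)
```

Both calls locate the end-of-text position, which is the row of the transformer output that `encode_text` projects.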

model.py


from collections import OrderedDict
from typing import Tuple, Union

import numpy as np
import torch
import torch.nn.functional as F
from torch import nn

class Bottleneck(nn.Module):
    expansion = 4

    def __init__(self, inplanes, planes, stride=1):
        super().__init__()

        # all conv layers have stride 1. an avgpool is performed after the second convolution when stride > 1
        self.conv1 = nn.Conv2d(inplanes, planes, 1, bias=False)
        self.bn1 = nn.BatchNorm2d(planes)

        self.conv2 = nn.Conv2d(planes, planes, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(planes)

        self.avgpool = nn.AvgPool2d(stride) if stride > 1 else nn.Identity()

        self.conv3 = nn.Conv2d(planes, planes * self.expansion, 1, bias=False)
        self.bn3 = nn.BatchNorm2d(planes * self.expansion)

        self.relu = nn.ReLU(inplace=True)
        self.downsample = None
        self.stride = stride

        if stride > 1 or inplanes != planes * Bottleneck.expansion:
            # downsampling layer is prepended with an avgpool, and the subsequent convolution has stride 1
            self.downsample = nn.Sequential(OrderedDict([
                ("-1", nn.AvgPool2d(stride)),
                ("0", nn.Conv2d(inplanes, planes * self.expansion, 1, stride=1, bias=False)),
                ("1", nn.BatchNorm2d(planes * self.expansion))
            ]))

    def forward(self, x: torch.Tensor):
        identity = x

        out = self.relu(self.bn1(self.conv1(x)))
        out = self.relu(self.bn2(self.conv2(out)))
        out = self.avgpool(out)
        out = self.bn3(self.conv3(out))

        if self.downsample is not None:
            identity = self.downsample(x)

        out += identity
        out = self.relu(out)
        return out

class AttentionPool2d(nn.Module):
    def __init__(self, spacial_dim: int, embed_dim: int, num_heads: int, output_dim: int = None):
        super().__init__()
        self.positional_embedding = nn.Parameter(torch.randn(spacial_dim ** 2 + 1, embed_dim) / embed_dim ** 0.5)
        self.k_proj = nn.Linear(embed_dim, embed_dim)
        self.q_proj = nn.Linear(embed_dim, embed_dim)
        self.v_proj = nn.Linear(embed_dim, embed_dim)
        self.c_proj = nn.Linear(embed_dim, output_dim or embed_dim)
        self.num_heads = num_heads

    def forward(self, x):
        x = x.reshape(x.shape[0], x.shape[1], x.shape[2] * x.shape[3]).permute(2, 0, 1)  # NCHW -> (HW)NC
        x = torch.cat([x.mean(dim=0, keepdim=True), x], dim=0)  # (HW+1)NC
        x = x + self.positional_embedding[:, None, :].to(x.dtype)  # (HW+1)NC
        x, _ = F.multi_head_attention_forward(
            query=x, key=x, value=x,
            embed_dim_to_check=x.shape[-1],
            num_heads=self.num_heads,
            q_proj_weight=self.q_proj.weight,
            k_proj_weight=self.k_proj.weight,
            v_proj_weight=self.v_proj.weight,
            in_proj_weight=None,
            in_proj_bias=torch.cat([self.q_proj.bias, self.k_proj.bias, self.v_proj.bias]),
            bias_k=None,
            bias_v=None,
            add_zero_attn=False,
            dropout_p=0,
            out_proj_weight=self.c_proj.weight,
            out_proj_bias=self.c_proj.bias,
            use_separate_proj_weight=True,
            training=self.training,
            need_weights=False
        )

        return x[0]

class ModifiedResNet(nn.Module):
    """
    A ResNet class that is similar to torchvision's but contains the following changes:
    - There are now 3 "stem" convolutions as opposed to 1, with an average pool instead of a max pool.
    - Performs anti-aliasing strided convolutions, where an avgpool is prepended to convolutions with stride > 1
    - The final pooling layer is a QKV attention instead of an average pool
    """

    def __init__(self, layers, output_dim, heads, input_resolution=224, width=64):
        super().__init__()
        self.output_dim = output_dim
        self.input_resolution = input_resolution

        # the 3-layer stem
        self.conv1 = nn.Conv2d(3, width // 2, kernel_size=3, stride=2, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(width // 2)
        self.conv2 = nn.Conv2d(width // 2, width // 2, kernel_size=3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(width // 2)
        self.conv3 = nn.Conv2d(width // 2, width, kernel_size=3, padding=1, bias=False)
        self.bn3 = nn.BatchNorm2d(width)
        self.avgpool = nn.AvgPool2d(2)
        self.relu = nn.ReLU(inplace=True)

        # residual layers
        self._inplanes = width  # this is a *mutable* variable used during construction
        self.layer1 = self._make_layer(width, layers[0])
        self.layer2 = self._make_layer(width * 2, layers[1], stride=2)
        self.layer3 = self._make_layer(width * 4, layers[2], stride=2)
        self.layer4 = self._make_layer(width * 8, layers[3], stride=2)

        embed_dim = width * 32  # the ResNet feature dimension
        self.attnpool = AttentionPool2d(input_resolution // 32, embed_dim, heads, output_dim)

    def _make_layer(self, planes, blocks, stride=1):
        layers = [Bottleneck(self._inplanes, planes, stride)]

        self._inplanes = planes * Bottleneck.expansion
        for _ in range(1, blocks):
            layers.append(Bottleneck(self._inplanes, planes))

        return nn.Sequential(*layers)

    def forward(self, x):
        def stem(x):
            for conv, bn in [(self.conv1, self.bn1), (self.conv2, self.bn2), (self.conv3, self.bn3)]:
                x = self.relu(bn(conv(x)))
            x = self.avgpool(x)
            return x

        x = x.type(self.conv1.weight.dtype)
        x = stem(x)
        x = self.layer1(x)
        x = self.layer2(x)
        x = self.layer3(x)
        x = self.layer4(x)
        x = self.attnpool(x)

        return x

class LayerNorm(nn.LayerNorm):
    """Subclass torch's LayerNorm to handle fp16."""

    def forward(self, x: torch.Tensor):
        orig_type = x.dtype
        ret = super().forward(x.type(torch.float32))
        return ret.type(orig_type)

class QuickGELU(nn.Module):
    def forward(self, x: torch.Tensor):
        return x * torch.sigmoid(1.702 * x)

class ResidualAttentionBlock(nn.Module):
    def __init__(self, d_model: int, n_head: int, attn_mask: torch.Tensor = None):
        super().__init__()

        self.attn = nn.MultiheadAttention(d_model, n_head)
        self.ln_1 = LayerNorm(d_model)
        self.mlp = nn.Sequential(OrderedDict([
            ("c_fc", nn.Linear(d_model, d_model * 4)),
            ("gelu", QuickGELU()),
            ("c_proj", nn.Linear(d_model * 4, d_model))
        ]))
        self.ln_2 = LayerNorm(d_model)
        self.attn_mask = attn_mask

    def attention(self, x: torch.Tensor):
        self.attn_mask = self.attn_mask.to(dtype=x.dtype, device=x.device) if self.attn_mask is not None else None
        return self.attn(x, x, x, need_weights=False, attn_mask=self.attn_mask)[0]

    def forward(self, x: torch.Tensor):
        x = x + self.attention(self.ln_1(x))
        x = x + self.mlp(self.ln_2(x))
        return x

class Transformer(nn.Module):
    def __init__(self, width: int, layers: int, heads: int, attn_mask: torch.Tensor = None):
        super().__init__()
        self.width = width
        self.layers = layers
        self.resblocks = nn.Sequential(*[ResidualAttentionBlock(width, heads, attn_mask) for _ in range(layers)])

    def forward(self, x: torch.Tensor):
        return self.resblocks(x)

class VisionTransformer(nn.Module):
    def __init__(self, input_resolution: int, patch_size: int, width: int, layers: int, heads: int, output_dim: int):
        super().__init__()
        self.input_resolution = input_resolution
        self.output_dim = output_dim
        self.conv1 = nn.Conv2d(in_channels=3, out_channels=width, kernel_size=patch_size, stride=patch_size, bias=False)

        scale = width ** -0.5
        self.class_embedding = nn.Parameter(scale * torch.randn(width))
        self.positional_embedding = nn.Parameter(scale * torch.randn((input_resolution // patch_size) ** 2 + 1, width))
        self.ln_pre = LayerNorm(width)

        self.transformer = Transformer(width, layers, heads)

        self.ln_post = LayerNorm(width)
        self.proj = nn.Parameter(scale * torch.randn(width, output_dim))

    def forward(self, x: torch.Tensor):
        x = self.conv1(x)  # shape = [*, width, grid, grid]
        x = x.reshape(x.shape[0], x.shape[1], -1)  # shape = [*, width, grid ** 2]
        x = x.permute(0, 2, 1)  # shape = [*, grid ** 2, width]
        x = torch.cat([self.class_embedding.to(x.dtype) + torch.zeros(x.shape[0], 1, x.shape[-1], dtype=x.dtype, device=x.device), x], dim=1)  # shape = [*, grid ** 2 + 1, width]
        x = x + self.positional_embedding.to(x.dtype)
        x = self.ln_pre(x)

        x = x.permute(1, 0, 2)  # NLD -> LND
        x = self.transformer(x)
        x = x.permute(1, 0, 2)  # LND -> NLD

        x = self.ln_post(x[:, 0, :])

        if self.proj is not None:
            x = x @ self.proj

        return x

class CLIP(nn.Module):
    def __init__(self,
                 embed_dim: int,
                 # vision
                 image_resolution: int,
                 vision_layers: Union[Tuple[int, int, int, int], int],
                 vision_width: int,
                 vision_patch_size: int,
                 # text
                 context_length: int,
                 vocab_size: int,
                 transformer_width: int,
                 transformer_heads: int,
                 transformer_layers: int
                 ):
        super().__init__()

        self.context_length = context_length

        if isinstance(vision_layers, (tuple, list)):
            vision_heads = vision_width * 32 // 64
            self.visual = ModifiedResNet(
                layers=vision_layers,
                output_dim=embed_dim,
                heads=vision_heads,
                input_resolution=image_resolution,
                width=vision_width
            )
        else:
            vision_heads = vision_width // 64
            self.visual = VisionTransformer(
                input_resolution=image_resolution,
                patch_size=vision_patch_size,
                width=vision_width,
                layers=vision_layers,
                heads=vision_heads,
                output_dim=embed_dim
            )

        self.transformer = Transformer(
            width=transformer_width,
            layers=transformer_layers,
            heads=transformer_heads,
            attn_mask=self.build_attention_mask()
        )

        self.vocab_size = vocab_size
        self.token_embedding = nn.Embedding(vocab_size, transformer_width)
        self.positional_embedding = nn.Parameter(torch.empty(self.context_length, transformer_width))
        self.ln_final = LayerNorm(transformer_width)

        self.text_projection = nn.Parameter(torch.empty(transformer_width, embed_dim))
        self.logit_scale = nn.Parameter(torch.ones([]) * np.log(1 / 0.07))

        self.initialize_parameters()

    def initialize_parameters(self):
        nn.init.normal_(self.token_embedding.weight, std=0.02)
        nn.init.normal_(self.positional_embedding, std=0.01)

        if isinstance(self.visual, ModifiedResNet):
            if self.visual.attnpool is not None:
                std = self.visual.attnpool.c_proj.in_features ** -0.5
                nn.init.normal_(self.visual.attnpool.q_proj.weight, std=std)
                nn.init.normal_(self.visual.attnpool.k_proj.weight, std=std)
                nn.init.normal_(self.visual.attnpool.v_proj.weight, std=std)
                nn.init.normal_(self.visual.attnpool.c_proj.weight, std=std)

            for resnet_block in [self.visual.layer1, self.visual.layer2, self.visual.layer3, self.visual.layer4]:
                for name, param in resnet_block.named_parameters():
                    if name.endswith("bn3.weight"):
                        nn.init.zeros_(param)

        proj_std = (self.transformer.width ** -0.5) * ((2 * self.transformer.layers) ** -0.5)
        attn_std = self.transformer.width ** -0.5
        fc_std = (2 * self.transformer.width) ** -0.5
        for block in self.transformer.resblocks:
            nn.init.normal_(block.attn.in_proj_weight, std=attn_std)
            nn.init.normal_(block.attn.out_proj.weight, std=proj_std)
            nn.init.normal_(block.mlp.c_fc.weight, std=fc_std)
            nn.init.normal_(block.mlp.c_proj.weight, std=proj_std)

        if self.text_projection is not None:
            nn.init.normal_(self.text_projection, std=self.transformer.width ** -0.5)

    def build_attention_mask(self):
        # lazily create causal attention mask, with full attention between the vision tokens
        # pytorch uses additive attention mask; fill with -inf
        mask = torch.empty(self.context_length, self.context_length)
        mask.fill_(float("-inf"))
        mask.triu_(1)  # zero out the lower diagonal
        return mask

    @property
    def dtype(self):
        return self.visual.conv1.weight.dtype

    def encode_image(self, image):
        return self.visual(image.type(self.dtype))

    def encode_text(self, text):
        x = self.token_embedding(text).type(self.dtype)  # [batch_size, n_ctx, d_model]

        x = x + self.positional_embedding.type(self.dtype)
        x = x.permute(1, 0, 2)  # NLD -> LND
        x = self.transformer(x)
        x = x.permute(1, 0, 2)  # LND -> NLD
        x = self.ln_final(x).type(self.dtype)

        # x.shape = [batch_size, n_ctx, transformer.width]
        # take features from the eot embedding (eot_token is the highest number in each sequence)
        x = x[torch.arange(x.shape[0]), text.to(torch.float64).argmax(dim=-1)] @ self.text_projection

        return x

    def forward(self, image, text):
        image_features = self.encode_image(image).cpu()
        text_features = self.encode_text(text).cpu()

        # normalized features
        image_features = image_features / image_features.norm(dim=1, keepdim=True)
        text_features = text_features / text_features.norm(dim=1, keepdim=True)

        # cosine similarity as logits
        logit_scale = self.logit_scale.cpu().exp()
        logits_per_image = logit_scale * image_features @ text_features.t()
        logits_per_text = logits_per_image.t()

        # shape = [global_batch_size, global_batch_size]
        return logits_per_image, logits_per_text

def convert_weights(model: nn.Module):
    """Convert applicable model parameters to fp16"""

    def _convert_weights_to_fp16(l):
        if isinstance(l, (nn.Conv1d, nn.Conv2d, nn.Linear)):
            l.weight.data = l.weight.data.half()
            if l.bias is not None:
                l.bias.data = l.bias.data.half()

        if isinstance(l, nn.MultiheadAttention):
            for attr in [*[f"{s}_proj_weight" for s in ["in", "q", "k", "v"]], "in_proj_bias", "bias_k", "bias_v"]:
                tensor = getattr(l, attr)
                if tensor is not None:
                    tensor.data = tensor.data.half()

        for name in ["text_projection", "proj"]:
            if hasattr(l, name):
                attr = getattr(l, name)
                if attr is not None:
                    attr.data = attr.data.half()

    model.apply(_convert_weights_to_fp16)

def build_model(state_dict: dict):
    vit = "visual.proj" in state_dict

    if vit:
        vision_width = state_dict["visual.conv1.weight"].shape[0]
        vision_layers = len([k for k in state_dict.keys() if k.startswith("visual.") and k.endswith(".attn.in_proj_weight")])
        vision_patch_size = state_dict["visual.conv1.weight"].shape[-1]
        grid_size = round((state_dict["visual.positional_embedding"].shape[0] - 1) ** 0.5)
        image_resolution = vision_patch_size * grid_size
    else:
        counts: list = [len(set(k.split(".")[2] for k in state_dict if k.startswith(f"visual.layer{b}"))) for b in [1, 2, 3, 4]]
        vision_layers = tuple(counts)
        vision_width = state_dict["visual.layer1.0.conv1.weight"].shape[0]
        output_width = round((state_dict["visual.attnpool.positional_embedding"].shape[0] - 1) ** 0.5)
        vision_patch_size = None
        assert output_width ** 2 + 1 == state_dict["visual.attnpool.positional_embedding"].shape[0]
        image_resolution = output_width * 32

    embed_dim = state_dict["text_projection"].shape[1]
    context_length = state_dict["positional_embedding"].shape[0]
    vocab_size = state_dict["token_embedding.weight"].shape[0]
    transformer_width = state_dict["ln_final.weight"].shape[0]
    transformer_heads = transformer_width // 64
    transformer_layers = len(set(k.split(".")[2] for k in state_dict if k.startswith(f"transformer.resblocks")))

    model = CLIP(
        embed_dim,
        image_resolution, vision_layers, vision_width, vision_patch_size,
        context_length, vocab_size, transformer_width, transformer_heads, transformer_layers
    )

    for key in ["input_resolution", "context_length", "vocab_size"]:
        if key in state_dict:
            del state_dict[key]

    convert_weights(model)
    model.load_state_dict(state_dict)
    return model.eval()
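
For readers unfamiliar with `build_attention_mask` in the listing above: `triu_(1)` keeps the `-inf` entries strictly above the diagonal and zeroes everything else, so each text position can attend only to itself and earlier positions. A plain-Python sketch of the same pattern for a hypothetical context length of 4 (`causal_mask` is an illustrative helper, not part of CLIP):

```python
NEG_INF = float("-inf")

def causal_mask(n):
    """n x n additive mask: 0.0 where attention is allowed, -inf where blocked."""
    return [[NEG_INF if col > row else 0.0 for col in range(n)] for row in range(n)]

mask = causal_mask(4)
for row in mask:
    print(row)
```

Because the mask is additive, the `-inf` entries drive the corresponding attention weights to zero after the softmax.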


- export.py
```python
import os
import skimage
from PIL import Image
import numpy as np
import torch
import onnx
from onnxsim import simplify
import clip

clip.available_models()

load_model = 'ViT-B/32'

model, preprocess = clip.load(load_model)
model.cuda().eval()
model.float()
input_resolution = model.visual.input_resolution
context_length = model.context_length
vocab_size = model.vocab_size

print("Model parameters:", f"{np.sum([int(np.prod(p.shape)) for p in model.parameters()]):,}")
print("Input resolution:", input_resolution)
print("Context length:", context_length)
print("Vocab size:", vocab_size)

clip.tokenize("Hello World!")

# images in skimage to use and their textual descriptions
descriptions = {
    "astronaut": "a portrait of an astronaut with the American flag",
}

original_images = []
images = []
texts = []

for filename in [filename for filename in os.listdir(skimage.data_dir) if filename.endswith(".png") or filename.endswith(".jpg")]:
    name = os.path.splitext(filename)[0]
    if name not in descriptions:
        continue

    image = Image.open(os.path.join(skimage.data_dir, filename)).convert("RGB")
    original_images.append(image)
    images.append(preprocess(image))
    texts.append(descriptions[name])

image_input = torch.tensor(np.stack(images)).cpu()#.cuda()
text_tokens = clip.tokenize(["This is " + desc for desc in texts]).cuda()

model.visual.cpu()
model.visual(image_input)[0] # astronaut pic embedding
model(image_input, text_tokens.to(torch.int32))[0] # astronaut text embedding

onnx_file = f"clip_{load_model.lower().replace('-','_').replace('/','_')}.onnx"

torch.onnx.export(
    model,
    (image_input, text_tokens),
    onnx_file,
    opset_version=11,
)

model_onnx2 = onnx.load(onnx_file)
model_simp, check = simplify(model_onnx2)
onnx.save(model_simp, onnx_file)
```

python export.py

onnxsim clip_vit_b_32.onnx clip_vit_b_32.onnx

onnx2trt clip_vit_b_32.onnx -o clip_vit_b_32.trt -b 1 -d 16 -v

[2022-04-02 13:07:15 WARNING] TensorRT was linked against cuBLAS/cuBLAS LT 11.6.5 but loaded cuBLAS/cuBLAS LT 11.6.1
[2022-04-02 13:07:15    INFO] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +8, now: CPU 4041, GPU 2430 (MiB)
[2022-04-02 13:07:15    INFO] [MemUsageChange] Init cuDNN: CPU +0, GPU +8, now: CPU 4041, GPU 2438 (MiB)
[2022-04-02 13:07:15    INFO] [MemUsageChange] TensorRT-managed allocation in building engine: CPU +4, GPU +289, now: CPU 4, GPU 289 (MiB)
Writing TensorRT engine to clip_vit_b_32.trt
All done

[Image: clip_vit_b_32 onnx]

javiabellan commented 1 year ago

@PINTO0309 How did you get such a neat and clean ONNX visualization of your CLIP model? When I use Netron (even after simplifying my model with onnxsim), I get a messy graph like this:

[Image: Screenshot 2022-12-01 at 1 37 08]

PINTO0309 commented 1 year ago

@javiabellan Do not start another discussion on a closed issue.

PINTO0309 / PINTO_model_zoo

OpenAI CLIP #246
