PINTO0309 / PINTO_model_zoo

A repository for storing models that have been inter-converted between various frameworks. Supported frameworks are TensorFlow, PyTorch, ONNX, OpenVINO, TFJS, TFTRT, TensorFlowLite (Float32/16/INT8), EdgeTPU, CoreML.
https://qiita.com/PINTO
MIT License

OpenAI CLIP #246

Closed. paulds8 closed this issue 2 years ago.

paulds8 commented 2 years ago

Issue Type

Feature Request

OS

Ubuntu

OS architecture

aarch64

Programming Language

Python

Framework

ONNX

Model name and Weights/Checkpoints URL

OpenAI CLIP

https://github.com/openai/CLIP

I used the following notebook with opset=14, targeting this PR (https://github.com/openai/CLIP/pull/219), to convert both the image and text "branches" of ViT-B/32: https://colab.research.google.com/github/josephrocca/openai-clip-js/blob/main/Export_CLIP_to_ONNX_tflite_tfjs_tf_saved_model.ipynb#scrollTo=kDmmi0vMI9WY

I then ensure all shapes are properly defined so I can use onnxruntime with the TensorRT backend:

python3 -m onnxruntime.tools.symbolic_shape_infer --input onnx_models/clip-text-vit-32.onnx --output onnx_models/clip-text-vit-32-shaped.onnx
python3 -m onnxruntime.tools.symbolic_shape_infer --input onnx_models/clip-image-vit-32.onnx --output onnx_models/clip-image-vit-32-shaped.onnx
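(As a sanity check that the shapes really are fully specified afterwards, something like the following can be run; only a sketch using the standard onnx API, against the shaped model paths above:)

```python
import onnx

# List any tensors whose dimensions are still not concrete integers after
# symbolic shape inference (TensorRT prefers fully static shapes).
model = onnx.load("onnx_models/clip-text-vit-32-shaped.onnx")
not_concrete = []
for vi in list(model.graph.input) + list(model.graph.value_info) + list(model.graph.output):
    dims = vi.type.tensor_type.shape.dim
    if any(not d.HasField("dim_value") for d in dims):
        not_concrete.append(vi.name)
print(f"{len(not_concrete)} tensors still have non-concrete dims:", not_concrete[:10])
```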

Description

I am working on a real-time AI application in an embedded context (the Jetson TX2 is the target device), and I need latency to be as low as possible for every component.

I stumbled across this repo today and there are definitely some models here that will speed up my dev process considerably.

Thank you!!

I see you haven't yet added CLIP to the zoo. As you have significant experience optimizing models in constrained environments, I was hoping for some assistance.

The first challenge is getting the latency of using a CLIP model as low as possible.

Using the code below on a Jetson TX2, I get a result in ~0.24 s on average for the model described above. I was hoping you'd have some techniques to reduce this significantly; my ideal goal is roughly 0.05 s.

CLIP already uses FP16 in many places, which the TX2's GPU can take advantage of. I haven't been able to simplify the graph any further or force more of it into FP16.
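(One way to experiment with pushing more of the graph to FP16 ahead of time is onnxconverter-common's float16 helper; this is only a sketch, with file names assumed from the shaped models above, and the TensorRT EP's trt_fp16_enable flag may already cover most of it:)

```python
import onnx
from onnxconverter_common import float16

# Convert weights/ops to FP16 while keeping FP32 graph inputs and outputs.
model = onnx.load("onnx_models/clip-image-vit-32-shaped.onnx")
model_fp16 = float16.convert_float_to_float16(model, keep_io_types=True)
onnx.save(model_fp16, "onnx_models/clip-image-vit-32-shaped-fp16.onnx")
```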

I was hoping you'd be able to help. I am happy to test and help in any way I can!

Code:

import torch
import onnxruntime

class ClipOnnx:
    def __init__(
        self,
        image_path: str = "clip_image.onnx",
        text_path: str = "clip_text.onnx",
        logit_scale: float = 4.6052
    ):
        self.image_path = image_path
        self.image_flag = True
        self.text_path = text_path
        self.text_flag = True
        self.logit_scale = logit_scale
        self.providers = [
            ('TensorrtExecutionProvider', {
                'device_id': 0,
                'trt_max_workspace_size': 4 * 1024 * 1024 * 1024,
                'trt_max_partition_iterations': 10000,
                'trt_fp16_enable': True,
                'trt_engine_cache_path': '.trtcache',
                'trt_engine_cache_enable': True,
                'trt_min_subgraph_size': 3,
                'trt_dla_enable': False,
            }),
            ('CUDAExecutionProvider', {
                'device_id': 0,
                'arena_extend_strategy': 'kSameAsRequested',
                'gpu_mem_limit': 1 * 1024 * 1024 * 1024,
                'cudnn_conv_algo_search': 'HEURISTIC',
                'do_copy_in_default_stream': True,
            })
        ]

    def start_sessions(
        self,
    ):
        print("Starting Image Branch Inference Session...")
        if self.image_flag:
            self.image_session = onnxruntime.InferenceSession(self.image_path,
                                                               providers=self.providers)
        print("Starting Text Branch Inference Session...")
        if self.text_flag:
            self.textual_session = onnxruntime.InferenceSession(self.text_path,
                                                                providers=self.providers)

    def image_run(self, onnx_image):
        onnx_input_image = {self.image_session.get_inputs()[0].name: onnx_image}
        image_output, = self.image_session.run(None, onnx_input_image)
        return image_output

    def textual_run(self, onnx_text):
        onnx_input_text = {self.textual_session.get_inputs()[0].name: onnx_text}
        textual_output, = self.textual_session.run(None, onnx_input_text)
        return textual_output

    def __call__(self, image, text, device: str = "cuda:0"):
        assert self.image_flag and self.text_flag
        image_features = torch.from_numpy(self.image_run(image)).to(device)
        text_features = torch.from_numpy(self.textual_run(text)).to(device)

        # normalized features
        image_features = image_features / image_features.norm(dim=-1, keepdim=True)
        text_features = text_features / text_features.norm(dim=-1, keepdim=True)

        # cosine similarity as logits
        logits_per_image = self.logit_scale * image_features @ text_features.t()
        logits_per_text = logits_per_image.t()

        return logits_per_image, logits_per_text

    def encode_image(self, image):
        return self.image_run(image)

    def encode_text(self, text):
        return self.textual_run(text)

import torch
from clip import clip
import numpy as np
from clip_onnx import ClipOnnx

npx = 224 # torchscript_model.visual.input_resolution for ViT-B/32
dummy_image = torch.randn(10, 3, npx, npx, dtype=torch.half).numpy()
dummy_texts = clip.tokenize(["quick brown fox", "lorem ipsum"]).numpy().astype(np.int32)

clip_onnx = ClipOnnx(
    image_path = "onnx_models/clip-image-vit-32-shaped.onnx",
    text_path = "onnx_models/clip-text-vit-32-shaped.onnx"
)

clip_onnx.start_sessions()

import time
for i in range(10):
    start = time.time()

    result = clip_onnx(dummy_image, dummy_texts)
    print(i, time.time()-start)
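(Note on the timing loop above: with the TensorRT EP the first calls can include engine build or cache load time, so a fairer measurement separates warm-up from the timed runs. A sketch, reusing the objects above:)

```python
import time
import numpy as np

# Warm up first (TensorRT engine build / cache load happens here), then time.
for _ in range(3):
    clip_onnx(dummy_image, dummy_texts)

latencies = []
for _ in range(20):
    start = time.time()
    clip_onnx(dummy_image, dummy_texts)
    latencies.append(time.time() - start)

print(f"median {np.median(latencies) * 1000:.1f} ms, p90 {np.percentile(latencies, 90) * 1000:.1f} ms")
```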

Relevant Log Output

No response

URL or source code for simple inference testing code

No response

paulds8 commented 2 years ago

It's also worth noting that I don't need to use CLIP specifically; if you know of a more efficient alternative that achieves the same outcome, I can absolutely consider it instead.

PINTO0309 commented 2 years ago

Can the batch size be fixed at 1?

paulds8 commented 2 years ago

The batch size can be fixed in the context where I need to apply this. Perhaps not necessarily at 1, but at a fixed number, yes. I will attempt to generate ONNX graphs with a few fixed batch sizes and let you know the outcome.
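(For reference, one way to pin the batch dimension using only the onnx API; a sketch, with the batch value and file names as assumptions:)

```python
import onnx

# Overwrite the dynamic batch dimension of every graph input with a fixed
# value, then re-run shape inference. BATCH = 1 is just an example.
BATCH = 1
model = onnx.load("onnx_models/clip-image-vit-32.onnx")
for inp in model.graph.input:
    dim0 = inp.type.tensor_type.shape.dim[0]
    dim0.ClearField("dim_param")
    dim0.dim_value = BATCH
model = onnx.shape_inference.infer_shapes(model)
onnx.save(model, f"onnx_models/clip-image-vit-32-b{BATCH}.onnx")
```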

paulds8 commented 2 years ago

It doesn't seem to have made any significant difference in this case. Perhaps I'll get the speed-up I'm looking for once pruned models are released.
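(One thing worth checking before giving up on the latency target is how the graph is actually partitioned between the TensorRT and CUDA EPs; ONNX Runtime's built-in profiler records the provider each node ran on. A sketch:)

```python
import onnxruntime

# Enable ORT profiling to see which nodes run on TensorrtExecutionProvider
# and which fall back to CUDA/CPU (fallbacks are a common source of latency).
so = onnxruntime.SessionOptions()
so.enable_profiling = True
sess = onnxruntime.InferenceSession(
    "onnx_models/clip-image-vit-32-shaped.onnx",
    sess_options=so,
    providers=["TensorrtExecutionProvider", "CUDAExecutionProvider"],
)
# ... run a few inferences here ...
profile_path = sess.end_profiling()  # JSON trace; each entry lists the node's provider
print("profile written to", profile_path)
```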

PINTO0309 commented 2 years ago

I am not really interested in talking about your performance. But I am at a loss to understand why you would ignore the error message. I am not your handyman.

I will delete the file attached here within a few days, as it is taking up space on my Google Drive, whether I get a reply from you or not.

ArgMax Float64

A workaround for the conversion error that occurs in TensorRT when ArgMax/ArgMin is applied to an INT64 or INT32 input tensor.

from collections import OrderedDict
from typing import Tuple, Union

import numpy as np
import torch
import torch.nn.functional as F
from torch import nn

class Bottleneck(nn.Module):
    expansion = 4

def __init__(self, inplanes, planes, stride=1):
    super().__init__()

    # all conv layers have stride 1. an avgpool is performed after the second convolution when stride > 1
    self.conv1 = nn.Conv2d(inplanes, planes, 1, bias=False)
    self.bn1 = nn.BatchNorm2d(planes)

    self.conv2 = nn.Conv2d(planes, planes, 3, padding=1, bias=False)
    self.bn2 = nn.BatchNorm2d(planes)

    self.avgpool = nn.AvgPool2d(stride) if stride > 1 else nn.Identity()

    self.conv3 = nn.Conv2d(planes, planes * self.expansion, 1, bias=False)
    self.bn3 = nn.BatchNorm2d(planes * self.expansion)

    self.relu = nn.ReLU(inplace=True)
    self.downsample = None
    self.stride = stride

    if stride > 1 or inplanes != planes * Bottleneck.expansion:
        # downsampling layer is prepended with an avgpool, and the subsequent convolution has stride 1
        self.downsample = nn.Sequential(OrderedDict([
            ("-1", nn.AvgPool2d(stride)),
            ("0", nn.Conv2d(inplanes, planes * self.expansion, 1, stride=1, bias=False)),
            ("1", nn.BatchNorm2d(planes * self.expansion))
        ]))

def forward(self, x: torch.Tensor):
    identity = x

    out = self.relu(self.bn1(self.conv1(x)))
    out = self.relu(self.bn2(self.conv2(out)))
    out = self.avgpool(out)
    out = self.bn3(self.conv3(out))

    if self.downsample is not None:
        identity = self.downsample(x)

    out += identity
    out = self.relu(out)
    return out

class AttentionPool2d(nn.Module):
    def __init__(self, spacial_dim: int, embed_dim: int, num_heads: int, output_dim: int = None):
        super().__init__()
        self.positional_embedding = nn.Parameter(torch.randn(spacial_dim ** 2 + 1, embed_dim) / embed_dim ** 0.5)
        self.k_proj = nn.Linear(embed_dim, embed_dim)
        self.q_proj = nn.Linear(embed_dim, embed_dim)
        self.v_proj = nn.Linear(embed_dim, embed_dim)
        self.c_proj = nn.Linear(embed_dim, output_dim or embed_dim)
        self.num_heads = num_heads

def forward(self, x):
    x = x.reshape(x.shape[0], x.shape[1], x.shape[2] * x.shape[3]).permute(2, 0, 1)  # NCHW -> (HW)NC
    x = torch.cat([x.mean(dim=0, keepdim=True), x], dim=0)  # (HW+1)NC
    x = x + self.positional_embedding[:, None, :].to(x.dtype)  # (HW+1)NC
    x, _ = F.multi_head_attention_forward(
        query=x, key=x, value=x,
        embed_dim_to_check=x.shape[-1],
        num_heads=self.num_heads,
        q_proj_weight=self.q_proj.weight,
        k_proj_weight=self.k_proj.weight,
        v_proj_weight=self.v_proj.weight,
        in_proj_weight=None,
        in_proj_bias=torch.cat([self.q_proj.bias, self.k_proj.bias, self.v_proj.bias]),
        bias_k=None,
        bias_v=None,
        add_zero_attn=False,
        dropout_p=0,
        out_proj_weight=self.c_proj.weight,
        out_proj_bias=self.c_proj.bias,
        use_separate_proj_weight=True,
        training=self.training,
        need_weights=False
    )

    return x[0]

class ModifiedResNet(nn.Module):
    """A ResNet class that is similar to torchvision's but contains the following changes: ..."""

class LayerNorm(nn.LayerNorm):
    """Subclass torch's LayerNorm to handle fp16."""

def forward(self, x: torch.Tensor):
    orig_type = x.dtype
    ret = super().forward(x.type(torch.float32))
    return ret.type(orig_type)

class QuickGELU(nn.Module):
    def forward(self, x: torch.Tensor):
        return x * torch.sigmoid(1.702 * x)

class ResidualAttentionBlock(nn.Module):
    def __init__(self, d_model: int, n_head: int, attn_mask: torch.Tensor = None):
        super().__init__()

    self.attn = nn.MultiheadAttention(d_model, n_head)
    self.ln_1 = LayerNorm(d_model)
    self.mlp = nn.Sequential(OrderedDict([
        ("c_fc", nn.Linear(d_model, d_model * 4)),
        ("gelu", QuickGELU()),
        ("c_proj", nn.Linear(d_model * 4, d_model))
    ]))
    self.ln_2 = LayerNorm(d_model)
    self.attn_mask = attn_mask

def attention(self, x: torch.Tensor):
    self.attn_mask = self.attn_mask.to(dtype=x.dtype, device=x.device) if self.attn_mask is not None else None
    return self.attn(x, x, x, need_weights=False, attn_mask=self.attn_mask)[0]

def forward(self, x: torch.Tensor):
    x = x + self.attention(self.ln_1(x))
    x = x + self.mlp(self.ln_2(x))
    return x

class Transformer(nn.Module):
    def __init__(self, width: int, layers: int, heads: int, attn_mask: torch.Tensor = None):
        super().__init__()
        self.width = width
        self.layers = layers
        self.resblocks = nn.Sequential(*[ResidualAttentionBlock(width, heads, attn_mask) for _ in range(layers)])

def forward(self, x: torch.Tensor):
    return self.resblocks(x)

class VisionTransformer(nn.Module):
    def __init__(self, input_resolution: int, patch_size: int, width: int, layers: int, heads: int, output_dim: int):
        super().__init__()
        self.input_resolution = input_resolution
        self.output_dim = output_dim
        self.conv1 = nn.Conv2d(in_channels=3, out_channels=width, kernel_size=patch_size, stride=patch_size, bias=False)

    scale = width ** -0.5
    self.class_embedding = nn.Parameter(scale * torch.randn(width))
    self.positional_embedding = nn.Parameter(scale * torch.randn((input_resolution // patch_size) ** 2 + 1, width))
    self.ln_pre = LayerNorm(width)

    self.transformer = Transformer(width, layers, heads)

    self.ln_post = LayerNorm(width)
    self.proj = nn.Parameter(scale * torch.randn(width, output_dim))

def forward(self, x: torch.Tensor):
    x = self.conv1(x)  # shape = [*, width, grid, grid]
    x = x.reshape(x.shape[0], x.shape[1], -1)  # shape = [*, width, grid ** 2]
    x = x.permute(0, 2, 1)  # shape = [*, grid ** 2, width]
    x = torch.cat([self.class_embedding.to(x.dtype) + torch.zeros(x.shape[0], 1, x.shape[-1], dtype=x.dtype, device=x.device), x], dim=1)  # shape = [*, grid ** 2 + 1, width]
    x = x + self.positional_embedding.to(x.dtype)
    x = self.ln_pre(x)

    x = x.permute(1, 0, 2)  # NLD -> LND
    x = self.transformer(x)
    x = x.permute(1, 0, 2)  # LND -> NLD

    x = self.ln_post(x[:, 0, :])

    if self.proj is not None:
        x = x @ self.proj

    return x

class CLIP(nn.Module):
    def __init__(self,
                 embed_dim: int,
                 # vision
             image_resolution: int,
             vision_layers: Union[Tuple[int, int, int, int], int],
             vision_width: int,
             vision_patch_size: int,
             # text
             context_length: int,
             vocab_size: int,
             transformer_width: int,
             transformer_heads: int,
             transformer_layers: int
             ):
    super().__init__()

    self.context_length = context_length

    if isinstance(vision_layers, (tuple, list)):
        vision_heads = vision_width * 32 // 64
        self.visual = ModifiedResNet(
            layers=vision_layers,
            output_dim=embed_dim,
            heads=vision_heads,
            input_resolution=image_resolution,
            width=vision_width
        )
    else:
        vision_heads = vision_width // 64
        self.visual = VisionTransformer(
            input_resolution=image_resolution,
            patch_size=vision_patch_size,
            width=vision_width,
            layers=vision_layers,
            heads=vision_heads,
            output_dim=embed_dim
        )

    self.transformer = Transformer(
        width=transformer_width,
        layers=transformer_layers,
        heads=transformer_heads,
        attn_mask=self.build_attention_mask()
    )

    self.vocab_size = vocab_size
    self.token_embedding = nn.Embedding(vocab_size, transformer_width)
    self.positional_embedding = nn.Parameter(torch.empty(self.context_length, transformer_width))
    self.ln_final = LayerNorm(transformer_width)

    self.text_projection = nn.Parameter(torch.empty(transformer_width, embed_dim))
    self.logit_scale = nn.Parameter(torch.ones([]) * np.log(1 / 0.07))

    self.initialize_parameters()

def initialize_parameters(self):
    nn.init.normal_(self.token_embedding.weight, std=0.02)
    nn.init.normal_(self.positional_embedding, std=0.01)

    if isinstance(self.visual, ModifiedResNet):
        if self.visual.attnpool is not None:
            std = self.visual.attnpool.c_proj.in_features ** -0.5
            nn.init.normal_(self.visual.attnpool.q_proj.weight, std=std)
            nn.init.normal_(self.visual.attnpool.k_proj.weight, std=std)
            nn.init.normal_(self.visual.attnpool.v_proj.weight, std=std)
            nn.init.normal_(self.visual.attnpool.c_proj.weight, std=std)

        for resnet_block in [self.visual.layer1, self.visual.layer2, self.visual.layer3, self.visual.layer4]:
            for name, param in resnet_block.named_parameters():
                if name.endswith("bn3.weight"):
                    nn.init.zeros_(param)

    proj_std = (self.transformer.width ** -0.5) * ((2 * self.transformer.layers) ** -0.5)
    attn_std = self.transformer.width ** -0.5
    fc_std = (2 * self.transformer.width) ** -0.5
    for block in self.transformer.resblocks:
        nn.init.normal_(block.attn.in_proj_weight, std=attn_std)
        nn.init.normal_(block.attn.out_proj.weight, std=proj_std)
        nn.init.normal_(block.mlp.c_fc.weight, std=fc_std)
        nn.init.normal_(block.mlp.c_proj.weight, std=proj_std)

    if self.text_projection is not None:
        nn.init.normal_(self.text_projection, std=self.transformer.width ** -0.5)

def build_attention_mask(self):
    # lazily create causal attention mask, with full attention between the vision tokens
    # pytorch uses additive attention mask; fill with -inf
    mask = torch.empty(self.context_length, self.context_length)
    mask.fill_(float("-inf"))
    mask.triu_(1)  # zero out the lower diagonal
    return mask

@property
def dtype(self):
    return self.visual.conv1.weight.dtype

def encode_image(self, image):
    return self.visual(image.type(self.dtype))

def encode_text(self, text):
    x = self.token_embedding(text).type(self.dtype)  # [batch_size, n_ctx, d_model]

    x = x + self.positional_embedding.type(self.dtype)
    x = x.permute(1, 0, 2)  # NLD -> LND
    x = self.transformer(x)
    x = x.permute(1, 0, 2)  # LND -> NLD
    x = self.ln_final(x).type(self.dtype)

    # x.shape = [batch_size, n_ctx, transformer.width]
    # take features from the eot embedding (eot_token is the highest number in each sequence)
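    # Casting the token IDs to float64 before argmax is the "ArgMax Float64"
    # workaround described above: TensorRT rejects ArgMax/ArgMin when the input
    # tensor is INT32/INT64, so the argmax is computed on a float tensor instead.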
    x = x[torch.arange(x.shape[0]), text.to(torch.float64).argmax(dim=-1)] @ self.text_projection

    return x

def forward(self, image, text):
    image_features = self.encode_image(image).cpu()
    text_features = self.encode_text(text).cpu()

    # normalized features
    image_features = image_features / image_features.norm(dim=1, keepdim=True)
    text_features = text_features / text_features.norm(dim=1, keepdim=True)

    # cosine similarity as logits
    logit_scale = self.logit_scale.cpu().exp()
    logits_per_image = logit_scale * image_features @ text_features.t()
    logits_per_text = logits_per_image.t()

    # shape = [global_batch_size, global_batch_size]
    return logits_per_image, logits_per_text

def convert_weights(model: nn.Module):
    """Convert applicable model parameters to fp16"""

def _convert_weights_to_fp16(l):
    if isinstance(l, (nn.Conv1d, nn.Conv2d, nn.Linear)):
        l.weight.data = l.weight.data.half()
        if l.bias is not None:
            l.bias.data = l.bias.data.half()

    if isinstance(l, nn.MultiheadAttention):
        for attr in [*[f"{s}_proj_weight" for s in ["in", "q", "k", "v"]], "in_proj_bias", "bias_k", "bias_v"]:
            tensor = getattr(l, attr)
            if tensor is not None:
                tensor.data = tensor.data.half()

    for name in ["text_projection", "proj"]:
        if hasattr(l, name):
            attr = getattr(l, name)
            if attr is not None:
                attr.data = attr.data.half()

model.apply(_convert_weights_to_fp16)

def build_model(state_dict: dict):
    vit = "visual.proj" in state_dict

if vit:
    vision_width = state_dict["visual.conv1.weight"].shape[0]
    vision_layers = len([k for k in state_dict.keys() if k.startswith("visual.") and k.endswith(".attn.in_proj_weight")])
    vision_patch_size = state_dict["visual.conv1.weight"].shape[-1]
    grid_size = round((state_dict["visual.positional_embedding"].shape[0] - 1) ** 0.5)
    image_resolution = vision_patch_size * grid_size
else:
    counts: list = [len(set(k.split(".")[2] for k in state_dict if k.startswith(f"visual.layer{b}"))) for b in [1, 2, 3, 4]]
    vision_layers = tuple(counts)
    vision_width = state_dict["visual.layer1.0.conv1.weight"].shape[0]
    output_width = round((state_dict["visual.attnpool.positional_embedding"].shape[0] - 1) ** 0.5)
    vision_patch_size = None
    assert output_width ** 2 + 1 == state_dict["visual.attnpool.positional_embedding"].shape[0]
    image_resolution = output_width * 32

embed_dim = state_dict["text_projection"].shape[1]
context_length = state_dict["positional_embedding"].shape[0]
vocab_size = state_dict["token_embedding.weight"].shape[0]
transformer_width = state_dict["ln_final.weight"].shape[0]
transformer_heads = transformer_width // 64
transformer_layers = len(set(k.split(".")[2] for k in state_dict if k.startswith(f"transformer.resblocks")))

model = CLIP(
    embed_dim,
    image_resolution, vision_layers, vision_width, vision_patch_size,
    context_length, vocab_size, transformer_width, transformer_heads, transformer_layers
)

for key in ["input_resolution", "context_length", "vocab_size"]:
    if key in state_dict:
        del state_dict[key]

convert_weights(model)
model.load_state_dict(state_dict)
return model.eval()

- export.py
```python
import os
import skimage
from PIL import Image
import numpy as np
import torch
import onnx
from onnxsim import simplify
import clip

clip.available_models()

load_model = 'ViT-B/32'

model, preprocess = clip.load(load_model)
model.cuda().eval()
model.float()
input_resolution = model.visual.input_resolution
context_length = model.context_length
vocab_size = model.vocab_size

print("Model parameters:", f"{np.sum([int(np.prod(p.shape)) for p in model.parameters()]):,}")
print("Input resolution:", input_resolution)
print("Context length:", context_length)
print("Vocab size:", vocab_size)

clip.tokenize("Hello World!")

# images in skimage to use and their textual descriptions
descriptions = {
    "astronaut": "a portrait of an astronaut with the American flag",
}

original_images = []
images = []
texts = []

for filename in [filename for filename in os.listdir(skimage.data_dir) if filename.endswith(".png") or filename.endswith(".jpg")]:
    name = os.path.splitext(filename)[0]
    if name not in descriptions:
        continue

    image = Image.open(os.path.join(skimage.data_dir, filename)).convert("RGB")
    original_images.append(image)
    images.append(preprocess(image))
    texts.append(descriptions[name])

image_input = torch.tensor(np.stack(images)).cpu()#.cuda()
text_tokens = clip.tokenize(["This is " + desc for desc in texts]).cuda()

model.visual.cpu()
model.visual(image_input)[0] # astronaut pic embedding
model(image_input, text_tokens.to(torch.int32))[0] # astronaut text embedding

onnx_file = f"clip_{load_model.lower().replace('-','_').replace('/','_')}.onnx"

torch.onnx.export(
    model,
    (image_input, text_tokens),
    onnx_file,
    opset_version=11,
)

model_onnx2 = onnx.load(onnx_file)
model_simp, check = simplify(model_onnx2)
onnx.save(model_simp, onnx_file)
```
python export.py
onnxsim clip_vit_b_32.onnx clip_vit_b_32.onnx
onnx2trt clip_vit_b_32.onnx -o clip_vit_b_32.trt -b 1 -d 16 -v
[2022-04-02 13:07:15 WARNING] TensorRT was linked against cuBLAS/cuBLAS LT 11.6.5 but loaded cuBLAS/cuBLAS LT 11.6.1
[2022-04-02 13:07:15    INFO] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +8, now: CPU 4041, GPU 2430 (MiB)
[2022-04-02 13:07:15    INFO] [MemUsageChange] Init cuDNN: CPU +0, GPU +8, now: CPU 4041, GPU 2438 (MiB)
[2022-04-02 13:07:15    INFO] [MemUsageChange] TensorRT-managed allocation in building engine: CPU +4, GPU +289, now: CPU 4, GPU 289 (MiB)
Writing TensorRT engine to clip_vit_b_32.trt
All done
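(A quick CPU smoke test of the simplified ONNX before the TensorRT conversion can be done like this; only a sketch, building zero-filled inputs from the graph's own metadata so shapes and dtypes always match:)

```python
import numpy as np
import onnxruntime

# Smoke-test clip_vit_b_32.onnx with CPU ONNX Runtime, using dummy inputs
# built from the model's declared input shapes and dtypes.
sess = onnxruntime.InferenceSession("clip_vit_b_32.onnx", providers=["CPUExecutionProvider"])
dtype_map = {"tensor(float)": np.float32, "tensor(float16)": np.float16,
             "tensor(int32)": np.int32, "tensor(int64)": np.int64}
feeds = {}
for inp in sess.get_inputs():
    shape = [d if isinstance(d, int) else 1 for d in inp.shape]
    feeds[inp.name] = np.zeros(shape, dtype=dtype_map[inp.type])
outs = sess.run(None, feeds)
for meta, arr in zip(sess.get_outputs(), outs):
    print(meta.name, arr.shape, arr.dtype)
```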

(attached image: Netron visualization of clip_vit_b_32.onnx)

javiabellan commented 1 year ago

@PINTO0309 How did you get such a neat and clean ONNX visualization of your CLIP model? When I use Netron (even after simplifying my model with onnxsim), I get a messy graph like this:

(screenshot from 2022-12-01: a messy Netron graph)
PINTO0309 commented 1 year ago

@javiabellan Do not start another discussion on a closed issue.