Convert model to OpenVINO Intermediate Representation (IR) format
PyTorch tutorial: Exporting a Model from PyTorch to ONNX and Running it using ONNX Runtime
https://huggingface.co/laion/CLIP-ViT-L-14-DataComp.XL-s13B-b90K
https://huggingface.co/laion/CLIP-ViT-H-14-laion2B-s32B-b79K
https://github.com/facebookresearch/MetaCLIP#pre-trained-models
https://github.com/LAION-AI/CLIP_benchmark/blob/main/benchmark.png (as of Oct 17, 2022)
https://github.com/mlfoundations/open_clip/blob/main/docs/openclip_results.csv (as of Oct 22, 2023)
https://github.com/huggingface/pytorch-image-models/tree/main/results (as of May 25, 2023)
Learning Transferable Visual Models From Natural Language Supervision [Submitted on 26 Feb 2021]
Reproducible scaling laws for contrastive language-image learning [Submitted on 14 Dec 2022]
Runnable version of the export snippet (the `openai/clip-vit-base-patch16` checkpoint and the dummy tracing inputs are assumptions inferred from the output file name):

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Checkpoint inferred from the output file name below (assumption).
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch16").eval()
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch16")

# A dummy image/text pair is enough for tracing; the axes are marked dynamic below.
inputs = processor(
    text=["a photo of a cat"],
    images=Image.new("RGB", (224, 224)),
    return_tensors="pt",
    padding=True,
)

torch.onnx.export(
    model,  # model being run
    # Model input in one of the accepted formats: torch.Tensor (single input),
    # tuple/list of tensors (multiple inputs), or dict of input name -> tensor.
    dict(inputs),
    "clip-vit-base-patch16.onnx",  # where to save the model
    opset_version=14,  # the ONNX opset version to export the model to
    input_names=["input_ids", "pixel_values", "attention_mask"],  # the model's input names
    output_names=["logits_per_image", "logits_per_text", "text_embeds", "image_embeds"],  # the model's output names
    dynamic_axes={  # variable-length axes
        "input_ids": {0: "batch", 1: "sequence"},
        "pixel_values": {0: "batch", 1: "num_channels", 2: "height", 3: "width"},
        "attention_mask": {0: "batch", 1: "sequence"},
        "logits_per_image": {0: "batch"},
        "logits_per_text": {0: "batch"},
        "text_embeds": {0: "batch"},
        "image_embeds": {0: "batch"},
    },
)
```
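To sanity-check the export, the same `inputs` can be fed to ONNX Runtime (a sketch, assuming `onnxruntime` is installed):

```python
import numpy as np
import onnxruntime as ort

session = ort.InferenceSession(
    "clip-vit-base-patch16.onnx", providers=["CPUExecutionProvider"]
)
# Reuse the tracing inputs from the export step, converted to NumPy.
onnx_inputs = {name: tensor.numpy() for name, tensor in dict(inputs).items()}
(logits_per_image,) = session.run(["logits_per_image"], onnx_inputs)
# Softmax over the text prompts for each image.
probs = np.exp(logits_per_image) / np.exp(logits_per_image).sum(-1, keepdims=True)
print(probs)
```

From there, the ONNX file can be converted to OpenVINO IR. A minimal sketch, assuming the OpenVINO 2023.1+ Python API (older releases ship the `mo` command-line Model Optimizer instead):

```python
import openvino as ov

ov_model = ov.convert_model("clip-vit-base-patch16.onnx")
ov.save_model(ov_model, "clip-vit-base-patch16.xml")  # writes the .xml/.bin IR pair
```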
Intro
See the example code for details.
The CLIP multimodal model enables zero-shot image classification. We have tested it on multiple datasets: as long as an appropriate prompt is provided, the model is over 99.9% accurate.
We only need to write `positive_labels` and `negative_labels` based on the cue words of a known challenge (`image_binary_challenge`). When a prompt that has never been processed before is encountered, the program automatically converts and adapts it for the binary classification task, as sketched below.
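For illustration, a minimal sketch of such a dichotomy with the Hugging Face CLIP API (the cue words and checkpoint here are placeholders, not the project's actual label tables):

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Illustrative cue words; the real tables are derived from the challenge prompt.
positive_labels = ["a photo of an off-road vehicle"]
negative_labels = ["a photo of a sedan", "a photo of a bicycle"]
prompts = positive_labels + negative_labels

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch16").eval()
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch16")

image = Image.new("RGB", (224, 224))  # placeholder for a challenge tile
inputs = processor(text=prompts, images=image, return_tensors="pt", padding=True)

with torch.no_grad():
    probs = model(**inputs).logits_per_image.softmax(dim=-1)[0]

# The tile counts as "positive" when the probability mass lands on positive prompts.
is_positive = probs[: len(positive_labels)].sum().item() > 0.5
```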
We also reproduced the preprocessing module in pure NumPy, so there is no need to rely on PyTorch for this step.
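A minimal sketch of that idea, assuming the standard CLIP preprocessing constants (resize and center-crop to 224, scale to [0, 1], normalize, transpose to NCHW); the function name is illustrative:

```python
import numpy as np
from PIL import Image

# Standard CLIP normalization constants.
CLIP_MEAN = np.array([0.48145466, 0.4578275, 0.40821073], dtype=np.float32)
CLIP_STD = np.array([0.26862954, 0.26130258, 0.27577711], dtype=np.float32)

def preprocess(image: Image.Image, size: int = 224) -> np.ndarray:
    """Resize the shorter side to `size`, center-crop, normalize, return NCHW."""
    w, h = image.size
    scale = size / min(w, h)
    image = image.convert("RGB").resize(
        (round(w * scale), round(h * scale)), Image.BICUBIC
    )
    left, top = (image.width - size) // 2, (image.height - size) // 2
    image = image.crop((left, top, left + size, top + size))
    array = np.asarray(image, dtype=np.float32) / 255.0  # HWC, [0, 1]
    array = (array - CLIP_MEAN) / CLIP_STD
    return array.transpose(2, 0, 1)[None]  # (1, 3, size, size)
```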
By default, we use the `RN50.openai` specification of the model for classification tasks. We encapsulate both the ONNX branch and the ViT transformers-pipeline branch so that the program switches automatically: when torch and transformers are installed in your runtime environment and a CUDA GPU is available, the transformers pipeline is used; otherwise it falls back to ONNX running on the CPU.

https://github.com/QIN2DIM/hcaptcha-challenger/blob/901afd1dbf97ac25191ec6ea2398daab9db97773/hcaptcha_challenger/onnx/modelhub.py#L245-L259
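A simplified sketch of that decision (the function name is illustrative; the actual logic lives in the linked modelhub.py):

```python
def choose_backend() -> str:
    """Prefer the transformers pipeline when torch + CUDA are usable, else ONNX on CPU."""
    try:
        import torch
        import transformers  # noqa: F401  # only probing availability

        if torch.cuda.is_available():
            return "transformers-pipeline"
    except ImportError:
        pass
    return "onnx-cpu"
```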
DEMO
https://github.com/QIN2DIM/hcaptcha-challenger/blob/d38be1b3f148f4368e77b87cb74bccc826cf3117/src/objects.yaml#L553-L574