Some have nice tables comparing the variants of their models, e.g.
Watch out: some models are fine-tuned on ImageNet-1k and some on ImageNet-21k/22k.
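One rough way to check is the size of a checkpoint's classification head (ImageNet-1k has 1000 classes, ImageNet-21k around 21841), e.g.:

```python
from transformers import AutoConfig

# Sketch: infer the fine-tuning dataset from the head size.
# 1000 labels -> ImageNet-1k; ~21841 -> ImageNet-21k/22k.
config = AutoConfig.from_pretrained("google/vit-base-patch16-224")
print(config.num_labels)  # 1000, so this checkpoint is fine-tuned on ImageNet-1k
```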
Base classes relevant for the above (the forward methods can help identify which layer/module we want to take outputs from, for example):
`ViTForImageClassification`:

```python
outputs = self.vit(pixel_values)
sequence_output = outputs[0]
logits = self.classifier(sequence_output[:, 0, :])
```

`DeiTForImageClassification`:

```python
outputs = self.deit(pixel_values)
sequence_output = outputs[0]
logits = self.classifier(sequence_output[:, 0, :])
```

`BeitForImageClassification`:

```python
outputs = self.beit(pixel_values, return_dict=return_dict)
pooled_output = outputs.pooler_output if return_dict else outputs[1]
logits = self.classifier(pooled_output)
```

`SwinForImageClassification`:

```python
outputs = self.swin(pixel_values)
pooled_output = outputs[1]
logits = self.classifier(pooled_output)
```

`CvtForImageClassification`:

```python
outputs = self.cvt(
    pixel_values,
    output_hidden_states=output_hidden_states,
    return_dict=return_dict,
)
sequence_output = outputs[0]
cls_token = outputs[1]
if self.config.cls_token[-1]:
    sequence_output = self.layernorm(cls_token)
else:
    batch_size, num_channels, height, width = sequence_output.shape
    # rearrange "b c h w -> b (h w) c"
    sequence_output = sequence_output.view(batch_size, num_channels, height * width).permute(0, 2, 1)
    sequence_output = self.layernorm(sequence_output)

sequence_output_mean = sequence_output.mean(dim=1)
logits = self.classifier(sequence_output_mean)
```

`LevitForImageClassification`:

```python
outputs = self.levit(pixel_values)
sequence_output = outputs[0]
sequence_output = sequence_output.mean(1)
logits = self.classifier(sequence_output)
```

`MobileViTForImageClassification`:

```python
outputs = self.mobilevit(pixel_values, return_dict=return_dict)
pooled_output = outputs.pooler_output if return_dict else outputs[1]
logits = self.classifier(self.dropout(pooled_output))
```

`MobileViTV2ForImageClassification`:

```python
outputs = self.mobilevitv2(pixel_values, return_dict=return_dict)
pooled_output = outputs.pooler_output if return_dict else outputs[1]
logits = self.classifier(pooled_output)
```
Maybe also the `WithTeacher` classes (to investigate), e.g.:
`DeiTForImageClassificationWithTeacher`:

```python
outputs = self.deit(pixel_values)
sequence_output = outputs[0]
cls_logits = self.cls_classifier(sequence_output[:, 0, :])
distillation_logits = self.distillation_classifier(sequence_output[:, 1, :])
logits = (cls_logits + distillation_logits) / 2
```

`LevitForImageClassificationWithTeacher`:

```python
outputs = self.levit(pixel_values)
sequence_output = outputs[0]
sequence_output = sequence_output.mean(1)
cls_logits, distill_logits = self.classifier(sequence_output), self.classifier_distill(sequence_output)
logits = (cls_logits + distill_logits) / 2
```
All use a `self.<ARCHITECTURE_NAME>` call followed by a `self.classifier` call, with some kind of processing/slicing in between to either select the first row of the output (e.g. corresponding to the input `<cls>` token in ViT) or to pool/mean across the per-patch outputs in some way. The `WithTeacher` classes have an additional `distillation_classifier`. We want the processed `sequence_output`/`pooled_output` values as our features for most of the metrics.
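As a rough sketch of what that means in practice (untested; it uses the base `ViTModel` so no classifier head is applied, and the image path is a placeholder):

```python
import torch
from PIL import Image
from transformers import AutoImageProcessor, ViTModel

processor = AutoImageProcessor.from_pretrained("google/vit-base-patch16-224")
model = ViTModel.from_pretrained("google/vit-base-patch16-224")

image = Image.open("example.jpg")  # placeholder image
inputs = processor(images=image, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Same slicing as ViTForImageClassification above: first row = <cls> token.
features = outputs.last_hidden_state[:, 0, :]
```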
Putting Jack's note from Slack here for posterity:

> Looking at the HuggingFace source code I think setting `num_labels=0` when loading the models might give us the features without needing to implement anything else ourselves
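If I'm reading the source right (untested sketch, using ViT as the example), the classifier becomes `nn.Identity()` when `config.num_labels == 0`, so the returned `logits` would be the features themselves; the checkpoint's classifier weights should just be dropped with a warning:

```python
from transformers import ViTForImageClassification

model = ViTForImageClassification.from_pretrained(
    "google/vit-base-patch16-224",
    num_labels=0,  # classifier head becomes nn.Identity(), so logits == features
)
```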
Removing this from WP1 now. We have some initial results with:
Need to pick a bigger pool of models for the later work packages.
A collection of every ImageNet-pretrained model available in the HuggingFace transformers library, grouped by uploader (see the hub API sketch after the list):
- Google
- Facebook
- Microsoft
- Apple
- Intel
- Visual-Attention-Network
- Matthijs
- optimum
- fxmarty
- shi-labs
- Xrenya
- Snarci
- MBZUAI
- shehan97
- Zetatech
- tensorgirl
- grlh11
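A hedged sketch for reproducing a list like this via the hub API (whether a model shows up depends entirely on how its uploader tagged it):

```python
from huggingface_hub import HfApi

api = HfApi()
# transformers-compatible image-classification models tagged with imagenet-1k
for m in api.list_models(
    library="transformers",
    task="image-classification",
    trained_dataset="imagenet-1k",
):
    print(m.id)
```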
~80 models listed above. Do we have the scope to fine-tune all of them on every dataset we pick?
And do we have stipulations on the source of these models other than compatibility with our framework? I've ordered the sources (i.e. where these models are from and who is responsible for uploading them to the model hub) with well-known companies first.
Should this be closed?
For phase 1:
`model_name`, `model_size`, `imagenet-1k-top1-accuracy`
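A minimal sketch for collecting the first two columns (`model_row` is a made-up helper; `model_size` here is the total parameter count, and the top-1 accuracy would still have to come from each model card):

```python
from transformers import AutoModelForImageClassification

def model_row(model_name: str) -> dict:
    # Hypothetical helper: model_size = total number of parameters.
    model = AutoModelForImageClassification.from_pretrained(model_name)
    return {
        "model_name": model_name,
        "model_size": sum(p.numel() for p in model.parameters()),
    }

print(model_row("google/vit-base-patch16-224"))
```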