Some have nice tables comparing the variants of their models, e.g.
Watch out: some models are fine-tuned on ImageNet-1k and some on ImageNet-21k/22k.
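One rough way to check is the size of a checkpoint's classification head (ImageNet-1k has 1000 classes, ImageNet-21k around 21841), e.g.:

```python
from transformers import AutoConfig

# Sketch: infer the fine-tuning dataset from the head size.
# 1000 labels -> ImageNet-1k; ~21841 -> ImageNet-21k/22k.
config = AutoConfig.from_pretrained("google/vit-base-patch16-224")
print(config.num_labels)  # 1000, so this checkpoint is fine-tuned on ImageNet-1k
```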
Base classes relevant for the above (the forward methods can help identify which layer/module we want to take outputs from, for example):
`ViTForImageClassification`:

```python
outputs = self.vit(pixel_values)
sequence_output = outputs[0]
logits = self.classifier(sequence_output[:, 0, :])
```

`DeiTForImageClassification`:

```python
outputs = self.deit(pixel_values)
sequence_output = outputs[0]
logits = self.classifier(sequence_output[:, 0, :])
```

`BeitForImageClassification`:

```python
outputs = self.beit(pixel_values, return_dict=return_dict)
pooled_output = outputs.pooler_output if return_dict else outputs[1]
logits = self.classifier(pooled_output)
```

`SwinForImageClassification`:

```python
outputs = self.swin(pixel_values)
pooled_output = outputs[1]
logits = self.classifier(pooled_output)
```

`CvtForImageClassification`:

```python
outputs = self.cvt(
    pixel_values,
    output_hidden_states=output_hidden_states,
    return_dict=return_dict,
)
sequence_output = outputs[0]
cls_token = outputs[1]
if self.config.cls_token[-1]:
    sequence_output = self.layernorm(cls_token)
else:
    batch_size, num_channels, height, width = sequence_output.shape
    # rearrange "b c h w -> b (h w) c"
    sequence_output = sequence_output.view(batch_size, num_channels, height * width).permute(0, 2, 1)
    sequence_output = self.layernorm(sequence_output)

sequence_output_mean = sequence_output.mean(dim=1)
logits = self.classifier(sequence_output_mean)
```

`LevitForImageClassification`:

```python
outputs = self.levit(pixel_values)
sequence_output = outputs[0]
sequence_output = sequence_output.mean(1)
logits = self.classifier(sequence_output)
```

`MobileViTForImageClassification`:

```python
outputs = self.mobilevit(pixel_values, return_dict=return_dict)
pooled_output = outputs.pooler_output if return_dict else outputs[1]
logits = self.classifier(self.dropout(pooled_output))
```

`MobileViTV2ForImageClassification`:

```python
outputs = self.mobilevitv2(pixel_values, return_dict=return_dict)
pooled_output = outputs.pooler_output if return_dict else outputs[1]
logits = self.classifier(pooled_output)
```
Maybe also the `WithTeacher` classes (to investigate), e.g.:
`DeiTForImageClassificationWithTeacher`:

```python
outputs = self.deit(pixel_values)
sequence_output = outputs[0]
cls_logits = self.cls_classifier(sequence_output[:, 0, :])
distillation_logits = self.distillation_classifier(sequence_output[:, 1, :])
logits = (cls_logits + distillation_logits) / 2
```

`LevitForImageClassificationWithTeacher`:

```python
outputs = self.levit(pixel_values)
sequence_output = outputs[0]
sequence_output = sequence_output.mean(1)
cls_logits, distill_logits = self.classifier(sequence_output), self.classifier_distill(sequence_output)
logits = (cls_logits + distill_logits) / 2
```
All use a `self.<ARCHITECTURE_NAME>` call followed by a `self.classifier` call, with some kind of processing/slicing in between to either select the first row of the output (e.g. corresponding to the input `<cls>` token in ViT) or to pool/mean across the per-patch outputs in some way. The `WithTeacher` classes have an additional `distillation_classifier`. We want the processed `sequence_output`/`pooled_output` values as our features for most of the metrics.
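As a rough sketch of what that means in practice (untested; it uses the base `ViTModel` so no classifier head is applied, and the image path is a placeholder):

```python
import torch
from PIL import Image
from transformers import AutoImageProcessor, ViTModel

processor = AutoImageProcessor.from_pretrained("google/vit-base-patch16-224")
model = ViTModel.from_pretrained("google/vit-base-patch16-224")

image = Image.open("example.jpg")  # placeholder image
inputs = processor(images=image, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Same slicing as ViTForImageClassification above: first row = <cls> token.
features = outputs.last_hidden_state[:, 0, :]
```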
Putting Jack's note from Slack here for posterity:

> Looking at the HuggingFace source code I think setting `num_labels=0` when loading the models might give us the features without needing to implement anything else ourselves
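If I'm reading the source right (untested sketch, using ViT as the example), the classifier becomes `nn.Identity()` when `config.num_labels == 0`, so the returned `logits` would be the features themselves; the checkpoint's classifier weights should just be dropped with a warning:

```python
from transformers import ViTForImageClassification

model = ViTForImageClassification.from_pretrained(
    "google/vit-base-patch16-224",
    num_labels=0,  # classifier head becomes nn.Identity(), so logits == features
)
```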
Removing this from WP1 now. We have some initial results with:
Need to pick a bigger pool of models for the later work packages.
A collection of every ImageNet-pretrained model available in the HuggingFace transformers library, grouped by uploader (see the hub API sketch after the list):
- Google
- Facebook
- Microsoft
- Apple
- Intel
- Visual-Attention-Network
- Matthijs
- optimum
- fxmarty
- shi-labs
- Xrenya
- Snarci
- MBZUAI
- shehan97
- Zetatech
- tensorgirl
- grlh11
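A hedged sketch for reproducing a list like this via the hub API (whether a model shows up depends entirely on how its uploader tagged it):

```python
from huggingface_hub import HfApi

api = HfApi()
# transformers-compatible image-classification models tagged with imagenet-1k
for m in api.list_models(
    library="transformers",
    task="image-classification",
    trained_dataset="imagenet-1k",
):
    print(m.id)
```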
~80 models listed above. Do we have the scope to fine-tune all of them on every dataset we pick?
And do we have stipulations on the source of these models other than compatibility with our framework? I've ordered the sources (i.e. where these models are from and who is responsible for uploading them to the model hub) with well-known companies first.
Should this be closed?
For phase 1:
`model_name`, `model_size`, `imagenet-1k-top1-accuracy`
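A minimal sketch for collecting the first two columns (`model_row` is a made-up helper; `model_size` here is the total parameter count, and the top-1 accuracy would still have to come from each model card):

```python
from transformers import AutoModelForImageClassification

def model_row(model_name: str) -> dict:
    # Hypothetical helper: model_size = total number of parameters.
    model = AutoModelForImageClassification.from_pretrained(model_name)
    return {
        "model_name": model_name,
        "model_size": sum(p.numel() for p in model.parameters()),
    }

print(model_row("google/vit-base-patch16-224"))
```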