Closed alexatartaglini closed 2 years ago
One detail we should double-check when implementing this: a base ViT makes its decision using the embedding at the very first position of the transformer network (see the extra learnable [class] embedding in Figure 1). So we'd probably want to replicate this by extracting only that embedding, rather than the embeddings for all image patches plus this one, for our simulations.
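A minimal sketch of that extraction, assuming the encoder output is an array of shape (batch, 1 + num_patches, hidden_dim) with the [class] token at index 0. The function name `extract_class_embedding` and the random stand-in tensor are hypothetical, for illustration only; the same slicing should apply to a framework tensor, but that depends on how the model we end up using lays out its outputs.

```python
import numpy as np


def extract_class_embedding(tokens):
    """Return only the [class] embedding from a ViT encoder output.

    Assumes `tokens` has shape (batch, 1 + num_patches, hidden_dim)
    and that the [class] token sits at sequence position 0.
    """
    if tokens.ndim != 3:
        raise ValueError(f"expected a 3-D array, got {tokens.ndim} dimensions")
    # Keep position 0 of the sequence axis; drop all patch embeddings.
    return tokens[:, 0, :]


# Stand-in for real encoder output (random values, not model activations).
# Shape mirrors ViT-Base/16 at 224x224: 14 * 14 = 196 patches + 1 [class] token.
rng = np.random.default_rng(0)
tokens = rng.standard_normal((2, 197, 768))

cls_embedding = extract_class_embedding(tokens)
print(cls_embedding.shape)  # (2, 768)
```

Whichever pre-trained implementation we pick, it's worth confirming that it actually puts the [class] token first before relying on index 0.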
Need to implement the supervised ViTs from this paper: https://arxiv.org/abs/2010.11929
Pre-trained weights can be downloaded from one of two sources: