Open · DanTaranis opened this issue 1 year ago
Sorry for the trouble. Please refer to the following code for ViT-16.
```python
img_size = [224, 56, 28]
feat_size = [56, 28, 14]
rel_scale1 = int(feat_size[0] / feat_size[2])  # 56 / 14 = 4
rel_scale2 = int(feat_size[1] / feat_size[2])  # 28 / 14 = 2

# Expand the 14x14 mask to the stage-1 resolution (56x56): each masked token
# is repeated rel_scale1 x rel_scale1 times, preserving spatial layout.
mask_for_patch1 = (
    mask.reshape(-1, feat_size[-1], feat_size[-1])
    .unsqueeze(-1)
    .repeat(1, 1, 1, rel_scale1 ** 2)
    .reshape(-1, feat_size[-1], feat_size[-1], rel_scale1, rel_scale1)
    .permute(0, 1, 3, 2, 4)
    .reshape(x.shape[0], feat_size[0], feat_size[0])
    .unsqueeze(1)
)
```
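For completeness, a sketch of the corresponding `mask_for_patch2` following the same pattern (an assumption based on the snippet above, using `rel_scale2` and `feat_size[1]`; adjust to your actual code):

```python
# Sketch: same expansion as mask_for_patch1, but upsampling the 14x14 mask
# by rel_scale2 (= 2) to the stage-2 resolution (28x28).
mask_for_patch2 = (
    mask.reshape(-1, feat_size[-1], feat_size[-1])
    .unsqueeze(-1)
    .repeat(1, 1, 1, rel_scale2 ** 2)
    .reshape(-1, feat_size[-1], feat_size[-1], rel_scale2, rel_scale2)
    .permute(0, 1, 3, 2, 4)
    .reshape(x.shape[0], feat_size[1], feat_size[1])
    .unsqueeze(1)
)
```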
You also need to modify the stride of `self.stage1_output_decode` / `self.stage2_output_decode` accordingly; see the sketch below.
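A minimal sketch of what that looks like (assuming the decode layers are plain `nn.Conv2d` projections as in the repo, and ConvMAE-Base embedding dims; adapt the names and values to your configuration):

```python
import torch.nn as nn

embed_dim = [256, 384, 768]    # assumption: per-stage embedding dims (ConvMAE-Base)
rel_scale1, rel_scale2 = 4, 2  # 56/14 and 28/14 for the default ViT-16 setting

# Inside MaskedAutoencoderConvViT.__init__, the decode projections downsample
# the stage-1 (56x56) and stage-2 (28x28) feature maps to the final 14x14 grid,
# so their kernel size / stride must match the relative scale factors above.
stage1_output_decode = nn.Conv2d(embed_dim[0], embed_dim[2],
                                 kernel_size=rel_scale1, stride=rel_scale1)
stage2_output_decode = nn.Conv2d(embed_dim[1], embed_dim[2],
                                 kernel_size=rel_scale2, stride=rel_scale2)
```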
Hi - I'd like to use patches of size 32x32, and a smaller model in general, but anything I change breaks the code. It would be really helpful if you refactored out all of the places that hard-code 4, 2, 16, etc. throughout `MaskedAutoencoderConvViT`.
Thanks, Dan