CaedenMotley opened this issue 9 months ago
I guess a better question here is whether there is any way to receive the pixel data in relation to features before it is normalized and attributed to a patch, rather than the pixels themselves. @patricklabatut I was wondering if you may be able to provide insight on this task. Thank you!
I'm not sure I understand what you're trying to do, but if you wanted something like a 'per-pixel feature vector', the most straight-forward way would be to reshape the patch tokens back into an image-like tensor (e.g. turn 1x225x1024 into 1x15x15x1024) and then interpolate that up to 210x210x1024.
The code would be something like:

```python
import torch

# Assume we have patch tokens (1x225x1024), e.g. from the model output
patch_tokens = torch.randn(1, 225, 1024)  # placeholder for the real tokens

# Convert to 1x1024x15x15
# (need height & width in the last spots for interpolation)
imagelike_tokens = patch_tokens.view(1, 15, 15, -1).permute(0, 3, 1, 2)

# Scale up to 'per-pixel' sizing
# (has shape 1x1024x210x210)
perpixel_tokens = torch.nn.functional.interpolate(
    imagelike_tokens,
    size=(210, 210),
    mode="bilinear",
)
```
I might have the reshaping/ordering wrong, I guess it depends on how the patch tokens were flattened to begin with, but hopefully the idea makes sense.
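For reference, here's a rough end-to-end sketch of how the patch tokens might be obtained and then upsampled this way. It assumes the torch.hub entry point "dinov2_vitl14" and the "x_norm_patchtokens" key from forward_features (as mentioned elsewhere in this thread), and it uses a random tensor in place of a real, normalized image, so treat it as a sketch rather than a drop-in solution:

```python
import torch

# Assumed hub entry point for dinov2 vit-large (patch size 14, 1024-dim features)
model = torch.hub.load("facebookresearch/dinov2", "dinov2_vitl14")
model.eval()

# Placeholder 210x210 input (a real image should be normalized first)
image = torch.randn(1, 3, 210, 210)

with torch.no_grad():
    patch_tokens = model.forward_features(image)["x_norm_patchtokens"]  # 1x225x1024

# Same reshape + interpolation as above
imagelike_tokens = patch_tokens.view(1, 15, 15, -1).permute(0, 3, 1, 2)  # 1x1024x15x15
perpixel_tokens = torch.nn.functional.interpolate(
    imagelike_tokens, size=(210, 210), mode="bilinear"
)  # 1x1024x210x210

# 'Per-pixel' feature vector at pixel (y=100, x=50)
feature_at_pixel = perpixel_tokens[0, :, 100, 50]  # shape: (1024,)
```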
@heyoeyo That is what I am currently doing, but the issue here is that it applies the same patch data to all the pixels, rather than the pixels having their own individual features. I am trying to get back pixel features from dinov2 rather than patch features. Put simply, upsampling does map the patch feature to pixels, but each pixel will just have the same features as its patch, not its own unique features.
As far as I understand, the whole reason for the patch embedding is to reduce the number of unique 'pixels' that the model works on, since transformer-based models scale quadratically with the number of inputs. So going from a 14px patch size to a 1px (i.e. per-pixel) patch would require 14^2 ≈ 196 times more tokens, and therefore at least ~200 times more compute and memory, which isn't feasible to run/train on current hardware. That being said, you could try reducing the patch size down to 1px (or as low as your hardware can handle) and get features that way? Although the existing model weights & positional embeddings aren't trained for that kind of input, so without additional training, the outputs may not make sense.
Alternatively, you could try using convolution-based models, since they scale better. I think something like a U-net model should be able to generate a dense per-pixel output.
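To put rough numbers on the scaling point above, here's a back-of-the-envelope sketch that only counts tokens for a 210x210 input and assumes attention cost grows with the square of the token count (illustrative numbers, not a benchmark):

```python
# Back-of-the-envelope token counts for a 210x210 input at different patch sizes.
# Assumes self-attention cost grows roughly with the square of the token count.
image_hw = 210
baseline_tokens = (image_hw // 14) ** 2  # 225 tokens at the default 14px patch size

for patch_size in (14, 7, 2, 1):
    num_tokens = (image_hw // patch_size) ** 2
    rel_attention_cost = (num_tokens / baseline_tokens) ** 2
    print(f"patch={patch_size:>2}px  tokens={num_tokens:>6}  ~{rel_attention_cost:,.0f}x attention cost vs. 14px")
```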
Hello, I want to know if the patches are in the original space. I mean, the image is split into patches and fed into the model, and the patches have their own spatial relationship. Can this spatial relationship be recovered from the embedding of shape (1, 225, 1024)? Can I just rearrange it back into its original order?
> Hello, I want to know if the patches are in the original space
I'm not sure what you mean by 'original space', but as you mentioned, the patches do have their own spatial relationship. They're basically a smaller (in terms of width & height) version of the original input image.
The output of the patch embedding model usually converts the patches from an 'image-like' shape into a 'rows of tokens' shape (here's a simple example diagram), which is more convenient for the vision transformer, but makes the spatial relationship harder to follow.
If you'd like to re-arrange the patches from the 'rows of tokens' format back into something like an image, you can copy the code from the patch embedding model:
```python
image_like_patches = rows_of_patches.reshape(-1, H, W, C)
```
Where W and H are the original width & height of the patches (for dinov2, this will be equal to the input image width or height divided by 14) and C is the number of features per patch (aka the 'embedding dimension'), which varies by model (I think it's 384/768/1024 for vit small/base/large, for example).
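As a concrete sketch with the numbers from this thread (a 210x210 input with 14px patches, so a 15x15 grid, and 1024-dim features for vit-large), using a random tensor as a stand-in for the real tokens:

```python
import torch

# Stand-in for the real patch tokens: 225 tokens of dimension 1024
rows_of_patches = torch.randn(1, 225, 1024)

H = W = 210 // 14   # 15 patches along each side of a 210x210 input
C = 1024            # embedding dimension for vit-large

# 'Rows of tokens' -> image-like grid of patches (1x15x15x1024)
image_like_patches = rows_of_patches.reshape(-1, H, W, C)

# The patch at grid position (row=3, col=7) covers image rows 3*14..4*14-1
# and columns 7*14..8*14-1 of the original 210x210 image
patch_feature = image_like_patches[0, 3, 7]  # shape: (1024,)
```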
I am trying to retrieve the pixel data in relation to features, rather than patches in relation to features. My issue is that I cannot seem to find a way to recover the original pixel data contained within the patches. For example, when a 210 x 210 image is passed through, it returns an "x_norm_patchtokens" tensor of shape (1, 225, 1024). I would like to somehow transform this to be (210 x 210 x 1024) using the pixels contained within each patch, rather than just the patch as a singular element. My original thought was to reshape into (√(225 x 14^2), √(225 x 14^2), 1024), but this will obviously yield a size much greater than the original tensor. Is this retrieval possible, and if so, any help as to how would be greatly appreciated. Thank you!
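To illustrate the size mismatch (just counting elements):

```python
import torch

# The (1, 225, 1024) patch tokens don't contain enough values to be reshaped
# directly into a (210, 210, 1024) per-pixel tensor
patch_tokens = torch.randn(1, 225, 1024)
print(patch_tokens.numel())   # 230,400 values available
print(210 * 210 * 1024)       # 45,158,400 values needed
```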