IBM / terratorch

a Python toolkit for fine-tuning Geospatial Foundation Models (GFMs).
Apache License 2.0

Hardcoded values for patch-size in the ViT backbone #112

Closed Foxigod closed 1 month ago

Foxigod commented 2 months ago

Describe the issue I'm encountering issues during fine-tuning that seem to stem from the patch size being hardcoded as 16x16 in the ViT backbone. While there are multiple cases of this in the vit_encoder_decoder.py file, one specific location is lines 438-440. I believe these hardcoded values of 16 should actually reference the patch size that the model was instantiated with.

Reproduce the issue

  1. Set up a CLI config file to instantiate a model with a patch size different from 16. In my case 3, with an image size of 15x15 pixels.
  2. Bypass the reuse of patch-embeddings if the pre-trained model was trained on a patch-size of e.g. 16. I did this by monkey-patching the prithvi_select_patch_embed_weights function. (This could be another issue by itself).
  3. Run the fine-tuning from command line.
  4. Encounter issues like these:
    File "/p/project1/geofm4eo/eli1/terratorch/terratorch/models/backbones/vit_encoder_decoder.py", line 442, in forward_features
    x = x + pos_embed[1:, :]
        ~~^~~~~~~~~~~~~~~~~~
    RuntimeError: The size of tensor a (300) must match the size of tensor b (0) at non-singleton dimension 1
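The shape mismatch above can be reproduced in isolation. The following is a minimal sketch (the variable names are mine, not terratorch's) of how the patch embedding and the hardcoded positional-embedding grid diverge: the patch embedding uses the actual patch size, while lines 438-440 implicitly assume 16, so a 15x15 image yields a zero-sized grid.

```python
# Hypothetical reconstruction of the mismatch; numbers match my config
# (15x15 image, patch size 3), not any terratorch defaults.
img_size = 15
patch_size = 3

# Tokens actually produced by the patch embedding:
tokens_per_side = img_size // patch_size      # 5
num_tokens = tokens_per_side ** 2             # 25 per frame

# Grid implied by the hardcoded 16 in forward_features:
hardcoded_side = img_size // 16               # 0
hardcoded_tokens = hardcoded_side ** 2        # 0 -> pos_embed[1:, :] is empty

print(num_tokens, hardcoded_tokens)           # 25 0
```

With multiple temporal frames the token count is multiplied accordingly, which is consistent with tensor a having size 300 while tensor b has size 0.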

Deployment information (optional) I installed terratorch by cloning the repo and running pip3 install -e <cloned_directory>. If I run git rev-parse HEAD I get the following output:

3b763912d54594f5e89092ca5999178b399c2a10

which is this commit.

Joao-L-S-Almeida commented 2 months ago

@CarlosGomes98 I would like to know the references used to implement this transform, but I understand that here:

if embed_dim % 16 != 0:
    msg = "Embed dim must be divisible by 16"
    raise Exception(msg)

We should check the division by patch_size. The same here:

w_embed_dim = embed_dim // 16 * 6
h_embed_dim = embed_dim // 16 * 6
t_embed_dim = embed_dim // 16 * 4

But I'm not sure about the other hard-coded integer values.

Foxigod commented 2 months ago

I was under the impression that it was precisely not this particular segment of the code that is causing these issues. Correct me if I'm wrong, but this segment seems to divide the embedding dimension between the width-spatial embedding, the height-spatial embedding, and the temporal embedding, in a manner that requires it to be divisible by 16.
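To illustrate why I think that 16 is unrelated to the patch size: it is simply the sum of the split ratios 6 + 6 + 4, so the three parts always recombine to the full embedding dimension. A quick sketch (using 768 as an arbitrary example embedding dimension):

```python
# The 16 here is the denominator of the 6/16 + 6/16 + 4/16 split of the
# embedding dimension, not the spatial patch size.
embed_dim = 768  # arbitrary example value, divisible by 16

w_embed_dim = embed_dim // 16 * 6  # 288, width-spatial part
h_embed_dim = embed_dim // 16 * 6  # 288, height-spatial part
t_embed_dim = embed_dim // 16 * 4  # 192, temporal part

# The parts partition the embedding exactly:
print(w_embed_dim + h_embed_dim + t_embed_dim == embed_dim)  # True
```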

The segment of code I point to is calling the get_3d_sincos_pos_embed function with the grid-size as a parameter, calculated from hardcoded values assuming tubelet_size=1, and patch_size=16. In the __init__() function of this TemporalViTEncoder class, the get_3d_sincos_pos_embed function is also called, but this time referencing the grid size from the PatchEmbed instantiated class which is calculated with the actual values of tubelet_size and patch_size supplied to the model. This reference to a variable of this instantiated class is a bit obscure though, so I would probably either:

However, I don't fully understand the purpose of the get_3d_sincos_pos_embed call from the __init__() function, and the grid_size it uses is also based on the pretrain_img_size parameter that was supplied to the PatchEmbed class, while my intuition tells me this should actually be the current (i.e. fine-tuning) image size. The same intuition applies to the instantiation of the PatchEmbed class: why does that take in pretrain_img_size and not a fine-tuning image size?
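For concreteness, what I have in mind is a grid size derived from the values the model was actually instantiated with, rather than from pretrain_img_size and hardcoded patch_size=16 / tubelet_size=1. A hypothetical helper (not terratorch API) could look like this:

```python
# Hypothetical sketch: derive the (t, h, w) grid of the positional
# embedding from the instantiation-time parameters, instead of assuming
# patch_size=16 and tubelet_size=1.
def pos_embed_grid_size(img_size, patch_size, num_frames=1, tubelet_size=1):
    return (
        num_frames // tubelet_size,   # temporal grid
        img_size // patch_size,       # height grid
        img_size // patch_size,       # width grid
    )

print(pos_embed_grid_size(15, 3))  # (1, 5, 5) for my 15x15 / patch-size-3 config
```

Whether img_size here should be the pre-training or the fine-tuning image size is exactly the question I am raising above.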