**Closed** · Foxigod closed this issue 1 month ago
@CarlosGomes98 I would like to know the references used to implement this transform, but I understand that here:

```python
if embed_dim % 16 != 0:
    msg = "Embed dim must be divisible by 16"
    raise Exception(msg)
```

we should check divisibility by `patch_size` instead. The same applies here:

```python
w_embed_dim = embed_dim // 16 * 6
h_embed_dim = embed_dim // 16 * 6
t_embed_dim = embed_dim // 16 * 4
```

But I'm not sure about the other hard-coded integer values.
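For context, the divisible-by-16 check plausibly belongs to the 6/16 + 6/16 + 4/16 split of the embedding across the width, height, and time axes, rather than to `patch_size`. A minimal sketch (the function name is mine, not terratorch's):

```python
def split_embed_dim(embed_dim: int) -> tuple[int, int, int]:
    """Split embed_dim across width, height and time sin-cos embeddings.

    The 6/16 + 6/16 + 4/16 allocation is why embed_dim must be divisible
    by 16 here; it is independent of the spatial patch_size.
    """
    if embed_dim % 16 != 0:
        raise ValueError("Embed dim must be divisible by 16")
    w_embed_dim = embed_dim // 16 * 6
    h_embed_dim = embed_dim // 16 * 6
    t_embed_dim = embed_dim // 16 * 4
    # The three parts always sum back to the full embedding dimension.
    assert w_embed_dim + h_embed_dim + t_embed_dim == embed_dim
    return w_embed_dim, h_embed_dim, t_embed_dim

print(split_embed_dim(768))  # -> (288, 288, 192)
```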
I was under the impression that it was precisely *not* this particular segment of the code that is causing these issues. Correct me if I'm wrong, but this segment seems to divide the size of the embedding between the width-spatial embedding, the height-spatial embedding, and the temporal embedding, in a manner that requires it to be divisible by 16.

The segment of code I point to calls the `get_3d_sincos_pos_embed` function with the grid size as a parameter, calculated from hardcoded values assuming `tubelet_size=1` and `patch_size=16`. In the `__init__()` function of this `TemporalViTEncoder` class, the `get_3d_sincos_pos_embed` function is also called, but this time referencing the grid size from the instantiated `PatchEmbed` class, which is calculated with the actual values of `tubelet_size` and `patch_size` supplied to the model. This reference to a variable of the instantiated class is a bit obscure, though, so I would probably store it as a `TemporalViTEncoder` instance variable instead of relying on the instance variable from the `PatchEmbed` class.

However, I don't fully understand the purpose of the `get_3d_sincos_pos_embed` call from the `__init__()` function, and the `grid_size` it uses is also based on the `pretrain_img_size` parameter that was supplied to the `PatchEmbed` class, while my intuition tells me this should actually be the current (i.e. fine-tuning) image size. The same intuition applies to the instantiation of the `PatchEmbed` class: why does that take in the `pretrain_img_size` and not a fine-tuning image size?
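To make the grid-size point concrete, here is a hedged sketch (the function name and signature are my assumptions, not the actual terratorch API) of deriving the positional-embedding grid from the configured `patch_size` and `tubelet_size` rather than from hardcoded `16` and `1`:

```python
def pos_embed_grid_size(img_size: int, num_frames: int,
                        patch_size: int, tubelet_size: int) -> tuple[int, int, int]:
    """Return the (time, height, width) token grid for the pos. embedding.

    Using the values the model was actually instantiated with keeps the
    grid consistent with PatchEmbed's output during fine-tuning.
    """
    return (num_frames // tubelet_size,
            img_size // patch_size,
            img_size // patch_size)

# Hardcoded assumption (patch 16, tubelet 1) vs. a different actual config:
print(pos_embed_grid_size(224, 3, 16, 1))  # -> (3, 14, 14)
print(pos_embed_grid_size(224, 3, 8, 3))   # -> (1, 28, 28)
```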
**Describe the issue**
I'm encountering issues during fine-tuning that seem to stem from the patch size being hardcoded as 16×16 in the ViT backbone. While there are multiple cases of this in the `ViT_encoder_decoder.py` file, one specific location is lines 438-440. I believe these hardcoded values of 16 should actually reference the patch size that the model was instantiated with.
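As a hedged illustration of the proposed change (the stub class and helper below are hypothetical, not the actual terratorch code): instead of a literal `16`, reference the patch size held by the instantiated patch-embedding module:

```python
class PatchEmbedStub:
    """Minimal stand-in (hypothetical) for the real PatchEmbed class."""
    def __init__(self, patch_size: int):
        self.patch_size = (patch_size, patch_size)

def tokens_per_side(h: int, patch_embed: PatchEmbedStub) -> int:
    # Reference the instantiated patch size rather than a hardcoded 16,
    # so fine-tuning with patch_size != 16 yields a consistent token grid.
    return h // patch_embed.patch_size[0]

print(tokens_per_side(224, PatchEmbedStub(16)))  # -> 14
print(tokens_per_side(224, PatchEmbedStub(8)))   # -> 28
```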
**Reproduce the issue**
**Deployment information (optional)**
I installed terratorch by cloning the repo and running `pip3 install -e <cloned_directory>`. If I run `git rev-parse HEAD` I get the following output, which is this commit.