According to the Supplementary Materials, the feature map output by the Extractor, to which the positional embedding is added, has a channel dimension of 64. However, Subsection 4.1 of the main paper states that the dimension 'd' should be divisible by 3, because the positional encodings of the three dimensions are concatenated to form the final 'd'-channel positional encoding. 64 is not divisible by 3.
So how is the spatial-temporal positional encoding actually implemented? I look forward to your reply as soon as possible.
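
To make the question concrete, here is a minimal sketch of one workaround I can imagine. It is only my guess, not something stated in the paper or the Supplementary Materials: give each of the three axes ceil(d / 3) channels (rounded up to an even number so sine and cosine pair up), concatenate them, and drop the surplus channels. The function names `sinusoid_1d` and `spatiotemporal_pe` and the base 10000 are my own assumptions.

```python
import math
import numpy as np

def sinusoid_1d(length, channels):
    """Sine/cosine encoding for one axis; `channels` must be even."""
    pos = np.arange(length, dtype=np.float64)[:, None]                # (length, 1)
    freq = 1.0 / (10000.0 ** (np.arange(0, channels, 2) / channels))  # (channels/2,)
    angles = pos * freq[None, :]                                      # (length, channels/2)
    # First half of the channels is sine, second half is cosine.
    return np.concatenate([np.sin(angles), np.cos(angles)], axis=1)

def spatiotemporal_pe(t, h, w, d):
    """Hypothetical (t, h, w, d) encoding that works for any d, e.g. d = 64."""
    # Assumption: per-axis width is ceil(d / 3), rounded up to an even number.
    c = math.ceil(d / 3)
    c += c % 2
    pe_t = np.broadcast_to(sinusoid_1d(t, c)[:, None, None, :], (t, h, w, c))
    pe_h = np.broadcast_to(sinusoid_1d(h, c)[None, :, None, :], (t, h, w, c))
    pe_w = np.broadcast_to(sinusoid_1d(w, c)[None, None, :, :], (t, h, w, c))
    pe = np.concatenate([pe_t, pe_h, pe_w], axis=-1)                  # (t, h, w, 3 * c)
    # Assumption: surplus channels are simply truncated (66 -> 64 when d = 64).
    return pe[..., :d]

pe = spatiotemporal_pe(t=2, h=3, w=4, d=64)
print(pe.shape)
```

With d = 64 this gives 22 channels per axis, 66 after concatenation, and the last 2 channels of the width encoding are discarded. Is this close to what the released model does, or is the split uneven (for example 22 + 21 + 21)?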