Francis-Rings / ILA


Not aligned with the paper #5

Open GuozhenZhang1999 opened 1 year ago

GuozhenZhang1999 commented 1 year ago

I observed that the code uses the X-CLIP structure to obtain video features, which is inconsistent with the average pooling described in the paper. Could you explain this discrepancy?
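For readers following the thread, the average pooling mentioned here can be sketched as below. This is only an illustration of mean pooling over per-frame features; the function name `average_pool_frames` and the toy 2-dimensional features are assumptions, not code from the ILA or X-CLIP repositories.

```python
from typing import List


def average_pool_frames(frame_features: List[List[float]]) -> List[float]:
    """Average T per-frame feature vectors (T x D) into one video feature (D)."""
    if not frame_features:
        raise ValueError("need at least one frame")
    num_frames = len(frame_features)
    dim = len(frame_features[0])
    if any(len(frame) != dim for frame in frame_features):
        raise ValueError("all frames must share the same feature dimension")
    # Mean over the temporal axis, one output value per feature dimension.
    return [
        sum(frame[d] for frame in frame_features) / num_frames
        for d in range(dim)
    ]


# Toy input: 3 frames, 2 feature dimensions (hypothetical values).
frames = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
video_feature = average_pool_frames(frames)
print(video_feature)  # [3.0, 4.0]
```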

Francis-Rings commented 11 months ago

Thanks for your question! The prediction block, which is one of the ILA components, consists of convolution and pooling operations. In each CLIP block, the aligned tokens can either be concatenated with the original tokens or added to them. We find that addition may be more stable during training. The choice between these two operations does not have a significant impact on model performance.
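A minimal sketch of the two fusion options described in this reply, assuming tokens are plain lists of feature vectors. The names `fuse_by_addition` and `fuse_by_pool_concat` are hypothetical, and mean pooling stands in for the convolution and pooling of the actual prediction block, so this is not the repository's implementation.

```python
from typing import List

Tokens = List[List[float]]  # N tokens, each a D-dimensional vector


def mean_pool(tokens: Tokens) -> List[float]:
    """Pool N tokens into a single token by averaging each dimension."""
    if not tokens:
        raise ValueError("need at least one token")
    n = len(tokens)
    return [sum(tok[d] for tok in tokens) / n for d in range(len(tokens[0]))]


def fuse_by_addition(original: Tokens, aligned: Tokens) -> Tokens:
    """Element-wise addition: output keeps the original shape (N x D)."""
    if len(original) != len(aligned):
        raise ValueError("addition requires the same number of tokens")
    return [
        [o + a for o, a in zip(o_tok, a_tok)]
        for o_tok, a_tok in zip(original, aligned)
    ]


def fuse_by_pool_concat(original: Tokens, aligned: Tokens) -> Tokens:
    """Pool the aligned tokens to one token and append it: (N + 1) x D."""
    return original + [mean_pool(aligned)]


# Toy tokens (hypothetical values): 2 tokens of dimension 2.
original = [[1.0, 2.0], [3.0, 4.0]]
aligned = [[0.5, 0.5], [1.5, 1.5]]

print(fuse_by_addition(original, aligned))     # [[1.5, 2.5], [4.5, 5.5]]
print(fuse_by_pool_concat(original, aligned))  # [[1.0, 2.0], [3.0, 4.0], [1.0, 1.0]]
```

Under these assumptions, addition leaves the token count unchanged, while pool and concat adds one extra token to the sequence passed to the next layer.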

rbsohee commented 10 months ago

I have a question related to the issue at hand. In your paper, 'Table 9' indicates that element-wise addition performs less effectively than pool & concat. However, in the current discussion, it's mentioned that these two operations don't significantly affect performance. Could you clarify whether the element-wise addition implementation in the paper differs from the current code implementation?