Could I ask about the STAN-self-B/16 training time reported in your paper?
I was surprised by the combination of 12 frames and batch size 128: one forward pass has to process 1536 images, and each image is further split into patches, giving a very long sequence for the self-attention operator in both the temporal and spatial dimensions. Doesn't this consume a lot of GPU memory?
Training CLIP-B/16 is indeed expensive, because contrastive learning is sensitive to batch size, so 128 is non-negotiable. On 16 A100 GPUs, it takes 12 hours to train on MSR-VTT and 4 hours on DiDeMo.
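The sequence-length arithmetic behind the question can be sketched as follows. This is a rough back-of-envelope calculation, assuming standard CLIP ViT-B/16 preprocessing (224x224 input, 16x16 patches); the frame count and batch size are the ones discussed above, and the factorized space-time attention figures are illustrative, not a description of STAN's exact implementation.

```python
# Rough sketch of the attention sequence lengths involved (assumed
# CLIP ViT-B/16 settings: 224x224 input, 16x16 patches).
frames = 12
batch = 128
image_size = 224
patch_size = 16

images_per_forward = frames * batch                   # 1536 images per forward pass
patches_per_frame = (image_size // patch_size) ** 2   # 196 patches per frame

# If space and time attention are factorized, each self-attention call
# stays modest: spatial attention over ~197 tokens (patches + CLS),
# temporal attention over 12 tokens.
spatial_seq_len = patches_per_frame + 1   # 197
temporal_seq_len = frames                 # 12

# Joint space-time attention over all frames at once would be far longer:
joint_seq_len = frames * (patches_per_frame + 1)      # 2364

print(images_per_forward, spatial_seq_len, temporal_seq_len, joint_seq_len)
```

So while 1536 images per forward pass is large, the per-call attention cost depends heavily on whether spatial and temporal attention are applied separately or jointly.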