farewellthree / STAN

Official PyTorch implementation of the paper "Revisiting Temporal Modeling for CLIP-based Image-to-Video Knowledge Transferring"
Apache License 2.0

Training time and GPU memory usage #3

Closed laisimiao closed 1 year ago

laisimiao commented 1 year ago

Could I ask about the training time of STAN-self-B/16 in your paper? I was astonished by the frame number (12) and batch size (128), which means one forward pass has to process 1536 images. Since each image is also split into patches, that is a huge sequence length for the self-attention operator in either the temporal or spatial dimension. Doesn't it consume a lot of GPU memory?
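For reference, here is a back-of-the-envelope sketch of the token count implied by these numbers, assuming the standard CLIP ViT-B/16 configuration (224x224 input, 16x16 patches, one [CLS] token per image); the function name is illustrative, not from the STAN codebase:

```python
def tokens_per_forward(batch_size, num_frames, image_size=224, patch_size=16):
    """Estimate images and patch tokens processed in one forward pass."""
    patches_per_image = (image_size // patch_size) ** 2  # 14 * 14 = 196
    tokens_per_image = patches_per_image + 1             # +1 for the [CLS] token
    images = batch_size * num_frames
    return images, images * tokens_per_image

images, tokens = tokens_per_forward(batch_size=128, num_frames=12)
print(images)  # 1536 images per forward pass
print(tokens)  # 302592 tokens in total
```

Spatial attention only attends within each image's 197 tokens, and temporal attention within each spatial location's 12 frames, so no single attention call sees all 302k tokens at once; the cost still scales with the 1536-image batch.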

farewellthree commented 1 year ago

The training of CLIP-B/16 is indeed expensive, because contrastive learning is sensitive to batch size and 128 cannot be compromised. On 16 A100 GPUs, it takes 12 hours to train on MSRVTT and 4 hours on DiDemo.