How to pre-train matrix Wi by CLIP to obtain image embedding?

UARK-AICV / VLCAP

[ICIP 2022] VLCap: Vision-Language with Contrastive Learning for Coherent Video Paragraph Captioning

https://ieeexplore.ieee.org/document/9897766


Closed: gdg452 closed this issue 1 year ago

gdg452 commented 1 year ago

I'm sorry, I don't quite understand this part. Can you elaborate? Thank you!

gdg452 commented 1 year ago

Sorry, I misunderstood.