dandelin / ViLT

Code for the ICML 2021 (long talk) paper: "ViLT: Vision-and-Language Transformer Without Convolution or Region Supervision"
Apache License 2.0

inference image captioning #32

Open trucvip123 opened 2 years ago

trucvip123 commented 2 years ago

Could someone provide a demo of image captioning with ViLT? Please! I'm a newbie in the NLP field <3

dandelin commented 2 years ago

Hi @trucvip123,

Although ViLT has not been fine-tuned for captioning, you can emulate captioning by passing a text query made up only of mask tokens, [MASK] [MASK] [MASK] ... [MASK] [MASK] (that is, [MASK] repeated for your desired caption length), to the MLM demo.
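For illustration, here is a minimal sketch of how such an all-mask query could be built. The helper name `build_mask_query` and its parameters are hypothetical and not part of the ViLT codebase; the sketch only constructs the text string and does not run the model.

```python
def build_mask_query(length: int, mask_token: str = "[MASK]") -> str:
    """Return `length` mask tokens separated by single spaces.

    Hypothetical helper: `length` is the desired caption length in tokens,
    and `mask_token` is assumed to be the BERT-style "[MASK]" string.
    """
    if length < 1:
        raise ValueError("length must be at least 1")
    return " ".join([mask_token] * length)


if __name__ == "__main__":
    # Build a 5-token placeholder caption to use as the MLM text query.
    print(build_mask_query(5))
```

The resulting string would go in as the text input of the MLM demo alongside your image. Since the model was not trained for captioning, the filled-in tokens may well be rough or repetitive.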

trucvip123 commented 2 years ago

Thank you @dandelin