Open Leon1207 opened 8 months ago
Dear authors, I have some questions about the lightweight caption head you proposed! How does the lightweight caption head differ from existing captioning models in terms of architecture and computational efficiency, such that it qualifies as a "lightweight design"? Hoping for your reply.

Nowadays, researchers are turning to large language models for image captioning. In contrast, we adopt a "light-weight" design for our caption head, which makes set-to-set training possible.

Thank you very much for your reply! Your explanation makes perfect sense. On a related note, since methods like 3DJCG and D3Net don't use large models either, could they also be considered lightweight?

As long as they contain a small number of parameters, you can also call them "light-weight".

Thanks!
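As a rough, hypothetical illustration of the parameter-count gap behind the "light-weight" label: a small transformer-decoder caption head sits in the millions of parameters, while a large language model sits in the billions. The sizes below (2 layers, `d_model=256`, `d_ff=1024`, a 7B-parameter LLM baseline) are assumed for illustration only, not the actual configuration of any of the papers discussed here.

```python
def decoder_layer_params(d_model: int, d_ff: int) -> int:
    """Approximate parameter count of one transformer decoder layer."""
    attn = 4 * (d_model * d_model + d_model)   # Q, K, V, output projections
    self_attn = attn
    cross_attn = attn                          # attends to visual/scene features
    ffn = 2 * d_model * d_ff + d_ff + d_model  # two linear layers with biases
    norms = 3 * 2 * d_model                    # three layer norms (scale + shift)
    return self_attn + cross_attn + ffn + norms

# Assumed toy configuration: 2 decoder layers, d_model=256, d_ff=1024.
head_params = 2 * decoder_layer_params(256, 1024)
llm_params = 7_000_000_000                     # e.g. a 7B-parameter LLM

print(f"caption head: ~{head_params / 1e6:.1f}M parameters")
print(f"LLM baseline: ~{llm_params / 1e9:.0f}B parameters")
print(f"ratio: ~{llm_params / head_params:.0f}x")
```

Under these assumptions the head is roughly three orders of magnitude smaller than the LLM, which is the sense in which a "small number of parameters" makes a design light-weight.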