Question about image-to-text training

AILab-CVC / SEED

Official implementation of SEED-LLaMA (ICLR 2024).

https://ailab-cvc.github.io/seed


Closed by zzchust 1 year ago

zzchust commented 1 year ago

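1. Image -> text: is the input here the image's token IDs or vectors? From what I read, the image representation is mapped through an FC layer and then concatenated with the text embedding?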

geyuying commented 1 year ago

The input is the visual codes (vectors) that the image's token IDs correspond to in the visual codebook, followed by an FC layer (so that their dimension is aligned with the word embedding).
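
The input path described above can be sketched roughly as follows. This is an illustrative NumPy sketch, not the repository's code: all sizes, the random weights, and the names `visual_codebook`, `fc_weight`, `fc_bias`, `word_embedding`, `embed_image_tokens`, and `build_image_to_text_input` are hypothetical, and placing the image embeddings before the text embeddings is an assumption made for the image-to-text case.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes, chosen only for illustration (not taken from the SEED code).
CODEBOOK_SIZE = 8192   # number of discrete visual codes
CODE_DIM = 32          # dimension of each codebook vector
VOCAB_SIZE = 32000     # text vocabulary size
HIDDEN_DIM = 64        # word-embedding dimension of the language model

# Randomly initialised stand-ins for the learned parameters.
visual_codebook = rng.normal(size=(CODEBOOK_SIZE, CODE_DIM))
fc_weight = rng.normal(size=(CODE_DIM, HIDDEN_DIM)) * 0.02
fc_bias = np.zeros(HIDDEN_DIM)
word_embedding = rng.normal(size=(VOCAB_SIZE, HIDDEN_DIM))


def embed_image_tokens(image_token_ids):
    """Look up each token ID's visual code, then project it with the FC layer."""
    codes = visual_codebook[image_token_ids]   # (num_image_tokens, CODE_DIM)
    return codes @ fc_weight + fc_bias         # (num_image_tokens, HIDDEN_DIM)


def build_image_to_text_input(image_token_ids, text_token_ids):
    """Concatenate projected visual codes with ordinary word embeddings."""
    image_embeds = embed_image_tokens(image_token_ids)
    text_embeds = word_embedding[text_token_ids]  # (num_text_tokens, HIDDEN_DIM)
    # Assumed ordering: image first, then the text to be predicted.
    return np.concatenate([image_embeds, text_embeds], axis=0)


image_token_ids = np.array([5, 17, 4095, 8191])
text_token_ids = np.array([1, 42, 7])
inputs = build_image_to_text_input(image_token_ids, text_token_ids)
print(inputs.shape)  # (7, 64)
```

The point of the sketch is that the model never consumes the integer IDs directly on the image side: the IDs only index the codebook, and the FC layer exists so that the resulting vectors share the word-embedding dimension and can sit in the same sequence as the text.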