dandelin / ViLT

Code for the ICML 2021 (long talk) paper: "ViLT: Vision-and-Language Transformer Without Convolution or Region Supervision"
Apache License 2.0

FLOPS calculation #23


junchen14 commented 2 years ago

Hi, when you compute the FLOPs in Table 6 for baseline models such as ViLBERT, do you also include the FLOPs of the feature-extraction models?

dandelin commented 2 years ago

Hi @junchen14,

Yes. For object-detection-based vision-and-language models, we calculated FLOPs by summing those of the object detection backbone, the object detection RCNN head, NMS, and the modality interaction transformer.
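The accounting described above can be sketched as a simple sum over per-component costs. This is an illustrative sketch, not the paper's measurement code; the component names follow the answer, and the GFLOPs numbers below are placeholders, not figures from the paper.

```python
# Sketch of the FLOPs accounting for a detection-based V&L model.
# Values are illustrative placeholders, NOT measurements from the paper.

def total_flops(components: dict) -> float:
    """Sum per-component FLOPs (here in GFLOPs) into one total."""
    return sum(components.values())

# Hypothetical per-component costs (GFLOPs), for illustration only.
example = {
    "detection_backbone": 100.0,        # CNN trunk of the detector
    "detection_rcnn": 500.0,            # region proposals + RoI heads
    "nms": 1.0,                         # non-maximum suppression
    "interaction_transformer": 300.0,   # modality interaction transformer
}

print(total_flops(example))  # sums all four terms: 901.0
```

For a detector-free model like ViLT, only the transformer term would remain, which is what makes its total so much smaller in the paper's comparison.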