Thank you for your work and for sharing it. The visual features used for R2R are extracted by ViT-clip, but the ones used for REVERIE seem to be extracted by ViT. Could you please share the rvr_best checkpoint that is trained on ViT-clip? Thank you either way.