The captions in the GRIT dataset consist solely of nouns such as berries, person ...
Did you use templates to expand the captions, for example "a photo of a xxx"?
I believe GRIT has referring expressions in context rather than solely nouns, as in the example below. We use their original referring expressions without any template.
Thank you for your work.