Hi,
Could you please give more information about Figure 4 in the paper?
As I understand it, for controllability through a sequence of detections, you choose regions based on the ground-truth captions in the dataset. In the experiment shown in Figure 4, how do you choose the set of regions for an image?