Estimating the parameters of the VRNN from multiple datasets could be useful: the more image--description pairs the model sees during training, the more reliable its parameter estimates should be.
In particular, the IAPR-TC12 dataset provides only 16,000 image--description pairs. This is fewer than the Flickr8K training data, where 6,000 training images with five descriptions each yield 30,000 image--description training instances.