Hi,
I wonder: if we feed the network the whole image and use the convolutional sliding-window technique (so that instead of outputting 1 * 128 it outputs n * 128, where n is the number of patches in the image, given a patch size of 32 * 32) and then retrain the network, will that improve the model's performance in image description, or will it just be harder to train?
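To make the shapes concrete, here is a minimal sketch of the output size I have in mind. It assumes no padding, and the `stride` parameter and the function name are only illustrative assumptions on my side (stride 32 gives non-overlapping patches, stride 1 gives a dense sliding window):

```python
def sliding_window_output_shape(height, width, patch=32, stride=32, dim=128):
    """Return (n, dim): n patch positions, each mapped to a dim-sized descriptor.

    Assumes no padding. `stride` and the default values are illustrative
    assumptions, not settings taken from any particular network.
    """
    if height < patch or width < patch:
        raise ValueError("image must be at least one patch in size")
    rows = (height - patch) // stride + 1  # window positions along the height
    cols = (width - patch) // stride + 1   # window positions along the width
    return rows * cols, dim


# A single 32 * 32 patch gives the usual 1 * 128 output.
print(sliding_window_output_shape(32, 32))
# A larger image gives n * 128, with n depending on the stride.
print(sliding_window_output_shape(64, 96))
print(sliding_window_output_shape(64, 64, stride=1))
```

So n grows quickly with a small stride, which is part of why I am unsure about the training cost.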
Thanks in advance