yangzhip closed this issue 3 years ago
Hello. I'm not the author of the original paper. I didn't see the original code either. Thus, I don't have CV datasets and don't know how they preprocess the data.
My data is from the remote sensing domain (RSICD and UCM datasets):
- BERT (bert-base-uncased, available on Hugging Face) for caption feature encoding. I use the sum over the last 4 hidden states -> (768,) text feature vectors
- ResNet18 (pretrained on ImageNet, available in torchvision) for image feature encoding. The last classification layer is removed -> (512,) image feature vectors
My code for the data preparation will be available later in another repository.
Thanks for the answer. Are the experimental results of your code consistent with the original paper?
I didn't test it on the original CV data. But for my datasets the performance was good.
OK, thank you a lot!
Hello, could you upload the other datasets and explain how the datasets were processed?