ChenDelong1999 / RemoteCLIP

🛰️ Official repository of paper "RemoteCLIP: A Vision Language Foundation Model for Remote Sensing" (IEEE TGRS)
https://arxiv.org/abs/2306.11029
Apache License 2.0

data leakage issue of RSICD and RSITMD #26

Open YiguoHe opened 3 months ago

YiguoHe commented 3 months ago

The RSITMD and RSICD datasets have a data leakage issue: they may share some images and descriptions. How should this be handled properly?

gzqy1026 commented 2 months ago

To check whether the two datasets contain duplicates, you can compute a hash value for each image and measure the distance between the hashes of two images. If the distance is below a chosen threshold, the pair is treated as a duplicate. It is recommended to manually check the images that the code flags as duplicates, to avoid filtering out images that are not actually duplicates.
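As a rough illustration of the hash-distance idea above, here is a minimal sketch. The comment does not say which hash to use, so this assumes a simple average hash compared by Hamming distance; the function names, the threshold of 5, and the `{name: pixels}` input format are all hypothetical, and this is not necessarily how RemoteCLIP itself deduplicates. Images are passed in as 2D grids of grayscale values; decoding files into such grids (for example with an image library) is left out.

```python
from typing import Dict, List, Sequence, Tuple

# A grayscale image as a 2D grid (rows x columns) of pixel intensities.
Gray = Sequence[Sequence[float]]


def average_hash(pixels: Gray, hash_size: int = 8) -> int:
    """Block-average the image down to hash_size x hash_size cells,
    then set one bit per cell: 1 if the cell is brighter than the mean."""
    height, width = len(pixels), len(pixels[0])
    if height < hash_size or width < hash_size:
        raise ValueError("image is smaller than the hash grid")
    cells: List[float] = []
    for i in range(hash_size):
        r0, r1 = i * height // hash_size, (i + 1) * height // hash_size
        for j in range(hash_size):
            c0, c1 = j * width // hash_size, (j + 1) * width // hash_size
            block = [pixels[r][c] for r in range(r0, r1) for c in range(c0, c1)]
            cells.append(sum(block) / len(block))
    mean = sum(cells) / len(cells)
    bits = 0
    for value in cells:
        bits = (bits << 1) | (1 if value > mean else 0)
    return bits


def hamming_distance(a: int, b: int) -> int:
    """Number of bit positions in which two hashes differ."""
    return bin(a ^ b).count("1")


def find_candidate_duplicates(
    train: Dict[str, Gray],
    test: Dict[str, Gray],
    threshold: int = 5,  # assumed value; tune it on your own data
) -> List[Tuple[str, str, int]]:
    """Return (train_name, test_name, distance) for every cross-dataset pair
    whose hash distance is at most `threshold`, closest pairs first.
    These are only candidates and should still be reviewed by hand."""
    train_hashes = {name: average_hash(p) for name, p in train.items()}
    test_hashes = {name: average_hash(p) for name, p in test.items()}
    pairs = [
        (train_name, test_name, hamming_distance(h_train, h_test))
        for train_name, h_train in train_hashes.items()
        for test_name, h_test in test_hashes.items()
        if hamming_distance(h_train, h_test) <= threshold
    ]
    return sorted(pairs, key=lambda pair: pair[2])
```

Calling `find_candidate_duplicates(rsicd_images, rsitmd_images)` (both hypothetical dicts) returns a list of suspect pairs sorted by distance, which makes the manual check easier: inspect the closest pairs first and only remove the ones that really are the same image. The comparison is pairwise, so for large datasets you may want to bucket or index the hashes first.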

YiguoHe commented 1 month ago

Thank you for your response. Your work is excellent. Best wishes!