GWUvision / Hotels-50K

Broken links #4

Open pvgladkov opened 5 years ago

pvgladkov commented 5 years ago

I'm seeing too many broken links in train_set.csv. Of 1,027,871 images, I was able to download only 565,002. I would like to use this dataset as a benchmark for comparing different approaches (including yours), but your evaluation method assumes that all images are present. Could you provide the full dataset?

abby621 commented 5 years ago

Expedia seems to be in the process of changing their URL formats. We are working through the broken images to locate updated URLs in the new format, and will post an updated train_set.csv as soon as it's ready. Apologies for the current broken images!

pvgladkov commented 5 years ago

Great! Thanks a lot!

bkj commented 5 years ago

Any updates on this? I'd like to download the dataset, but I'm hitting a large number of broken links as well.

Alternatively -- do you have a .tar.gz of the dataset that you'd be able to share?

Thanks! ~ Ben

av-savchenko commented 4 years ago

Thanks for gathering this dataset! However, the issue with broken URLs does not seem to be resolved yet. I successfully downloaded only 250,463 images. Do you have any updates? Is it possible to share all images, as suggested in the previous comment?

abby621 commented 4 years ago

Hi! For copyright reasons, we cannot release the specific images. We have been trying to determine if there is a new mapping for the broken images, but that does not seem to be the case. We will be releasing an updated dataset and report on results, and are working to see if we can get permission to share actual images rather than URLs.

Apologies for the delays; I got caught up in my first semester as a professor and this has taken longer for me to resolve than I had hoped/expected.

virginianegri commented 4 years ago

Hi! Are there any updates on this? Is there a projected date for the release of the updated dataset? I would like to use this as part of my master's thesis project. Thank you!!

Pyzow commented 4 years ago

+1, also curious about an update. Let me know if there's any way that I can assist.

abby621 commented 4 years ago

Hi all! Apologies for the delayed update.

The repository has been updated with valid, downloadable imagery (the specific updated files are the dataset files in input/dataset.tar.gz and the test image tarball, which has an updated link in the repository). Due to copyright issues, we still provide only links for the training imagery, which has to be downloaded (the download_train.py file has also been updated to support downloading the updated imagery). This means that there remains the possibility that the travel website imagery may move again in the future. We are working to see if we can work out a solution to this with the imagery providers, but in the meantime, we hope that we have a functional solution for the foreseeable future.
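For anyone who wants to check how complete their local copy is after running the download, here is a minimal sketch. It is not part of the repository: the function names `audit_downloads` and `coverage` are hypothetical, and it assumes you already have a list of the relative file paths you expect on disk (how you derive that list from the dataset files is up to you).

```python
from pathlib import Path
from typing import Iterable, List, Tuple


def audit_downloads(expected: Iterable[str], root: Path) -> Tuple[List[str], List[str]]:
    """Split expected relative paths into (present, missing) under root.

    Hypothetical helper, not part of Hotels-50K. A zero-byte file is
    counted as missing, since a failed download can leave an empty file.
    """
    present: List[str] = []
    missing: List[str] = []
    for rel in expected:
        path = root / rel
        if path.is_file() and path.stat().st_size > 0:
            present.append(rel)
        else:
            missing.append(rel)
    return present, missing


def coverage(present_count: int, total: int) -> float:
    """Fraction of expected images that are actually on disk."""
    return present_count / total if total else 0.0
```

As a reference point, the counts reported at the top of this thread (565,002 of 1,027,871) correspond to `coverage(565002, 1027871)`, or roughly 0.55.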

A small number of hotels from the original test set no longer had any valid gallery images (because none of their travel website images were still working). Those test images have been deleted from the test set. There were also a few hundred training hotels that no longer had valid imagery. We have replaced those with new classes, leaving the number of classes in the gallery at 50,000.

I will be posting updated retrieval and classification results in the coming weeks. My hypothesis is that they won't be hugely different from those reported in the paper, but we will make sure to include the results in the repository, both for the method described in the Hotels-50K AAAI paper and for the new state-of-the-art approach using Easy Positive Triplet Mining (presented at WACV 2020, https://arxiv.org/abs/1904.04370).