Closed zhaoedf closed 2 years ago
Hi, thanks for your interest in our work.
We don't have the split information right away since the data handling pipeline is wrapped inside the tfds library. We will need to update the vtab dataset folder and the TFDataset class to generate filenames for the train/val splits of the 19 VTAB-1k datasets.
We plan to do this soon and release the split info. Of course, we welcome any help to accelerate the process!
Thanks!
Hi @zhaoedf , we just uploaded the vtab train split info to the vtab data release Google Drive/Dropbox. In the file vtab_trainval_splits.json, for each dataset, you can find the filenames of the randomly selected 1k training examples used in our experiments. We got them by extracting the 'filename' attribute from the tensorflow dataset feature dict.
Unfortunately, because there’s no such info for dsprite, smallnorb and svhn in the tensorflow dataset format, we cannot provide the splits for these 3 datasets.
Feel free to let us know if there’s anything else we can help with!
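For reference, a minimal sketch of how such filenames can be pulled out of tfds (the `"file_name"` key below is an assumption; the exact key varies per dataset, and the three datasets above expose none):

```python
import tensorflow_datasets as tfds

def list_filenames(dataset_name, split="train", limit=1000):
    """Return filenames of the first `limit` examples, if tfds exposes them."""
    ds, info = tfds.load(dataset_name, split=split, with_info=True)
    # "file_name" is an assumed key; check info.features for the real one.
    if "file_name" not in info.features.keys():
        raise KeyError(f"{dataset_name} has no filename feature in tfds")
    return [ex["file_name"].decode("utf-8")
            for ex in tfds.as_numpy(ds.take(limit))]
```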
What do you mean by "no such info for dsprite, smallnorb and svhn"? As far as I can see, when you run the official vtab-1k repo scripts, they automatically download all datasets and generate the 1k training images and the test sets, so how did you conduct your experiments?
I managed to get access to GCS and downloaded multiple datasets. However, due to checksum errors in tensorflow_datasets (tfds), I cannot generate the 1k images for sun397, patch_camelyon, clevr, diabetic_retinopathy and dtd. It would be convenient if you could upload the 1k images for these 5 datasets to the web (any drive is ok)?
We released the image_ids or filenames of the vtab dataset splits (train and val), as per your original request. In the process of obtaining that info, we found that tensorflow_datasets does not contain filenames for these 3 datasets: dsprite, smallnorb and svhn, so we were not able to retrieve the filenames of their training images. And your understanding of the code is correct: it automatically downloads and prepares the datasets. However, the data originally comes as tf tensors, and we wrote another class to convert them to torch tensors.
I am really sorry about the inconvenience. We also thought about directly uploading the pre-processed datasets when releasing the code. However, per the company’s legal requirements, we are not allowed to redistribute any third parties’ data. Hope you can understand.
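For anyone wondering what that conversion looks like, here is a rough, hypothetical sketch (not the actual class from this repo) of materializing a tfds split as torch tensors:

```python
import tensorflow_datasets as tfds
import torch
from torch.utils.data import Dataset

class TFDSAsTorch(Dataset):
    """Materialize a (small) tfds split in memory as torch tensors."""

    def __init__(self, name, split):
        ds = tfds.load(name, split=split)
        self.examples = [
            (torch.from_numpy(ex["image"]), int(ex["label"]))
            for ex in tfds.as_numpy(ds)  # numpy HWC uint8 images + int labels
        ]

    def __len__(self):
        return len(self.examples)

    def __getitem__(self, idx):
        return self.examples[idx]
```

Since the VTAB-1k train/val splits are only 800/200 images per dataset, keeping them in memory like this is cheap; larger test splits would need a streaming approach instead.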
Closing this for now. Feel free to re-open it if you need more help.
@zhaoedf so what are the error messages you got for the 5 datasets (sun397, patch_camelyon, clevr, diabetic_retinopathy and dtd) you cannot generate? We have already released the 1k train/val filenames for these datasets, so one possible solution might be that you download the data manually and split it using our info. Let us know if that could work for you; see the sketch below.
For the 3 datasets (dsprite, smallnorb and svhn) I mentioned in the previous post, I think perhaps you misunderstood my words. They can be downloaded and pre-processed for training and evaluation using our code without any problem. The issue is that their TFDS feature dicts only contain the image tensor but no filenames, so we are unable to retrieve the filenames for you either.
And sorry to hear that TFDS doesn't work properly on your end, but it looks like there's nothing else we can do to help with that, except for releasing the training image filenames (when they are available in tfds).
Feel free to let us know if there's anything else we can help with.
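A minimal sketch of the matching step could look like the following; note that the JSON key layout and the paths here are assumptions for illustration, not the release's actual format:

```python
import json
import shutil
from pathlib import Path

# Assumed layout: dataset name -> split name -> list of relative filenames.
# Adjust the keys to however vtab_trainval_splits.json is actually structured.
with open("vtab_trainval_splits.json") as f:
    splits = json.load(f)

raw_dir = Path("downloads/dtd/images")   # wherever you extracted the raw dataset
out_dir = Path("vtab-1k/dtd")

for split_name in ("train", "val"):
    for fname in splits["dtd"][split_name]:
        target = out_dir / split_name / fname
        target.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy(raw_dir / fname, target)
```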
@zhaoedf I also recommend reading through the tips and notes in VTAB_SETUP.md, since it also took us a whole weekend to set up the full vtab datasets. All the lessons we learned are included in that doc. Maybe it will help you in the data preparation process!
For both of you: the problem remains.
For @Tsingularity and @KMnP, let's use sun397 as an example. In tfds, sun397 has a test split with only 21750 samples, while the original sun397 partition files (i.e. Testing_0x.txt) have far more than 21750 samples, and the split info you provided only covers train and val, which is not enough. It would be helpful if you could provide the test splits as well (I only need clevr, sun397 and patch_camelyon).
Update: only the sun397 test split is needed.
Hi, for the vtab-1k benchmark, we need to use the tensorflow api to get the exact dataset splits, which is quite hard for people in mainland China.
I was wondering if you could upload the split txt files to this repo? Thanks!
Hi, I can't use the tfds API either. I don't know how to process the offline-downloaded data so that the program will run. Could you give me some guidance? Many thanks!
Hi, I also ran into problems with downloading and loading the data. Have you solved them?