Permission Error on HumanOrWorm Dataset

ML-Bioinfo-CEITEC / genomic_benchmarks

Benchmarks for classification of genomic sequences

Apache License 2.0


Permission Error on HumanOrWorm Dataset #29

Open tomginsberg opened 2 years ago

tomginsberg commented 2 years ago

This should be a simple fix: change the sharing permission of the HumanOrWorm file to "Anyone with the link".

from genomic_benchmarks.dataset_getters.pytorch_datasets import DemoHumanOrWorm
dset = DemoHumanOrWorm(split='train')

Cannot retrieve the public link of the file. You may need to change
    the permission to 'Anyone with the link', or have had many accesses. 

You may still be able to access the file from the browser:

     https://drive.google.com/uc?id=1JW0-eTB-rJXvFcglqBo3pFZi1kyIWC3X

katarinagresova commented 2 years ago

Dear @tomginsberg, thanks for the issue. We are working on it, but unfortunately it is not as simple as updating the permissions. As a temporary workaround, you can use the following code to obtain the data:

from genomic_benchmarks.loc2seq import download_dataset
download_dataset('demo_human_or_worm', version=0, use_cloud_cache=False)

This downloads the human and worm genomes and builds the dataset on your disk. Afterward, you can use your original code to load the dataset from disk:

from genomic_benchmarks.dataset_getters.pytorch_datasets import DemoHumanOrWorm
dset = DemoHumanOrWorm(split='train')
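The two steps above can be combined into a single fallback: try the normal load first and build the dataset locally only if that fails. This is a minimal sketch, not part of genomic_benchmarks. The helper names first_successful and load_demo_human_or_worm_train are hypothetical, and the sketch assumes that a failed cloud download surfaces as a Python exception, which this thread does not confirm.

```python
from typing import Callable, Optional, Sequence


def first_successful(loaders: Sequence[Callable[[], object]]) -> object:
    """Call each loader in order and return the first result that does not raise."""
    last_error: Optional[Exception] = None
    for load in loaders:
        try:
            return load()
        except Exception as err:
            # Broad on purpose: the exception type raised by a failed
            # download is an assumption, not something documented here.
            last_error = err
    raise RuntimeError("all loaders failed") from last_error


def load_demo_human_or_worm_train():
    """Hypothetical wrapper: cloud cache first, local build as fallback."""
    # Imported lazily so the helper above works without genomic_benchmarks.
    from genomic_benchmarks.dataset_getters.pytorch_datasets import DemoHumanOrWorm
    from genomic_benchmarks.loc2seq import download_dataset

    def from_default_source():
        return DemoHumanOrWorm(split='train')

    def from_reference_genomes():
        # Same call as the workaround above: skip the cloud cache
        # and build the dataset on disk from the genomes.
        download_dataset('demo_human_or_worm', version=0, use_cloud_cache=False)
        return DemoHumanOrWorm(split='train')

    return first_successful([from_default_source, from_reference_genomes])
```

If the failed download only prints a message and leaves a partial dataset directory behind, the fallback would not trigger cleanly; in that case, running the download_dataset call directly, as shown above, is the safer route.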

katarinagresova commented 2 years ago

It seems to be a known issue of gdown: https://github.com/wkentaro/gdown/issues/43

katarinagresova commented 2 years ago

The Google cloud cache was turned off by default (use_cloud_cache=False) in 5b02bb9745efb6f9328da98de19c7729ecdefa9e. You can use your original code to get the dataset; it will be built for you from the reference genome. If you still want to try downloading it from the Google cache, set use_cloud_cache=True manually:

from genomic_benchmarks.dataset_getters.pytorch_datasets import DemoHumanOrWorm
dset = DemoHumanOrWorm(split='train', use_cloud_cache=True)

simecek commented 2 years ago

I have restored use_cloud_cache=True as the default (it is the desirable behavior in 99.9% of cases), so I am reopening the issue. We need to examine it more closely. Unfortunately, the error is hard to reproduce.

If gdown does not resolve it soon, the possible solutions are:

Reverting to googleDriveFileDownloader, if possible (the issue appeared after we switched from googleDriveFileDownloader to gdown; I have not found any record of what exactly the problem with googleDriveFileDownloader was).
Moving the cached files to a similar service provided by Masaryk University, such as OneDrive or ownCloud.
Moving the files to AWS S3 or a GC bucket and paying for the storage.
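For the hosting options that expose a plain HTTPS link (for example a public bucket object), the download step would need no Google Drive client at all. The following is a rough sketch using only the Python standard library. The function name fetch_archive and the example URL are hypothetical, and nothing here reflects how genomic_benchmarks actually downloads files.

```python
import shutil
import urllib.request
from pathlib import Path


def fetch_archive(url: str, dest: Path) -> Path:
    """Stream the file at `url` to `dest` and return `dest`."""
    dest.parent.mkdir(parents=True, exist_ok=True)
    # Copy the response in chunks so large archives are not held in memory.
    with urllib.request.urlopen(url) as response, open(dest, "wb") as out:
        shutil.copyfileobj(response, out)
    return dest


if __name__ == "__main__":
    # Placeholder URL: replace with wherever the cached archive ends up.
    fetch_archive(
        "https://example.org/genomic_benchmarks/demo_human_or_worm.zip",
        Path("demo_human_or_worm.zip"),
    )
```

A direct link like this sidesteps the share-permission and access-quota behavior reported in the error message, at the cost of hosting the files somewhere that serves them directly.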