datalad / datasets.datalad.org

Registry of public datasets provided by the DataLad project
http://datasets.datalad.org

Any chance of hosting the VoxCeleb datasets? #34

Open NeuroForLunch opened 3 years ago

NeuroForLunch commented 3 years ago

Downloads from their site are very slow, the mirrors do not always work, and their Google Drive link is dead.

https://www.robots.ox.ac.uk/~vgg/data/voxceleb/

yarikoptic commented 3 years ago

Wow, https://www.robots.ox.ac.uk/~vgg/data is an awesome collection of datasets. I think it would be worth collecting them under http://datasets.datalad.org/?dir=/labs/vgg (or maybe even straight at the top level?). Some are an easy job for the crawler. Running now

datalad crawl-init --save --template=simple_with_archives 'a_href_match_=.*/data/.*\.zip' url=https://www.robots.ox.ac.uk/~vgg/data/voxconverse/ leading_dirs_depth=0
datalad crawl
datalad install -d . -s https://github.com/joonson/voxconverse labels

to see what happens for the voxconverse one... You can see the result at https://github.com/yarikoptic/demo-vgg-voxconverse (I am not redistributing any data files there, so for this one datalad get will fetch the entire original archive from its original location), which I got there via

datalad create-sibling-github --github-login yarikoptic -s gh-yarikoptic demo-vgg-voxconverse
datalad push --to gh-yarikoptic  # after tuning url to be ssh since github no longer allows user/pw..

VoxCeleb is trickier because of all the split archives: our crawler can fetch them, but then we would really need to redistribute the extracted files after manually "cat"ing the parts together.
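To illustrate what "cat"ing split archives involves, here is a minimal, self-contained sketch. It does not touch the real VoxCeleb files: it fabricates a small file, splits it, and rejoins the parts, so every file name in it (original.bin, archive_part*, rejoined.bin) is a hypothetical stand-in.

```shell
# Demo in a scratch directory; all file names are hypothetical stand-ins.
set -eu
workdir=$(mktemp -d)
cd "$workdir"

# Fabricate a small "archive" and split it into parts, mimicking a split download.
head -c 100000 /dev/urandom > original.bin
split -b 30000 original.bin archive_part   # yields archive_partaa, archive_partab, ...

# Rejoin: glob expansion sorts names lexically, so the parts are read in order.
cat archive_part* > rejoined.bin

# Verify the rejoined file is byte-for-byte identical to the original.
cmp original.bin rejoined.bin && echo "rejoined OK"
```

The same `cat parts* > whole` pattern would apply to real split downloads, provided the part suffixes sort in the intended order; only after rejoining can the archive be extracted and its individual files redistributed.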

(attn @joonson) We would be glad to help ensure dissemination and easier access. But I am not sure we could host and redistribute all of it from datasets.datalad.org, where we generally prefer not to mirror the data. Maybe we could/should provide redistribution through the datalad-osf special remote, i.e. by depositing to OSF...

Overall, the chance exists, but it needs thinking and time investment to make it happen. Interested in joining the effort? ;-)

NeuroForLunch commented 3 years ago

> VoxCeleb is trickier because of all the split archives: our crawler can fetch them, but then we would really need to redistribute the extracted files after manually "cat"ing the parts together.

It would be awesome to be able to download a certain number of files instead of the giant archives.

yarikoptic commented 3 years ago

> > VoxCeleb is trickier because of all the split archives: our crawler can fetch them, but then we would really need to redistribute the extracted files after manually "cat"ing the parts together.
>
> It would be awesome to be able to download a certain number of files instead of the giant archives.

I do get the incentive, and it should be possible.
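As a rough sketch of what selective download could look like from the user's side, assuming a DataLad dataset of the extracted files existed: the helper name fetch_subset and its arguments are hypothetical, and only the `datalad clone` and `datalad get` commands themselves are real.

```shell
# Sketch only: clone a DataLad dataset and fetch content for selected paths.
# Usage: fetch_subset <dataset-url> <target-dir> <path> [<path> ...]
# (fetch_subset is a hypothetical helper, not part of DataLad.)
fetch_subset() {
    if [ "$#" -lt 3 ]; then
        echo "usage: fetch_subset <dataset-url> <target-dir> <path>..." >&2
        return 2
    fi
    url=$1
    target=$2
    shift 2
    # Cloning brings in the file tree and metadata, not the file content.
    datalad clone "$url" "$target" || return 1
    # Fetch content for just the listed paths; the subshell preserves the cwd.
    ( cd "$target" && datalad get "$@" )
}
```

With the voxconverse demo dataset mentioned earlier in the thread, `datalad get` on a single file would still pull the whole original archive, since the files are extracted from it; true per-file download would require the extracted files themselves to be redistributed somewhere.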