Open NeuroForLunch opened 3 years ago
wow -- https://www.robots.ox.ac.uk/~vgg/data is an awesome collection of datasets. I think collecting them under http://datasets.datalad.org/?dir=/labs/vgg (or maybe even straight at the top level?) would be worthwhile. Some are an easy job for the crawler. Running now
datalad crawl-init --save --template=simple_with_archives 'a_href_match_=.*/data/.*\.zip' url=https://www.robots.ox.ac.uk/~vgg/data/voxconverse/ leading_dirs_depth=0
datalad crawl
datalad install -d . -s https://github.com/joonson/voxconverse labels
to see what happens for the voxconverse one... You can see the result at https://github.com/yarikoptic/demo-vgg-voxconverse (I am not redistributing any data files there, so datalad get will fetch the entire original archive from its original location for this one), which I got there via
datalad create-sibling-github --github-login yarikoptic -s gh-yarikoptic demo-vgg-voxconverse
datalad push --to gh-yarikoptic # after tuning url to be ssh since github no longer allows user/pw..
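For anyone who wants to try the result, here is a minimal sketch of the consumer side, assuming datalad is installed. The `run` and `fetch_demo` helpers, the `DRY_RUN` switch, and any path passed in are made up for illustration; only `datalad clone`, `datalad get`, and the repository URL come from this thread or the datalad CLI. As noted above, the first `datalad get` of a file from this dataset is expected to download the whole original archive.

```shell
#!/bin/sh
# Sketch only: clone the demo dataset and fetch content on demand.
# DRY_RUN=1 prints the commands instead of running them (illustrative switch).
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$@"
    else
        "$@"
    fi
}

# fetch_demo [PATH]
#   PATH is a path inside the dataset (hypothetical; list the clone to find
#   real names). Defaults to "." which requests everything.
fetch_demo() {
    target=${1:-.}
    # Clone gives the file tree without file content.
    run datalad clone https://github.com/yarikoptic/demo-vgg-voxconverse demo-vgg-voxconverse &&
    # Content is fetched from the original location only when asked for.
    run datalad get "demo-vgg-voxconverse/$target"
}
```

Calling `DRY_RUN=1; fetch_demo some/file` only prints the two datalad commands, which is a cheap way to check what would run before committing to a large download.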
The voxceleb one is trickier because of all the split archives: our crawler can fetch them, but then we would really need to redistribute the extracted files after manually "cat"ing them all together
(attn @joonson) and we would be glad to help ensure dissemination and easier access. But I am not sure we could host and redistribute all of it from datasets.datalad.org, where we generally prefer not to mirror data. Maybe we could/should provide redistribution through the datalad-osf special remote, i.e. by depositing to OSF...
overall -- the chance exists, but it needs thinking/time investment to make it happen. Interested in joining the effort? ;-)
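The manual "cat"ing step for split archives could look roughly like the sketch below. The `join_parts` helper and all file names are hypothetical; it assumes the parts share a common prefix and carry suffixes that sort in the right order (as the ones produced by split(1) do: aa, ab, ...).

```shell
#!/bin/sh
# Sketch only: concatenate split archive parts back into a single file.
# join_parts PREFIX OUTPUT
#   PREFIX  common leading part of the piece names (hypothetical naming)
#   OUTPUT  file to write; should not already exist under the same prefix,
#           otherwise it would be picked up by the glob as a part
join_parts() {
    prefix=$1
    output=$2
    # The glob expands in sorted order, which matches split(1) suffix order.
    set -- "$prefix"*
    if [ ! -e "$1" ]; then
        echo "no parts matching ${prefix}*" >&2
        return 1
    fi
    cat "$@" > "$output"
}
```

Usage would be something like `join_parts some_archive_part some_archive.zip && unzip some_archive.zip`, with the actual names taken from the download page.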
The voxceleb one is trickier because of all the split archives: our crawler can fetch them, but then we would really need to redistribute the extracted files after manually "cat"ing them all together
It would be awesome to be able to download a certain number of files instead of the giant archives.
I do get the incentive and it should be possible
The downloads are very slow from their site, the mirrors do not always work, and their Google Drive link is dead.
https://www.robots.ox.ac.uk/~vgg/data/voxceleb/