datalad / datalad-crawler

DataLad extension for tracking web resources as datasets
http://datalad.org

crawler pipeline for 'indexes' (ftp/http) with specs for where to break into submodules #78

Open yarikoptic opened 8 years ago

yarikoptic commented 8 years ago

Looking at http://ftp.ncbi.nlm.nih.gov/1000genomes/ftp/, we could easily provide a list of regexps that specify at which level to break into submodules (e.g. http://ftp.ncbi.nlm.nih.gov/1000genomes/ftp/phase1/data/).
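A rough sketch of what such a spec could look like, just to make the idea concrete. Everything here is hypothetical: the `SPLIT_PATTERNS` list, the `subdataset_for` helper, and the `phase\d+/data/` pattern are made up for illustration and are not part of the existing datalad-crawler API. Paths are assumed to be relative to the top URL of the index.

```python
import re

# Hypothetical split spec: any directory (relative to the top URL) that
# fully matches one of these regexps becomes its own submodule.
SPLIT_PATTERNS = [re.compile(r"phase\d+/data/")]


def subdataset_for(rel_path, patterns=SPLIT_PATTERNS):
    """Return the deepest matching directory prefix that owns rel_path.

    An empty string means the path stays in the top-level dataset.
    """
    # Keep only the directory components; the last item is the file name
    # (or an empty string if rel_path itself ends with a slash).
    parts = rel_path.split("/")[:-1]
    owner = ""
    for depth in range(1, len(parts) + 1):
        prefix = "/".join(parts[:depth]) + "/"
        if any(p.fullmatch(prefix) for p in patterns):
            owner = prefix  # later (deeper) matches override earlier ones
    return owner
```

With this spec, a file under `phase1/data/` would land in a `phase1/data/` submodule, while files directly under the top URL or under `phase1/` would stay in the top-level dataset.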

yarikoptic commented 7 years ago

Here is a good one for you, @glalteva. Create a new template/pipeline (e.g. call it "index_fetcher") that lets one define a topurl and how to split it into subdatasets, and then fetches all of those. It should also (optionally) support versioning, e.g. as on this website, if file names carry versioned suffixes.
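For the versioning part, one possible starting point is to separate the versioned suffix from the file name and group the versions per file. This is only a sketch under an assumed naming convention (`<stem>_v<dotted digits><ext>`, e.g. `data_v1.2.csv`); real indexes may use other schemes, and `split_version` and `group_versions` are hypothetical helper names, not existing crawler functions.

```python
import re
from collections import defaultdict

# Assumed convention: "<stem>_v<dotted digits><optional extension>".
VERSIONED = re.compile(
    r"^(?P<stem>.+?)_v(?P<version>\d+(?:\.\d+)*)(?P<ext>\.[^.]+)?$"
)


def split_version(filename):
    """Return (unversioned_name, version), or (filename, None) if unversioned."""
    m = VERSIONED.match(filename)
    if m is None:
        return filename, None
    return m.group("stem") + (m.group("ext") or ""), m.group("version")


def version_key(version):
    """Compare versions numerically, so that 1.10 sorts after 1.2."""
    return tuple(int(part) for part in version.split("."))


def group_versions(filenames):
    """Map each unversioned name to its versions, oldest first."""
    groups = defaultdict(list)
    for filename in filenames:
        name, version = split_version(filename)
        if version is not None:
            groups[name].append(version)
    return {name: sorted(vs, key=version_key) for name, vs in groups.items()}
```

The pipeline could then add the versions of each file in that order, so that the dataset history follows the upstream versioning.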

glalteva commented 7 years ago

Other (smaller) datasets to work with:
http://index.okfn.org/dataset/
http://data.caida.org/datasets/as-relationships/
https://archive.ics.uci.edu/ml/machine-learning-databases/heart-disease/