datalad / datalad-crawler

DataLad extension for tracking web resources as datasets
http://datalad.org

Support pure .gz (not .tar.gz) files by exposing a new template argument archives_re #27

Closed yarikoptic closed 5 years ago

yarikoptic commented 5 years ago

Here is an example use:

$> datalad create eurostat-data     
[INFO   ] Creating a new annex repo at /mnt/btrfs/datasets/datalad/crawl-misc/eurostat-data 
create(ok): /mnt/btrfs/datasets/datalad/crawl-misc/eurostat-data (dataset)
(dev) 2 15983.....................................:Mon 03 Dec 2018 02:04:14 PM EST:.
smaug:/mnt/btrfs/datasets/datalad/crawl-misc
$> cd eurostat-data
(dev) 2 15984.....................................:Mon 03 Dec 2018 02:04:16 PM EST:.
(git)smaug:/mnt/btrfs/datasets/datalad/crawl-misc/eurostat-data[master]
$> datalad crawl-init --save --template=simple_with_archives 'url=https://ec.europa.eu/eurostat/estat-navtree-portlet-prod/BulkDownloadListing?sort=1&dir=data' a_href_match_=.*\.Downl.*acf_d_eq1.* 'archives_re=\.gz$'
[INFO   ] Creating a pipeline to crawl data files from https://ec.europa.eu/eurostat/estat-navtree-portlet-prod/BulkDownloadListing?sort=1&dir=data                                                   
[INFO   ] Initiating special remote datalad-archives 
(dev) 2 15985.....................................:Mon 03 Dec 2018 02:04:57 PM EST:.
(git)smaug:/mnt/btrfs/datasets/datalad/crawl-misc/eurostat-data[master]
$> datalad crawl
[INFO   ] Loading pipeline specification from ./.datalad/crawl/crawl.cfg 
[INFO   ] Creating a pipeline to crawl data files from https://ec.europa.eu/eurostat/estat-navtree-portlet-prod/BulkDownloadListing?sort=1&dir=data 
[INFO   ] Running pipeline [<function switch_branch at 0x7fd284682d70>, [[<datalad_crawler.nodes.crawl_url.crawl_url object at 0x7fd2846ae910>, a_href_match(query='.*.Downl.*acf_d_eq1.*'), <function fix_url at 0x7fd293d57b18>, <datalad_crawler.nodes.annex.Annexificator object at 0x7fd2953c9550>]], <function switch_branch at 0x7fd28fa7caa0>, [<function merge_branch at 0x7fd284687f50>, [find_files(dirs=False, fail_if_none=True, regex='\\.gz$', topdir='.'), <function _add_archive_content at 0x7fd284687c08>]], <function switch_branch at 0x7fd284687e60>, <function merge_branch at 0x7fd284687ed8>, <function _finalize at 0x7fd284639050>] 
[INFO   ] Found branch non-dirty -- nothing was committed 
[INFO   ] Checking out master into a new branch incoming 
[INFO   ] Fetching 'https://ec.europa.eu/eurostat/estat-navtree-portlet-prod/BulkDownloadListing?sort=1&dir=data' 
[INFO   ] Need to download 11.7 kB from https://ec.europa.eu/eurostat/estat-navtree-portlet-prod/BulkDownloadListing?sort=1&file=data%2Facf_d_eq1.tsv.gz. No progress indication will be reported 
[INFO   ] Repository found dirty -- adding and committing 
[INFO   ] Checking out master into a new branch incoming-processed 
[INFO   ] Initiating 1 merge of incoming using strategy theirs 
[INFO   ] Adding content of the archive ./acf_d_eq1.tsv.gz into annex <AnnexRepo path=/mnt/btrfs/datasets/datalad/crawl-misc/eurostat-data (<class 'datalad.support.annexrepo.AnnexRepo'>)> 
[INFO   ] Finished adding ./acf_d_eq1.tsv.gz: Files processed: 1, +annex: 1 
[INFO   ] Repository found dirty -- adding and committing 
[INFO   ] Checking out an existing branch master 
[INFO   ] Initiating 1 merge of incoming-processed using strategy None 
[INFO   ] Found branch non-dirty -- nothing was committed 
[INFO   ] House keeping: gc, repack and clean 
[INFO   ] Finished running pipeline: URLs processed: 2, downloaded: 1, size: 11.7 kB,  Files processed: 4, skipped: 1, +annex: 2,  Branches merged: incoming->incoming-processed 
[INFO   ] Total stats: URLs processed: 2, downloaded: 1, size: 11.7 kB,  Files processed: 4, skipped: 1, +annex: 2,  Branches merged: incoming->incoming-processed,  Datasets crawled: 1 
datalad crawl  3.20s user 7.39s system 59% cpu 17.898 total
(dev) 2 15986.....................................:Mon 03 Dec 2018 02:05:18 PM EST:.
(git)smaug:/mnt/btrfs/datasets/datalad/crawl-misc/eurostat-data[master]
$> ls
acf_d_eq1.tsv@
(dev) 2 15987.....................................:Mon 03 Dec 2018 02:05:19 PM EST:.
(git)smaug:/mnt/btrfs/datasets/datalad/crawl-misc/eurostat-data[master]
$> git annex whereis acf_d_eq1.tsv 
whereis acf_d_eq1.tsv (2 copies) 
        01cd81d9-15b7-4777-8812-6e006f93ae21 -- yoh@smaug:/mnt/btrfs/datasets/datalad/crawl-misc/eurostat-data [here]
        c04eb54b-4b4e-5755-8436-866b043170fa -- [datalad-archives]

  datalad-archives: dl+archive:MD5E-s11707--58c6c0def96cdffac49d88e92fb5c4dd.tsv.gz#path=acf_d_eq1.tsv&size=83818
ok
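
To illustrate what the session above shows, here is a minimal Python sketch, not the crawler's actual code. It assumes the `archives_re` pattern is applied to file names with `re.search` semantics, and that the `dl+archive:` URL printed by `git annex whereis` has the shape `dl+archive:<key>#<query-style parameters>`, as in the output above. The helper names `select_archives` and `parse_archive_url` are hypothetical and exist only for this example.

```python
import re
from urllib.parse import parse_qs


def select_archives(filenames, archives_re):
    """Return the names matched by the pattern (hypothetical helper).

    Assumption: the pattern is searched anywhere in the name, so an
    anchored pattern such as r'\\.gz$' selects both pure .gz files and
    .tar.gz files.
    """
    pattern = re.compile(archives_re)
    return [name for name in filenames if pattern.search(name)]


def parse_archive_url(url):
    """Split a dl+archive URL into (annex key, parameters) (hypothetical helper).

    Assumption: everything after '#' is a query-style string such as
    'path=acf_d_eq1.tsv&size=83818'.
    """
    prefix = "dl+archive:"
    if not url.startswith(prefix):
        raise ValueError("not a dl+archive URL: %r" % url)
    key, _, fragment = url[len(prefix):].partition("#")
    # parse_qs returns lists of values; keep the first value of each field
    params = {name: values[0] for name, values in parse_qs(fragment).items()}
    return key, params


if __name__ == "__main__":
    names = ["acf_d_eq1.tsv.gz", "data.tar.gz", "readme.txt", "notes.gz.bak"]
    print(select_archives(names, r"\.gz$"))

    url = ("dl+archive:MD5E-s11707--58c6c0def96cdffac49d88e92fb5c4dd.tsv.gz"
           "#path=acf_d_eq1.tsv&size=83818")
    print(parse_archive_url(url))
```

Note that the pattern `\.gz$` also matches `.tar.gz` names, so a template that should treat the two differently would need a more specific expression.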

Closes #24

yarikoptic commented 5 years ago

Crappy scrapy fails Travis for us again. The rest passes, so I will merge.