Closed yarikoptic closed 5 years ago
archives_re should be provided to the pipeline. E.g. here is an example use:
$> datalad create eurostat-data
[INFO ] Creating a new annex repo at /mnt/btrfs/datasets/datalad/crawl-misc/eurostat-data
create(ok): /mnt/btrfs/datasets/datalad/crawl-misc/eurostat-data (dataset)
(dev) 2 15983.....................................:Mon 03 Dec 2018 02:04:14 PM EST:.
smaug:/mnt/btrfs/datasets/datalad/crawl-misc
$> cd eurostat-data
(dev) 2 15984.....................................:Mon 03 Dec 2018 02:04:16 PM EST:.
(git)smaug:/mnt/btrfs/datasets/datalad/crawl-misc/eurostat-data[master]
$> datalad crawl-init --save --template=simple_with_archives 'url=https://ec.europa.eu/eurostat/estat-navtree-portlet-prod/BulkDownloadListing?sort=1&dir=data' a_href_match_=.*\.Downl.*acf_d_eq1.* 'archives_re=\.gz$'
[INFO ] Creating a pipeline to crawl data files from https://ec.europa.eu/eurostat/estat-navtree-portlet-prod/BulkDownloadListing?sort=1&dir=data
[INFO ] Initiating special remote datalad-archives
(dev) 2 15985.....................................:Mon 03 Dec 2018 02:04:57 PM EST:.
(git)smaug:/mnt/btrfs/datasets/datalad/crawl-misc/eurostat-data[master]
$> datalad crawl
[INFO ] Loading pipeline specification from ./.datalad/crawl/crawl.cfg
[INFO ] Creating a pipeline to crawl data files from https://ec.europa.eu/eurostat/estat-navtree-portlet-prod/BulkDownloadListing?sort=1&dir=data
[INFO ] Running pipeline [<function switch_branch at 0x7fd284682d70>, [[<datalad_crawler.nodes.crawl_url.crawl_url object at 0x7fd2846ae910>, a_href_match(query='.*.Downl.*acf_d_eq1.*'), <function fix_url at 0x7fd293d57b18>, <datalad_crawler.nodes.annex.Annexificator object at 0x7fd2953c9550>]], <function switch_branch at 0x7fd28fa7caa0>, [<function merge_branch at 0x7fd284687f50>, [find_files(dirs=False, fail_if_none=True, regex='\\.gz$', topdir='.'), <function _add_archive_content at 0x7fd284687c08>]], <function switch_branch at 0x7fd284687e60>, <function merge_branch at 0x7fd284687ed8>, <function _finalize at 0x7fd284639050>]
[INFO ] Found branch non-dirty -- nothing was committed
[INFO ] Checking out master into a new branch incoming
[INFO ] Fetching 'https://ec.europa.eu/eurostat/estat-navtree-portlet-prod/BulkDownloadListing?sort=1&dir=data'
[INFO ] Need to download 11.7 kB from https://ec.europa.eu/eurostat/estat-navtree-portlet-prod/BulkDownloadListing?sort=1&file=data%2Facf_d_eq1.tsv.gz. No progress indication will be reported
[INFO ] Repository found dirty -- adding and committing
[INFO ] Checking out master into a new branch incoming-processed
[INFO ] Initiating 1 merge of incoming using strategy theirs
[INFO ] Adding content of the archive ./acf_d_eq1.tsv.gz into annex <AnnexRepo path=/mnt/btrfs/datasets/datalad/crawl-misc/eurostat-data (<class 'datalad.support.annexrepo.AnnexRepo'>)>
[INFO ] Finished adding ./acf_d_eq1.tsv.gz: Files processed: 1, +annex: 1
[INFO ] Repository found dirty -- adding and committing
[INFO ] Checking out an existing branch master
[INFO ] Initiating 1 merge of incoming-processed using strategy None
[INFO ] Found branch non-dirty -- nothing was committed
[INFO ] House keeping: gc, repack and clean
[INFO ] Finished running pipeline: URLs processed: 2, downloaded: 1, size: 11.7 kB, Files processed: 4, skipped: 1, +annex: 2, Branches merged: incoming->incoming-processed
[INFO ] Total stats: URLs processed: 2, downloaded: 1, size: 11.7 kB, Files processed: 4, skipped: 1, +annex: 2, Branches merged: incoming->incoming-processed, Datasets crawled: 1
datalad crawl 3.20s user 7.39s system 59% cpu 17.898 total
(dev) 2 15986.....................................:Mon 03 Dec 2018 02:05:18 PM EST:.
(git)smaug:/mnt/btrfs/datasets/datalad/crawl-misc/eurostat-data[master]
$> ls
acf_d_eq1.tsv@
(dev) 2 15987.....................................:Mon 03 Dec 2018 02:05:19 PM EST:.
(git)smaug:/mnt/btrfs/datasets/datalad/crawl-misc/eurostat-data[master]
$> git annex whereis acf_d_eq1.tsv
whereis acf_d_eq1.tsv (2 copies)
  01cd81d9-15b7-4777-8812-6e006f93ae21 -- yoh@smaug:/mnt/btrfs/datasets/datalad/crawl-misc/eurostat-data [here]
  c04eb54b-4b4e-5755-8436-866b043170fa -- [datalad-archives]
  datalad-archives: dl+archive:MD5E-s11707--58c6c0def96cdffac49d88e92fb5c4dd.tsv.gz#path=acf_d_eq1.tsv&size=83818
ok
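For illustration only, here is a rough sketch of how a pattern like the archives_re value above picks out archive files. It is not the datalad code: select_archives and the candidates list are hypothetical, and it assumes the pattern is applied with search semantics (a match anywhere in the name), which is why the trailing $ matters.

```python
import re

# Same pattern as passed on the command line via 'archives_re=\.gz$'
archives_re = r"\.gz$"


def select_archives(filenames, regex=archives_re):
    """Return the names in which the pattern is found (re.search semantics)."""
    pattern = re.compile(regex)
    return [name for name in filenames if pattern.search(name)]


# Hypothetical listing; only the first name appears in the session above
candidates = ["acf_d_eq1.tsv.gz", "acf_d_eq1.tsv", "notes.gz.txt"]
selected = select_archives(candidates)
print(selected)
```

Without the $ anchor, a name such as notes.gz.txt would also be selected, so anchoring the pattern to the end of the name is the safer choice.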
Closes #24
Crappy scrapy fails Travis for us again. The rest passes, so I will merge.