datalad / datalad-crawler

DataLad extension for tracking web resources as datasets
http://datalad.org

datalad crawl: Changing behaviour between HCP900/1200 #48

Open TobiasKadelka opened 5 years ago

TobiasKadelka commented 5 years ago

At the moment I am trying datalad-crawler on a single subject. At first I used "HCP" as the prefix (for HCP_500), ran `datalad crawl`, and saved. After that I changed the prefix value in crawl.cfg to HCP_900, ran `datalad crawl` again, and it worked. But when I now change the prefix to HCP_1200, `datalad crawl` fails with an error. (Also, when I switch back and forth between 900 and 1200 and rerun `datalad crawl`, the error message changes with it.)
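For reference, a sketch of that sequence (the `sed` one-liners are just a stand-in for however you edit the prefix in crawl.cfg by hand; the dataset is assumed to already be set up with the simple_s3 template shown below):

```sh
# Reproduction sketch: switch the crawl prefix and re-run the crawler.
sed -i 's|^_prefix = HCP/|_prefix = HCP_900/|' .datalad/crawl/crawl.cfg
datalad crawl    # works
sed -i 's|^_prefix = HCP_900/|_prefix = HCP_1200/|' .datalad/crawl/crawl.cfg
datalad crawl    # fails with the UnboundLocalError shown below
```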

crawl.cfg:

```
(datalad) tkadelka@brainb02 in ~/hcp_test/123420 on git:master ❱ cat .datalad/crawl/crawl.cfg
[crawl:pipeline]
template = simple_s3
_prefix = HCP_1200/123420/
_bucket = hcp-openaccess
_to_http = False
_skip_problematic = False
```
`datalad --dbg crawl` for HCP_900:

```
(datalad) tkadelka@brainb02 in ~/hcp_test/123420 on git:master ❱ datalad --dbg crawl
[INFO ] Loading pipeline specification from ./.datalad/crawl/crawl.cfg
[INFO ] Creating a pipeline for the hcp-openaccess bucket
[INFO ] Running pipeline [, switch(default=None, key='datalad_action', mapping=<<{'commit': >, re=False)]
[INFO ] S3 session: Connecting to the bucket hcp-openaccess with authentication
[INFO ] Finished running pipeline: skipped: 16446
[INFO ] Total stats: skipped: 16446, Datasets crawled: 1
Exception ignored in:
```
`datalad --dbg crawl` for HCP_1200:

```
(datalad) tkadelka@brainb02 in ~/hcp_test/123420 on git:master ❱ datalad --dbg crawl
[INFO ] Loading pipeline specification from ./.datalad/crawl/crawl.cfg
[INFO ] Creating a pipeline for the hcp-openaccess bucket
[INFO ] Running pipeline [, switch(default=None, key='datalad_action', mapping=<<{'commit': >, re=False)]
[INFO ] S3 session: Connecting to the bucket hcp-openaccess with authentication
Traceback (most recent call last):
  File "/home/tkadelka/env/datalad/bin/datalad", line 8, in <module>
    main()
  File "/home/tkadelka/env/datalad/datalad/datalad/cmdline/main.py", line 500, in main
    ret = cmdlineargs.func(cmdlineargs)
  File "/home/tkadelka/env/datalad/datalad/datalad/interface/base.py", line 643, in call_from_parser
    ret = cls.__call__(**kwargs)
  File "/home/tkadelka/env/datalad/datalad-crawler/datalad_crawler/crawl.py", line 130, in __call__
    output = run_pipeline(pipeline, stats=stats)
  File "/home/tkadelka/env/datalad/datalad-crawler/datalad_crawler/pipeline.py", line 114, in run_pipeline
    output = list(xrun_pipeline(*args, **kwargs))
  File "/home/tkadelka/env/datalad/datalad-crawler/datalad_crawler/pipeline.py", line 194, in xrun_pipeline
    for idata_out, data_out in enumerate(xrun_pipeline_steps(pipeline, data_in, output=output_sub)):
  File "/home/tkadelka/env/datalad/datalad-crawler/datalad_crawler/pipeline.py", line 270, in xrun_pipeline_steps
    for data_ in data_in_to_loop:
  File "/home/tkadelka/env/datalad/datalad-crawler/datalad_crawler/nodes/s3.py", line 187, in __call__
    versions_sorted = versions_sorted[start:]
UnboundLocalError: local variable 'start' referenced before assignment
> /home/tkadelka/env/datalad/datalad-crawler/datalad_crawler/nodes/s3.py(187)__call__()
-> versions_sorted = versions_sorted[start:]
(Pdb)
```
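The crash is at s3.py line 187, where `start` appears to be bound only when the previously crawled version is found in the listing. A minimal illustrative sketch of that failure pattern (not the actual datalad-crawler source; the function and attribute names here are hypothetical):

```python
# Hypothetical reduction of the pattern behind the UnboundLocalError;
# the real logic lives in datalad_crawler/nodes/s3.py.
def resume_after(versions_sorted, last_crawled_version_id):
    for i, version in enumerate(versions_sorted):
        if version.version_id == last_crawled_version_id:
            start = i + 1  # 'start' only gets bound when a match is found
            break
    # After switching _prefix (e.g. HCP_900 -> HCP_1200), the version id
    # recorded under the old prefix never matches any key under the new
    # one, the loop falls through without assigning 'start', and this
    # line raises UnboundLocalError -- as in the traceback above.
    return versions_sorted[start:]
```

If that reading is right, it would also explain why the error follows the prefix: whichever prefix was crawled last leaves behind a recorded version that the other prefix's listing cannot match.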
yarikoptic commented 5 years ago

Script it and try again, but run `git rm -rf .datalad/crawl/versions && git commit -m "killing the version history"` between switches. That would be the right thing to do, though it might lead to some other issues. Otherwise you might miss some files: e.g., if there are changes to HCP/ AFTER the initial change to HCP_900 for that subject, then your crawl of HCP_900 will pick up only from the date when the changes to HCP/ happened, and thus might completely miss files added/changed in HCP_900 before that date (that is why I was thinking about doing it all via branches).
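A sketch of that suggested reset between prefix switches (assuming you are at the top of the crawled dataset and are fine with discarding the recorded version history):

```sh
# Drop the crawler's recorded S3 version state before re-crawling under
# a new prefix, so the pipeline starts fresh instead of resuming from
# versions recorded for the old prefix.
git rm -rf .datalad/crawl/versions
git commit -m "killing the version history"
# then point crawl.cfg at the new prefix and re-run:
datalad crawl
```

This trades the stale resume point for a full re-crawl; as noted above, it sidesteps the mismatch but may surface other issues.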