datalad / datalad-crawler

DataLad extension for tracking web resources as datasets
http://datalad.org

XNAT and NITRC support #9

Closed chaselgrove closed 4 years ago

chaselgrove commented 5 years ago

Tests require DATALAD_TEST_XNAT to be set (even with a VCR cassette in place, the datalad_crawler.pipelines.tests.test_xnat tests take 4-5 hours).

I still can't get test_nitrc_pipeline to run, even after a datalad download-url to NITRC. I avoid the test error by using get_test_providers('https://www.nitrc.org/ir/').
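
For anyone trying to reproduce this locally, the workaround looks roughly like the following sketch (the import path of get_test_providers is an assumption about the datalad test utilities and may differ across versions):

```python
# Sketch: make sure a provider (and credentials, if any are configured) is
# known for the NITRC XNAT instance before exercising the pipeline tests.
# The import location is assumed; adjust it to your datalad version.
from datalad.downloaders.tests.utils import get_test_providers

# Registers/returns the providers for this URL; in the test suite this call
# is also what skips the test when no known credentials are available.
providers = get_test_providers('https://www.nitrc.org/ir/')
```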

codecov-io commented 5 years ago

Codecov Report

Merging #9 into master will decrease coverage by 2.95%. The diff coverage is 36.01%.


```
@@            Coverage Diff             @@
##           master       #9      +/-   ##
==========================================
- Coverage   82.55%   79.60%   -2.96%
==========================================
  Files          53       55       +2
  Lines        4214     4500     +286
==========================================
+ Hits         3479     3582     +103
- Misses        735      918     +183
```

| Impacted Files | Coverage Δ |
|---|---|
| datalad_crawler/pipelines/xnat.py | 25.41% <25.41%> (ø) |
| datalad_crawler/pipelines/tests/test_xnat.py | 54.28% <54.28%> (ø) |


yarikoptic commented 4 years ago

@mih if interested to try -- I pushed up some tune-ups so there is no longer as heavy a cost for (re)crawling a single dataset. I am recrawling a few sample datasets ATM and will then try to run it afresh for NITRC-ir and/or XNAT Central to see what we would get. For the final one I ideally should extend the pipeline with

  1. specifying metadata extractors (I guess nifti1, dicom, and xmp)
  2. calling out to metadata extraction before dropping all the load upon finishing the crawl

and then we could populate some datasets if there is interest in getting access to XNAT Central or NITRC-ir. FWIW -- I have not tried with non-public setups yet, and some logic for dropping non-empty ones would restrict it to "public" only - not sure if I should just kick that away. Maybe @chaselgrove remembers why that limit was there.
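
A rough sketch of what step 1 above could look like, assuming the pre-metalad datalad.metadata.nativetype configuration and the aggregate_metadata command are available; the extractor names simply mirror the list above:

```python
# Sketch (assumption: datalad.metadata.nativetype and aggregate_metadata
# exist in the installed DataLad version, i.e. pre-metalad metadata support).
from datalad.api import Dataset, aggregate_metadata

ds = Dataset('.')  # the freshly crawled (sub)dataset

# 1. declare which metadata extractors should be used for this dataset
for extractor in ('nifti1', 'dicom', 'xmp'):
    ds.config.add('datalad.metadata.nativetype', extractor, where='dataset')
ds.save(path='.datalad/config', message="Configure metadata extractors")

# 2. extract/aggregate metadata while the annexed content is still present,
#    i.e. before the crawler drops the load at the end of the crawl
aggregate_metadata(dataset=ds.path)
```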

mih commented 4 years ago

Ping @bpoldrack

bpoldrack commented 4 years ago

Looking into it.

Tried to run the tests first and failed. Probably a minor thing (assuming that for some reason test_nitrc_superpipeline doesn't raise "SKIP: This test requires known credentials for nitrc" while test_nitrc_pipeline does). @yarikoptic you might want to have a look:

Tests output:

```
$> DATALAD_TEST_XNAT=1 python -m nose -s -v datalad_crawler/pipelines/tests/test_xnat.py
datalad_crawler.pipelines.tests.test_xnat.test_nitrc_pipeline ... SKIP: This test requires known credentials for nitrc
datalad_crawler.pipelines.tests.test_xnat.test_nitrc_superpipeline ... FAIL
datalad_crawler.pipelines.tests.test_xnat.test_smoke_pipelines ... [.get_projects at 0x7fe2bdeacc80>, assign(assignments=<<{'dataset': '%(id)s', ...>>, interpolate=True), initiate_dataset(add_fields={}, add_to_super='auto', backend=None, branch=None, data_fields=<<['dataset', 'url', 'pr...>>, dataset_name=None, existing='skip', path=None, template='xnat', template_func=None, template_kwargs=None)] ok
datalad_crawler.pipelines.tests.test_xnat.test_basic_xnat_interface ... ok

Versions: appdirs=1.4.3 boto=2.49.0 cmd:annex=7.20190129 cmd:git=2.20.1 cmd:system-git=2.20.1 cmd:system-ssh=7.9p1 git=2.1.11 gitdb=2.0.5 humanize=0.5.1 iso8601=0.1.12 keyring=19.0.2 keyrings.alt=3.1.1 msgpack=0.6.1 requests=2.22.0 scrapy=1.6.0 six=1.12.0 tqdm=4.32.2 wrapt=1.11.2
Obscure filename: str=b' "\';a&b&c\xce\x94\xd0\x99\xd7\xa7\xd9\x85\xe0\xb9\x97\xe3\x81\x82 `| ' repr=' "\';a&b&cΔЙקم๗あ `| '
Encodings: default='utf-8' filesystem='utf-8' locale.prefered='UTF-8'
Environment: LANGUAGE='en_US:en' LANG='en_US.UTF-8' PATH='/tmp/test-xnat/bin:/usr/local/bin:/usr/bin:/bin:/usr/local/games:/usr/games' GIT_PYTHON_GIT_EXECUTABLE='/usr/bin/git'

======================================================================
FAIL: datalad_crawler.pipelines.tests.test_xnat.test_nitrc_superpipeline
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/tmp/test-xnat/lib/python3.7/site-packages/nose/case.py", line 198, in runTest
    self.test(*self.arg)
  File "/home/ben/work/hacking/datalad-crawler/datalad_crawler/pipelines/tests/test_xnat.py", line 72, in newfunc
    return func(*args, **kwargs)
  File "/tmp/test-xnat/lib/python3.7/site-packages/datalad/tests/utils.py", line 615, in newfunc
    return t(*(arg + (filename,)), **kw)
  File "/home/ben/work/hacking/datalad-crawler/datalad_crawler/pipelines/tests/test_xnat.py", line 144, in test_nitrc_superpipeline
    out = run_pipeline(pipeline)
  File "/home/ben/work/hacking/datalad-crawler/datalad_crawler/pipeline.py", line 114, in run_pipeline
    output = list(xrun_pipeline(*args, **kwargs))
  File "/home/ben/work/hacking/datalad-crawler/datalad_crawler/pipeline.py", line 194, in xrun_pipeline
    for idata_out, data_out in enumerate(xrun_pipeline_steps(pipeline, data_in, output=output_sub)):
  File "/home/ben/work/hacking/datalad-crawler/datalad_crawler/pipeline.py", line 270, in xrun_pipeline_steps
    for data_ in data_in_to_loop:
  File "/home/ben/work/hacking/datalad-crawler/datalad_crawler/pipelines/xnat.py", line 317, in get_projects
    drop_empty=drop_empty
  File "/home/ben/work/hacking/datalad-crawler/datalad_crawler/pipelines/xnat.py", line 184, in get_projects
    all_projects = self('data/projects', **kw)
  File "/home/ben/work/hacking/datalad-crawler/datalad_crawler/pipelines/xnat.py", line 137, in __call__
    assert j.keys() == ['resultset']
AssertionError:
-------------------- >> begin captured logging << --------------------
urllib3.connectionpool: DEBUG: Starting new HTTPS connection (1): www.nitrc.org:443
urllib3.connectionpool: DEBUG: https://www.nitrc.org:443 "GET /ir/data/projects?accessible=true&format=json HTTP/1.1" 200 None
--------------------- >> end captured logging << ---------------------
----------------------------------------------------------------------
Ran 4 tests in 3.742s

FAILED (SKIP=1, failures=1)
```

Edit: Whether or not that is indeed the issue with that particular test, we should have a proper error message instead of just crashing in such cases.
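
As an illustration, the bare assert seen in the traceback above could be replaced by something along these lines (a sketch only; variable names such as query are placeholders, not the actual ones in xnat.py):

```python
# Sketch: fail with an informative message instead of a bare AssertionError
# when the XNAT REST response does not have the expected shape.
if list(j) != ['resultset']:
    raise ValueError(
        "Unexpected JSON structure for query %r: expected a single "
        "'resultset' key, got keys %s" % (query, list(j)))
```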

yarikoptic commented 4 years ago

Thanks! Would you mind getting into the debugger at that point and printing j?

bpoldrack commented 4 years ago

Here you go, @yarikoptic

(Pdb) print(j) ``` {'resultset': {'Result': [{'project_access': 'public', 'quarantine_status': 'active', 'project_access_img': '/@WEBAPPimages/globe.gif', 'insert_date': '2011-06-06 04:43:51.0', 'description': 'The 1000 Functional Connectomes Project.', 'last_accessed_3': '2019-07-17 05:18:51.813922', 'insert_user': 'christian', 'user_role_3': '', 'project_invs': '', 'secondary_id': '1000 FC', 'name': '1000 Functional Connectomes', 'pi': '', 'id': 'fcon_1000'}, {'project_access': 'protected', 'quarantine_status': 'active', 'project_access_img': '/@WEBAPPimages/key.gif', 'insert_date': '2012-09-06 07:20:15.0', 'description': 'Autism Brain Imaging Data Exchange', 'last_accessed_3': '', 'insert_user': '', 'user_role_3': '', 'project_invs': '', 'secondary_id': 'ABIDE', 'name': 'ABIDE', 'pi': '', 'id': 'ABIDE'}, {'project_access': 'protected', 'quarantine_status': 'active', 'project_access_img': '/@WEBAPPimages/key.gif', 'insert_date': '2018-10-22 09:30:13.782', 'description': 'Autism Brain Imaging Data Exchange II', 'last_accessed_3': '', 'insert_user': 'admin', 'user_role_3': '', 'project_invs': '', 'secondary_id': 'ABIDE_II', 'name': 'ABIDE_II', 'pi': '', 'id': 'ABIDE_II'}, {'project_access': 'protected', 'quarantine_status': 'active', 'project_access_img': '/@WEBAPPimages/key.gif', 'insert_date': '2012-07-24 18:50:32.0', 'description': 'The ADHD-200 sample from the 1000 Functional Connectomes project.', 'last_accessed_3': '', 'insert_user': 'acrowley', 'user_role_3': '', 'project_invs': '', 'secondary_id': 'ADHD-200', 'name': 'ADHD-200', 'pi': '', 'id': 'adhd_200'}, {'project_access': 'protected', 'quarantine_status': 'active', 'project_access_img': '/@WEBAPPimages/key.gif', 'insert_date': '2013-04-03 12:43:41.0', 'description': 'INDI Beijing Enhanced.', 'last_accessed_3': '', 'insert_user': 'acrowley', 'user_role_3': '', 'project_invs': '', 'secondary_id': 'Beijing Enhanced', 'name': 'Beijing Enhanced', 'pi': '', 'id': 'beijing_enh'}, {'project_access': 'protected', 'quarantine_status': 'active', 'project_access_img': '/@WEBAPPimages/key.gif', 'insert_date': '2013-04-03 12:44:30.0', 'description': 'INDI Beijing Eyes Open Eyes Closed Study', 'last_accessed_3': '', 'insert_user': 'acrowley', 'user_role_3': '', 'project_invs': '', 'secondary_id': 'Beijing Eyes Open Eyes C', 'name': 'Beijing Eyes Open Eyes Closed', 'pi': '', 'id': 'beijing_eoec'}, {'project_access': 'protected', 'quarantine_status': 'active', 'project_access_img': '/@WEBAPPimages/key.gif', 'insert_date': '2013-04-03 12:45:09.0', 'description': 'INDI Beijing Short TR Study', 'last_accessed_3': '', 'insert_user': 'acrowley', 'user_role_3': '', 'project_invs': '', 'secondary_id': 'Beijing Short TR', 'name': 'Beijing Short TR', 'pi': '', 'id': 'short_tr'}, {'project_access': 'protected', 'quarantine_status': 'active', 'project_access_img': '/@WEBAPPimages/key.gif', 'insert_date': '2018-10-24 15:23:35.883', 'description': 'CMI Healthy Brain Network', 'last_accessed_3': '', 'insert_user': 'admin', 'user_role_3': '', 'project_invs': '', 'secondary_id': 'CMI Healthy Brain Networ', 'name': 'CMI Healthy Brain Network', 'pi': '', 'id': 'hbn'}, {'project_access': 'protected', 'quarantine_status': 'active', 'project_access_img': '/@WEBAPPimages/key.gif', 'insert_date': '2014-09-19 09:56:08.0', 'description': 'The goal of CoRR was to create an open science resource for the imaging community that facilitates the assessment of test-retest reliability and reproducibility ', 'last_accessed_3': '', 'insert_user': 'christian', 'user_role_3': '', 
'project_invs': '', 'secondary_id': 'CoRR', 'name': 'Consortium for Reliability and Reproducibility (CoRR)', 'pi': '', 'id': 'corr'}, {'project_access': 'public', 'quarantine_status': 'active', 'project_access_img': '/@WEBAPPimages/globe.gif', 'insert_date': '2019-06-25 16:53:40.448', 'description': 'The Dallas Lifespan Brain Study (DLBS) is a major effort designed to understand the antecedents of preservation and decline of cognitive function at different st', 'last_accessed_3': '2019-07-17 06:08:21.330924', 'insert_user': 'admin', 'user_role_3': '', 'project_invs': '', 'secondary_id': 'DLBS', 'name': 'Dallas Lifespan Brain Study', 'pi': '', 'id': 'dlbs'}, {'project_access': 'protected', 'quarantine_status': 'active', 'project_access_img': '/@WEBAPPimages/key.gif', 'insert_date': '2018-02-23 15:37:55.166', 'description': 'The Brain Genomics Superstruct Project Open Access Data Release exposes a carefully vetted collection of neuroimaging, behavior, cognitive, and personality data ', 'last_accessed_3': '', 'insert_user': 'acrowley', 'user_role_3': '', 'project_invs': '', 'secondary_id': 'GSP', 'name': 'Brain Genomics Superstruct Project', 'pi': '', 'id': 'GSP'}, {'project_access': 'protected', 'quarantine_status': 'active', 'project_access_img': '/@WEBAPPimages/key.gif', 'insert_date': '2013-09-23 14:03:46.0', 'description': '', 'last_accessed_3': '', 'insert_user': 'christian', 'user_role_3': '', 'project_invs': '', 'secondary_id': 'INDI NKI/Rockland Sample', 'name': 'INDI NKI/Rockland Sample', 'pi': '', 'id': 'nki_rockland'}, {'project_access': 'public', 'quarantine_status': 'active', 'project_access_img': '/@WEBAPPimages/globe.gif', 'insert_date': '2015-02-23 13:21:47.0', 'description': 'IXI (Information eXtraction from Images) dataset. \n 600 MR images from normal, healthy subjects. The MR image acquisition protocol for each subject includes T1, ', 'last_accessed_3': '2019-07-17 07:11:32.687932', 'insert_user': 'christian', 'user_role_3': '', 'project_invs': '', 'secondary_id': 'IXI dataset', 'name': 'IXI dataset', 'pi': '', 'id': 'ixi'}, {'project_access': 'public', 'quarantine_status': 'active', 'project_access_img': '/@WEBAPPimages/globe.gif', 'insert_date': '2014-09-23 14:11:09.0', 'description': "Data for a set of 53 subjects in a cross-sectional Parkinson's disease (PD) study. 
The dataset contains diffusion-weighted images (DWI) of 27 PD patients and 26 ", 'last_accessed_3': '2019-07-17 04:19:50.086868', 'insert_user': 'christian', 'user_role_3': '', 'project_invs': '', 'secondary_id': "Parkinson's DTI", 'name': "High-quality diffusion-weighted imaging of Parkinson's disease", 'pi': '', 'id': 'parktdi'}, {'project_access': 'protected', 'quarantine_status': 'active', 'project_access_img': '/@WEBAPPimages/key.gif', 'insert_date': '2013-06-26 07:45:45.0', 'description': '', 'last_accessed_3': '', 'insert_user': '', 'user_role_3': '', 'project_invs': '', 'secondary_id': 'PING Study', 'name': 'Pediatric Imaging, Neurocognition, and Genetics (PING) Study', 'pi': '', 'id': 'PING'}, {'project_access': 'public', 'quarantine_status': 'active', 'project_access_img': '/@WEBAPPimages/globe.gif', 'insert_date': '2012-02-21 12:34:10.0', 'description': 'Version 1.1 of the CANDI Share Schizophrenia Bulletin 2008 data.', 'last_accessed_3': '2019-07-16 17:35:51.02245', 'insert_user': '', 'user_role_3': '', 'project_invs': '', 'secondary_id': 'Schiz Bull 2008', 'name': 'CANDI Share: Schizophrenia Bulletin 2008', 'pi': '', 'id': 'cs_schizbull08'}, {'project_access': 'public', 'quarantine_status': 'active', 'project_access_img': '/@WEBAPPimages/globe.gif', 'insert_date': '2014-08-08 13:21:40.0', 'description': 'Study Forrest rev00\n3.\n\n\nhttp://studyforrest.org/\n\n\nSupported by BMBF 01GQ1112 and NSF 1129855.', 'last_accessed_3': '2019-07-15 19:35:14.139816', 'insert_user': 'christian', 'user_role_3': '', 'project_invs': '', 'secondary_id': 'studyforrest rev003', 'name': 'Study Forrest rev003', 'pi': '', 'id': 'studyforrest_rev003'}], 'xdat_user_id': '3', 'totalRecords': '17', 'title': 'Projects'}} ```
bpoldrack commented 4 years ago

@yarikoptic Hehe. I just blindly c&p'ed here, but didn't even look at it.

Actually, it seems to be a python version issue:

```
(Pdb) print(j.keys())
dict_keys(['resultset'])
(Pdb) type(j.keys())
<class 'dict_keys'>
$ python --version
Python 3.7.3
```

yarikoptic commented 4 years ago

hm, I am confused -- I cut/pasted that thing into ipython and there assert j.keys() == ['resultset'] passes. What is j.keys() for you?

yarikoptic commented 4 years ago

ah, right -- python version! coolio, will push that fix
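
For reference, the Python 2 vs Python 3 difference and a version-agnostic check (a sketch; not necessarily the exact fix that got pushed):

```python
j = {'resultset': {}}  # stand-in for the parsed JSON response

# Python 2: dict.keys() returns a list, so this comparison is True.
# Python 3: dict.keys() returns a dict_keys view, which never equals a list.
is_expected = j.keys() == ['resultset']

# Version-agnostic alternatives:
assert set(j) == {'resultset'}
assert list(j.keys()) == ['resultset']
```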

bpoldrack commented 4 years ago

Thx!

Trying to use it for real - will report.

bpoldrack commented 4 years ago

I'm somewhat confused by the parameters. If I get it right, the pipeline takes url, dataset and project_access. I didn't look into the last one, but judging from the hierarchy created by the test, dataset actually is the project, right? If so, can we name it that way? Not only is it confusing because a "dataset" is what I'm putting that information into, but also because it's xnat.py - so let's use "XNAT speak", no?

yarikoptic commented 4 years ago

Yep, sounds reasonable. Try pushing your changes

bpoldrack commented 4 years ago

Ok, I'll assemble those little things and then try to push.

Testing this will take at least till tomorrow - no need to block Travis several times for those tiny changes.

bpoldrack commented 4 years ago

Another Python 3 issue, I guess. I crawled a NITRC superdataset and then tried to crawl its subdataset studyforrest_rev003, resulting in:

[ERROR ] ...>downloaders.base:603,145,618,554 Failed to fetch https://www.nitrc.org/ir/data/experiments/NITRC_IR_E07478/scans/T1/resources/54906/files/highres001_dicominfo.txt: cannot use a string pattern on a bytes-like object [base.py:_fetch:548,base.py:_verify_download:343,http.py:check_for_auth_failure:209,re.py:search:183]

Post mortem: ``` TypeError Traceback (most recent call last) /tmp/test-xnat/lib/python3.7/site-packages/datalad/downloaders/base.py in _fetch(self=HTTPDownloader(authenticator=<>, credential=<>), url='https://www.nitrc.org/ir/data/experiments/NITRC_...T1/resources/54906/files/highres001_dicominfo.txt', cache=False, size=10000, allow_redirects=True, decode=False) 547 --> 548 self._verify_download(url, downloaded_size, target_size, None, content=content) self._verify_download = >, credential=<>)> url = 'https://www.nitrc.org/ir/data/experiments/NITRC_IR_E07478/scans/T1/resources/54906/files/highres001_dicominfo.txt' downloaded_size = 515 target_size = 515 content = b'Manufacturer: Philips Medical Systems\nModel name: Achieva\nRepetition time (ms): 12.0571002960205\nEcho time[0] (ms): \nEcho time[1] (ms): 5.797\nInversion time (ms): \nFlip angle: 8\nNumber of averages: 1\nSlice thickness (mm): 0.7\nSlice spacing (mm): 0.7\nImage columns: 384\nImage rows: 384\nNumber of frames: \nPhase encoding direction: ROW\nVoxel size x (mm): 0.666667\nVoxel size y (mm): 0.666667\nNumber of volumes: 1\nNumber of slices: 274\nNumber of files: 274\nNumber of frames: 0\nSlice duration (ms) : 0\nOrientation: sag\n' 549 /tmp/test-xnat/lib/python3.7/site-packages/datalad/downloaders/base.py in _verify_download(self=HTTPDownloader(authenticator=<>, credential=<>), url='https://www.nitrc.org/ir/data/experiments/NITRC_...T1/resources/54906/files/highres001_dicominfo.txt', downloaded_size=515, target_size=515, file_=None, content=b'Manufacturer: Philips Medical Systems\nModel n...: 0\nSlice duration (ms) : 0\nOrientation: sag\n') 342 self.authenticator.check_for_auth_failure( --> 343 content, "Download of the url %s has failed: " % url) content = b'Manufacturer: Philips Medical Systems\nModel name: Achieva\nRepetition time (ms): 12.0571002960205\nEcho time[0] (ms): \nEcho time[1] (ms): 5.797\nInversion time (ms): \nFlip angle: 8\nNumber of averages: 1\nSlice thickness (mm): 0.7\nSlice spacing (mm): 0.7\nImage columns: 384\nImage rows: 384\nNumber of frames: \nPhase encoding direction: ROW\nVoxel size x (mm): 0.666667\nVoxel size y (mm): 0.666667\nNumber of volumes: 1\nNumber of slices: 274\nNumber of files: 274\nNumber of frames: 0\nSlice duration (ms) : 0\nOrientation: sag\n' url = 'https://www.nitrc.org/ir/data/experiments/NITRC_IR_E07478/scans/T1/resources/54906/files/highres001_dicominfo.txt' 344 /tmp/test-xnat/lib/python3.7/site-packages/datalad/downloaders/http.py in check_for_auth_failure(self=HTTPAuthAuthenticator(failure_re=<<['(Member log...success_re=None, url='https://www.nitrc.org/ir/'), content=b'Manufacturer: Philips Medical Systems\nModel n...: 0\nSlice duration (ms) : 0\nOrientation: sag\n', err_prefix='Download of the url https://www.nitrc.org/ir/dat...54906/files/highres001_dicominfo.txt has failed: ') 208 for failure_re in self.failure_re: --> 209 if re.search(failure_re, content): global re.search = failure_re = '(Member login|Please use the Login Name and Password)' content = b'Manufacturer: Philips Medical Systems\nModel name: Achieva\nRepetition time (ms): 12.0571002960205\nEcho time[0] (ms): \nEcho time[1] (ms): 5.797\nInversion time (ms): \nFlip angle: 8\nNumber of averages: 1\nSlice thickness (mm): 0.7\nSlice spacing (mm): 0.7\nImage columns: 384\nImage rows: 384\nNumber of frames: \nPhase encoding direction: ROW\nVoxel size x (mm): 0.666667\nVoxel size y (mm): 0.666667\nNumber of volumes: 1\nNumber of slices: 274\nNumber of files: 274\nNumber of frames: 0\nSlice duration (ms) : 
0\nOrientation: sag\n' 210 raise AccessDeniedError( /tmp/test-xnat/lib/python3.7/re.py in search(pattern='(Member login|Please use the Login Name and Password)', string=b'Manufacturer: Philips Medical Systems\nModel n...: 0\nSlice duration (ms) : 0\nOrientation: sag\n', flags=0) 182 a Match object, or None if no match was found.""" --> 183 return _compile(pattern, flags).search(string) global _compile = pattern = '(Member login|Please use the Login Name and Password)' flags.search = undefined string = b'Manufacturer: Philips Medical Systems\nModel name: Achieva\nRepetition time (ms): 12.0571002960205\nEcho time[0] (ms): \nEcho time[1] (ms): 5.797\nInversion time (ms): \nFlip angle: 8\nNumber of averages: 1\nSlice thickness (mm): 0.7\nSlice spacing (mm): 0.7\nImage columns: 384\nImage rows: 384\nNumber of frames: \nPhase encoding direction: ROW\nVoxel size x (mm): 0.666667\nVoxel size y (mm): 0.666667\nNumber of volumes: 1\nNumber of slices: 274\nNumber of files: 274\nNumber of frames: 0\nSlice duration (ms) : 0\nOrientation: sag\n' 184 TypeError: cannot use a string pattern on a bytes-like object During handling of the above exception, another exception occurred: DownloadError Traceback (most recent call last) /tmp/test-xnat/bin/datalad in 6 # 7 from datalad.cmdline.main import main ----> 8 main() global main = /tmp/test-xnat/lib/python3.7/site-packages/datalad/cmdline/main.py in main(args=['/tmp/test-xnat/bin/datalad', '--idbg', 'crawl']) 492 from datalad.interface.base import Interface 493 Interface._interrupted_exit_code = None --> 494 ret = cmdlineargs.func(cmdlineargs) ret = None cmdlineargs.func = > cmdlineargs = Namespace(_=False, cfg_overrides=None, change_path=None, chdir=None, common_debug=False, common_idebug=True, common_on_failure=None, common_output_format='default', common_proc_post=None, common_proc_pre=None, common_report_status=None, common_report_type=None, func=>, help=None, is_pipeline=False, is_template=False, log_level='warning', logger=, path=None, pbs_runner=None, recursive=False, subparser=ArgumentParserDisableAbbrev(prog='datalad crawl', usage=None, description='Crawl online resource to create or update a dataset.\n\nExamples:\n\n $ datalad crawl # within a dataset having .datalad/crawl/crawl.cfg', formatter_class=, conflict_handler='error', add_help=False)) 495 else: 496 # otherwise - guard and only log the summary. 
Postmortem is not /tmp/test-xnat/lib/python3.7/site-packages/datalad/interface/base.py in call_from_parser(cls=, args=Namespace(_=False, cfg_overrides=None, change_pa...ter'>, conflict_handler='error', add_help=False))) 624 kwargs['proc_post'] = args.common_proc_post 625 try: --> 626 ret = cls.__call__(**kwargs) ret = undefined cls.__call__ = kwargs = {'path': None, 'is_pipeline': False, 'is_template': False, 'recursive': False, 'chdir': None} 627 if inspect.isgenerator(ret): 628 ret = list(ret) ~/work/hacking/datalad-crawler/datalad_crawler/crawl.py in __call__(path='./.datalad/crawl/crawl.cfg', is_pipeline=False, is_template=False, recursive=False, chdir=None) 128 # we could gracefully reset back 129 try: --> 130 output = run_pipeline(pipeline, stats=stats) output = undefined run_pipeline = pipeline = [.get_project_info at 0x7f37d0a52840>, [.get_files at 0x7f37d0a52488>, ], ._finalize at 0x7f37d0a4a730>] stats = ActivityStats(files=4, urls=4, add_annex=3, downloaded=3, downloaded_size=14205911) 131 except Exception as exc: 132 # TODO: config.crawl.failure = full-reset | last-good-master ~/work/hacking/datalad-crawler/datalad_crawler/pipeline.py in run_pipeline(*args=([.get_project_info>, [.get_files>, ], ._finalize>],), **kwargs={'stats': ActivityStats(files=4, urls=4, add_annex=3, downloaded=3, downloaded_size=14205911)}) 112 items, a `[{}]` will be provided as output 113 """ --> 114 output = list(xrun_pipeline(*args, **kwargs)) output = undefined global list = undefined global xrun_pipeline = args = ([.get_project_info at 0x7f37d0a52840>, [.get_files at 0x7f37d0a52488>, ], ._finalize at 0x7f37d0a4a730>],) kwargs = {'stats': ActivityStats(files=4, urls=4, add_annex=3, downloaded=3, downloaded_size=14205911)} 115 if output: 116 if 'datalad_stats' in output[-1]: ~/work/hacking/datalad-crawler/datalad_crawler/pipeline.py in xrun_pipeline(pipeline=[.get_project_info>, [.get_files>, ], ._finalize>], data={'datalad_stats': ActivityStats(files=4, urls=4, add_annex=3, downloaded=3, downloaded_size=14205911)}, stats=ActivityStats(files=4, urls=4, add_annex=3, downloaded=3, downloaded_size=14205911), reset=True) 192 data_in = data_to_process.pop(0) 193 try: --> 194 for idata_out, data_out in enumerate(xrun_pipeline_steps(pipeline, data_in, output=output_sub)): idata_out = undefined data_out = None global enumerate = undefined global xrun_pipeline_steps = pipeline = [.get_project_info at 0x7f37d0a52840>, [.get_files at 0x7f37d0a52488>, ], ._finalize at 0x7f37d0a4a730>] data_in = {'datalad_stats': ActivityStats(files=4, urls=4, add_annex=3, downloaded=3, downloaded_size=14205911)} output = 'input' output_sub = 'input' 195 if log_level <= 3: 196 # provide details of what keys got changed ~/work/hacking/datalad-crawler/datalad_crawler/pipeline.py in xrun_pipeline_steps(pipeline=[.get_project_info>, [.get_files>, ], ._finalize>], data={'datalad_stats': ActivityStats(files=4, urls=4, add_annex=3, downloaded=3, downloaded_size=14205911)}, output='input') 284 lgr.log(7, " pass %d keys into tail with %d elements", len(data_), len(pipeline_tail)) 285 lgr.log(5, " passed keys: %s", data_.keys()) --> 286 for data_out in xrun_pipeline_steps(pipeline_tail, data_, output=output): data_out = None global xrun_pipeline_steps = pipeline_tail = [[.get_files at 0x7f37d0a52488>, ], ._finalize at 0x7f37d0a4a730>] data_ = {'datalad_stats': ActivityStats(files=4, urls=4, add_annex=3, downloaded=3, downloaded_size=14205911)} output = 'input' 287 if log_level <= 3: 288 # provide details of what keys got changed 
~/work/hacking/datalad-crawler/datalad_crawler/pipeline.py in xrun_pipeline_steps(pipeline=[[.get_files>, ], ._finalize>], data={'datalad_stats': ActivityStats(files=4, urls=4, add_annex=3, downloaded=3, downloaded_size=14205911)}, output='input') 268 data_out = None 269 if data_in_to_loop: --> 270 for data_ in data_in_to_loop: data_ = undefined data_in_to_loop = 271 if prev_stats is not None: 272 new_stats = data_.get('datalad_stats', None) ~/work/hacking/datalad-crawler/datalad_crawler/pipeline.py in xrun_pipeline(pipeline=[.get_files>, ], data={'datalad_stats': ActivityStats(files=4, urls=4, add_annex=3, downloaded=3, downloaded_size=14205911)}, stats=None, reset=False) 192 data_in = data_to_process.pop(0) 193 try: --> 194 for idata_out, data_out in enumerate(xrun_pipeline_steps(pipeline, data_in, output=output_sub)): idata_out = undefined data_out = None global enumerate = undefined global xrun_pipeline_steps = pipeline = [.get_files at 0x7f37d0a52488>, ] data_in = {'datalad_stats': ActivityStats(files=4, urls=4, add_annex=3, downloaded=3, downloaded_size=14205911)} output = 'input' output_sub = 'input' 195 if log_level <= 3: 196 # provide details of what keys got changed ~/work/hacking/datalad-crawler/datalad_crawler/pipeline.py in xrun_pipeline_steps(pipeline=[.get_files>, ], data={'datalad_stats': ActivityStats(files=4, urls=4, add_annex=3, downloaded=3, downloaded_size=14205911)}, output='input') 284 lgr.log(7, " pass %d keys into tail with %d elements", len(data_), len(pipeline_tail)) 285 lgr.log(5, " passed keys: %s", data_.keys()) --> 286 for data_out in xrun_pipeline_steps(pipeline_tail, data_, output=output): data_out = None global xrun_pipeline_steps = pipeline_tail = [] data_ = {'datalad_stats': ActivityStats(files=4, urls=4, add_annex=3, downloaded=3, downloaded_size=14205911), 'url': 'https://www.nitrc.org/ir/data/experiments/NITRC_IR_E07478/scans/T1/resources/54906/files/highres001_dicominfo.txt', 'path': 'NITRC_IR_E07478/scans/T1/resources/54906/files/highres001_dicominfo.txt', 'name': 'sub001-highres001_dicominfo.txt'} output = 'input' 287 if log_level <= 3: 288 # provide details of what keys got changed ~/work/hacking/datalad-crawler/datalad_crawler/pipeline.py in xrun_pipeline_steps(pipeline=[], data={'datalad_stats': ActivityStats(files=4, urls=4, add_annex=3, downloaded=3, downloaded_size=14205911), 'name': 'sub001-highres001_dicominfo.txt', 'path': 'NITRC_IR_E07478/scans/T1/resources/54906/files/highres001_dicominfo.txt', 'url': 'https://www.nitrc.org/ir/data/experiments/NITRC_...T1/resources/54906/files/highres001_dicominfo.txt'}, output='input') 268 data_out = None 269 if data_in_to_loop: --> 270 for data_ in data_in_to_loop: data_ = undefined data_in_to_loop = 271 if prev_stats is not None: 272 new_stats = data_.get('datalad_stats', None) ~/work/hacking/datalad-crawler/datalad_crawler/nodes/annex.py in __call__(self=, data={'datalad_stats': ActivityStats(files=4, urls=4, add_annex=3, downloaded=3, downloaded_size=14205911), 'name': 'sub001-highres001_dicominfo.txt', 'path': 'NITRC_IR_E07478/scans/T1/resources/54906/files/highres001_dicominfo.txt', 'url': 'https://www.nitrc.org/ir/data/experiments/NITRC_...T1/resources/54906/files/highres001_dicominfo.txt'}) 414 if url: 415 try: --> 416 url_status = self._get_url_status(data, url) url_status = None self._get_url_status = > data = {'datalad_stats': ActivityStats(files=4, urls=4, add_annex=3, downloaded=3, downloaded_size=14205911), 'url': 
'https://www.nitrc.org/ir/data/experiments/NITRC_IR_E07478/scans/T1/resources/54906/files/highres001_dicominfo.txt', 'path': 'NITRC_IR_E07478/scans/T1/resources/54906/files/highres001_dicominfo.txt', 'name': 'sub001-highres001_dicominfo.txt'} url = 'https://www.nitrc.org/ir/data/experiments/NITRC_IR_E07478/scans/T1/resources/54906/files/highres001_dicominfo.txt' 417 except Exception: 418 if self.skip_problematic: ~/work/hacking/datalad-crawler/datalad_crawler/nodes/annex.py in _get_url_status(self=, data={'datalad_stats': ActivityStats(files=4, urls=4, add_annex=3, downloaded=3, downloaded_size=14205911), 'name': 'sub001-highres001_dicominfo.txt', 'path': 'NITRC_IR_E07478/scans/T1/resources/54906/files/highres001_dicominfo.txt', 'url': 'https://www.nitrc.org/ir/data/experiments/NITRC_...T1/resources/54906/files/highres001_dicominfo.txt'}, url='https://www.nitrc.org/ir/data/experiments/NITRC_...T1/resources/54906/files/highres001_dicominfo.txt') 399 else: 400 downloader = self._providers.get_provider(url).get_downloader(url) --> 401 return downloader.get_status(url) downloader.get_status = >, credential=<>)> url = 'https://www.nitrc.org/ir/data/experiments/NITRC_IR_E07478/scans/T1/resources/54906/files/highres001_dicominfo.txt' 402 403 def __call__(self, data): # filename=None, get_disposition_filename=False): /tmp/test-xnat/lib/python3.7/site-packages/datalad/downloaders/base.py in get_status(self=HTTPDownloader(authenticator=<>, credential=<>), url='https://www.nitrc.org/ir/data/experiments/NITRC_...T1/resources/54906/files/highres001_dicominfo.txt', old_status=None, **kwargs={}) 601 If URL is not reachable, None would be returned 602 """ --> 603 return self.access(self._get_status, url, old_status=old_status, **kwargs) self.access = >, credential=<>)> self._get_status = >, credential=<>)> url = 'https://www.nitrc.org/ir/data/experiments/NITRC_IR_E07478/scans/T1/resources/54906/files/highres001_dicominfo.txt' old_status = None kwargs = {} 604 605 # TODO: borrow from itself... ? 
/tmp/test-xnat/lib/python3.7/site-packages/datalad/downloaders/base.py in access(self=HTTPDownloader(authenticator=<>, credential=<>), method=>, credential=<>)>, url='https://www.nitrc.org/ir/data/experiments/NITRC_...T1/resources/54906/files/highres001_dicominfo.txt', allow_old_session=True, **kwargs={'old_status': None}) 143 assert(not used_old_session) 144 lgr.log(5, "Calling out into %s for %s" % (method, url)) --> 145 result = method(url, **kwargs) result = undefined method = >, credential=<>)> url = 'https://www.nitrc.org/ir/data/experiments/NITRC_IR_E07478/scans/T1/resources/54906/files/highres001_dicominfo.txt' kwargs = {'old_status': None} 146 # assume success if no puke etc 147 break /tmp/test-xnat/lib/python3.7/site-packages/datalad/downloaders/base.py in _get_status(self=HTTPDownloader(authenticator=<>, credential=<>), url='https://www.nitrc.org/ir/data/experiments/NITRC_...T1/resources/54906/files/highres001_dicominfo.txt', old_status=None) 616 else 0 617 --> 618 _, headers = self._fetch(url, cache=False, size=download_size, decode=False) _ = undefined headers = undefined self._fetch = >, credential=<>)> url = 'https://www.nitrc.org/ir/data/experiments/NITRC_IR_E07478/scans/T1/resources/54906/files/highres001_dicominfo.txt' global cache = undefined global size = undefined download_size = 10000 global decode = undefined 619 620 # extract from headers information to depict the status of the url /tmp/test-xnat/lib/python3.7/site-packages/datalad/downloaders/base.py in _fetch(self=HTTPDownloader(authenticator=<>, credential=<>), url='https://www.nitrc.org/ir/data/experiments/NITRC_...T1/resources/54906/files/highres001_dicominfo.txt', cache=False, size=10000, allow_redirects=True, decode=False) 553 e_str = exc_str(e, limit=5) 554 lgr.error("Failed to fetch {url}: {e_str}".format(**locals())) --> 555 raise DownloadError(exc_str(e, limit=8)) # for now global DownloadError = global exc_str = e = undefined global limit = undefined 556 557 if cache: DownloadError: cannot use a string pattern on a bytes-like object [base.py:_fetch:548,base.py:_verify_download:343,http.py:check_for_auth_failure:209,re.py:search:183] > /tmp/test-xnat/lib/python3.7/site-packages/datalad/downloaders/base.py(555)_fetch() 553 e_str = exc_str(e, limit=5) 554 lgr.error("Failed to fetch {url}: {e_str}".format(**locals())) --> 555 raise DownloadError(exc_str(e, limit=8)) # for now 556 557 if cache: ipdb> content b'Manufacturer: Philips Medical Systems\nModel name: Achieva\nRepetition time (ms): 12.0571002960205\nEcho time[0] (ms): \nEcho time[1] (ms): 5.797\nInversion time (ms): \nFlip angle: 8\nNumber of averages: 1\nSlice thickness (mm): 0.7\nSlice spacing (mm): 0.7\nImage columns: 384\nImage rows: 384\nNumber of frames: \nPhase encoding direction: ROW\nVoxel size x (mm): 0.666667\nVoxel size y (mm): 0.666667\nNumber of volumes: 1\nNumber of slices: 274\nNumber of files: 274\nNumber of frames: 0\nSlice duration (ms) : 0\nOrientation: sag\n' ```

Trying to track it through the downloaders to figure where exactly to fix it.

bpoldrack commented 4 years ago

Not yet sure what the ultimate issue is here and whether it needs to be fixed in datalad rather than in the crawler or this PR, but leaving a trace here in case it takes longer:

Apparently, sometimes (I didn't figure out a pattern yet) BaseDownloader._fetch() is called with an explicit decode=False from within _get_status(). I don't know yet how it sometimes manages to not go through _get_status(), nor am I clear on why you would want the decoding to be disabled, but that's where I am ATM.

bpoldrack commented 4 years ago

Okay. I think I got it. This particular expression of the decoding challenges is coming from:

Author: Dave MacFarlane <dave.macfarlane@mcin.ca>
Date:   Thu Mar 7 15:38:54 2019 -0500

    Do not decode fetched content when retrieving status

    When annexificator is used in a datalad-crawler pipeline, it attempts
    to call get_status on the URL to see if the file has changed. get_status
    attempts to use _fetch, which attempts to decode the content under Python3.

    This is invalid for content that isn't utf-8, and get_status was only
    trying to verify that it wasn't receiving a login or failure page or similar.
    _fetch, confusingly, converts the UnicodeFormatError into a generic DownloadError
    (despite the fact that the download succeeded, but conversion to a string failed.)

    This teaches _fetch a "decode" parameter that toggles whether or not it should
    even attempt to decode. In the case of get_status, the content is discarded,
    so the decode serves no purpose.

    NB. fetch (without the _)'s documentation claims that it doesn't decode, but
    it calls _fetch (which currently always decodes.) This can new parameter can
    eventually be used to fix that, but currently trying to fix "fetch" results in
    a json.loads error from a mysterious part in the code if fetch is updated to
    not decode as documented.

So, while I agree that binary_type might not necessarily mean that it's UTF-8, it still needs to be decoded, since it will be passed on and the regex that checks that response content for an error message will then fail with a TypeError. I might be missing something, but I think we should switch back to not having that decode parameter and instead use decode(downloader_session.response.encoding) rather than a mere decode(), which defaults to UTF-8.
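
A minimal sketch of that idea, assuming downloader_session.response behaves like a requests.Response with an encoding attribute (names are taken from the discussion, not verified against the datalad code):

```python
# Sketch: decode with the encoding announced by the server, falling back to
# UTF-8, and leave the content alone if it cannot be decoded at all.
def decode_content(content, response):
    if not isinstance(content, bytes):
        return content  # already a str
    encoding = getattr(response, 'encoding', None) or 'utf-8'
    try:
        return content.decode(encoding)
    except (LookupError, UnicodeDecodeError):
        # binary (or mis-declared) payload -- keep the raw bytes
        return content
```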

What do you think, @yarikoptic ? I'm not entirely sure what issue @driusan was attacking with that commit. So, would like to see whether the two of you see any issue with my approach.

kyleam commented 4 years ago

@bpoldrack:

I'm not entirely sure what issue @driusan was attacking with that commit. So, would like to see whether the two of you see any issue with my approach.

A specific example of the failure is shown here: https://github.com/datalad/datalad/pull/3210#pullrequestreview-212374896

bpoldrack commented 4 years ago

@kyleam : Looking through it. FWIW: my suggested fix still isn't entirely correct. While it allows me to download quite a few files, it still crashes later on when running into content that doesn't come with any encoding information in the response and doesn't look like it should be decoded (binary data?). This suggests to me that the check for an error message simply doesn't work that way (since a regex on binary content is indeed pointless). I'm currently thinking we need both: a better condition on when to decode wrt what encoding, and in addition trying the re.search for an authentication error only if we actually have a string at that point.
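
Something along these lines, i.e. only run the failure regexes when there is actually text to match (a sketch; the real check_for_auth_failure signature and the AccessDeniedError it raises may differ):

```python
import re

def check_for_auth_failure(content, failure_regexes, err_prefix=""):
    # Sketch: best-effort decode of bytes; if that fails, treat the payload
    # as binary data that cannot be a login/error page and skip the check.
    if isinstance(content, bytes):
        try:
            content = content.decode('utf-8')
        except UnicodeDecodeError:
            return
    for failure_re in failure_regexes:
        if re.search(failure_re, content):
            # the real code raises AccessDeniedError at this point
            raise RuntimeError(err_prefix + "authentication failure detected")
```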

kyleam commented 4 years ago

I'm currently thinking we need both: a better condition on when to decode wrt what encoding, and in addition trying the re.search for an authentication error only if we actually have a string at that point.

I haven't quite (re-)digested the situation, but, yeah, this is likely to need a substantial rework. As touched on in datalad/datalad#3210, fetch() and fetch_() claim to be working with bytes, but some spots like _verify_download() treat the content as a decoded string, so it smells like things need a clearer separation between pre- and post-decoding.

driusan commented 4 years ago

If memory serves (it's been a while), the root of the problem seemed to be that it was decoding the response in order to run the failure regex defined in the config file against it. Decoding an error page makes sense, because it's usually HTML, but if there is no error and the response is binary data, the decoding doesn't make sense and can throw an exception. The code path that I added the parameter for was only incidentally triggering that code when trying to check whether the file had changed.

bpoldrack commented 4 years ago

@driusan : Thanks! Yes, that's pretty much what I ran into as well (despite your patch). So the takeaway is: the patch addressed the same problem, and while it fixes it for some cases it doesn't for others (the same holds for the fix I initially suggested).

Ok. So, several things:

  1. The decoding issue seems unrelated to this PR itself. I will move the discussion to datalad core.
  2. The NITRC superdataset builds fine for me with this pipeline.
  3. Crawling the studyforrest subdataset then fails in two ways: with the status quo it fails to decode a response that could indeed be decoded and that even declares itself as UTF-8; with my "fix" (replacing the decode parameter with a check whether downloader_session.response comes with a filled encoding field and, if so, using that to decode) it doesn't fail at this point, but later on it tries to run that verification regex on something that appears to be binary data (so - correctly not decoded).
  4. However, the "cs_schizbull08" subdataset succeeds with my fix under Python 3.
  5. Under Python 2 I haven't run into an issue so far.

Conclusion: the pipeline itself seems fine. The decoding issue is ... well, not a single issue, and it happens only for particular responses. It's neither the kind of file being downloaded nor the server per se failing to properly announce an encoding.

driusan commented 4 years ago

@bpoldrack The problem I was trying to attack was the exact same error message you're getting, but from a different pipeline. I didn't add the decode; I just added the if statement to be able to conditionally avoid it in code paths where the decoding doesn't make sense, and the git blame is probably just pointing you to my commit because of the indentation change.

yarikoptic commented 4 years ago

ok, I think this PR would be more useful and usable merged than lingering here as a PR. Any additional changes could follow in new PRs.