Closed: chaselgrove closed this pull request 4 years ago.
Merging #9 into master will decrease coverage by 2.95%. The diff coverage is 36.01%.
```diff
@@            Coverage Diff             @@
##           master       #9      +/-   ##
==========================================
- Coverage   82.55%   79.60%   -2.96%
==========================================
  Files          53       55       +2
  Lines        4214     4500     +286
==========================================
+ Hits         3479     3582     +103
- Misses        735      918     +183
```
| Impacted Files | Coverage Δ | |
|---|---|---|
| datalad_crawler/pipelines/xnat.py | 25.41% <25.41%> (ø) | |
| datalad_crawler/pipelines/tests/test_xnat.py | 54.28% <54.28%> (ø) | |
Continue to review the full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data

Powered by Codecov. Last update b1210ee...6adab7b.
@mih if you're interested in trying -- I pushed some tune-ups so there is no (or at least not as) heavy a cost for (re)crawling a single dataset. I am recrawling a few sample datasets ATM and then will try to run it afresh for NITRC-ir and/or XNAT central to see what we would get. For the final one I ideally should extend the pipeline with
and then we could populate some datasets if there is interest in getting access to XNAT central or NITRC-ir. FWIW -- I have not tried non-public setups yet, and some logic for dropping non-empty ones restricts crawling to "public" projects only - not sure if I should just remove that restriction. Maybe @chaselgrove remembers why that limit was there.
Ping @bpoldrack
Looking into it.
Tried to run the tests first and failed. Probably a minor thing: for some reason `test_nitrc_superpipeline` doesn't raise `SKIP: This test requires known credentials for nitrc` while `test_nitrc_pipeline` does. @yarikoptic you might want to have a look:
Edit: Whether or not that is indeed the issue with that particular test, we should have a proper error message instead of just crashing in such cases.
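A minimal sketch of such a guard, for illustration only -- it uses the `get_test_providers` helper mentioned later in this thread, and the import path is an assumption, not the actual test code:

```python
from unittest import SkipTest

# assumed import path for the helper mentioned later in this thread
from datalad.downloaders.tests.utils import get_test_providers

def test_nitrc_superpipeline():
    try:
        # raises if no known credentials are available for nitrc
        get_test_providers('https://www.nitrc.org/ir/')
    except Exception as exc:
        # skip with a clear message instead of crashing
        raise SkipTest(
            "This test requires known credentials for nitrc: %s" % exc)
    # ... actual pipeline test body would follow here
```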
Thanks! Would you mind getting into the debugger at that point and printing `j`?
Here you go, @yarikoptic
@yarikoptic Hehe. I just blindly c&p'ed here, but didn't even look at it.
Actually, it seems to be a Python version issue:

```
(Pdb) print(j.keys())
dict_keys(['resultset'])
(Pdb) type(j.keys())
<class 'dict_keys'>
```

```
$ python --version
Python 3.7.3
```
hm, I am confused -- I cut/pasted that thing in ipython and for it `assert j.keys() == ['resultset']` is good. What is `j.keys()` for you?
ah, right -- python version! coolio, will push that fix
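For reference, a minimal illustration of the incompatibility and version-agnostic alternatives (nothing here is from the actual codebase):

```python
j = {'resultset': []}

# Python 2: j.keys() returns a list, so j.keys() == ['resultset'] is True.
# Python 3: j.keys() returns a dict_keys view, which never compares equal
# to a list, so the same assertion fails.

# version-agnostic alternatives that hold on both:
assert list(j.keys()) == ['resultset']
assert set(j.keys()) == {'resultset'}
```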
Thx!
Trying to use it for real - will report.
I'm somewhat confused by the parameters. If I get it right, the pipeline takes `url`, `dataset` and `project_access`. I didn't look into the last one, but judging from the hierarchy created by the test, `dataset` actually is the project, right? If so, can we name it that way? Not only is it confusing because a "dataset" is what I'm putting that information into, but also because it's `xnat.py` - so let's use "XNAT speak", no?
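Concretely, something like this (hypothetical signature, just to illustrate the rename; not copied from `xnat.py`):

```python
# hypothetical: `dataset` renamed to `project` to match XNAT terminology
def pipeline(url=None, project=None, project_access='public'):
    """Crawl a single XNAT project into the current (DataLad) dataset."""
    ...
```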
Yep, sounds reasonable. Try pushing your changes
Ok, I'll assemble those little things and then try to push.
Testing this will take at least till tomorrow - no need to block Travis several times for those tiny changes.
Another python 3 issue, I guess.
I crawled a nitrc superdataset and then tried to crawl its subdataset `studyforrest_rev003`, resulting in:

```
[ERROR ] ...>downloaders.base:603,145,618,554 Failed to fetch https://www.nitrc.org/ir/data/experiments/NITRC_IR_E07478/scans/T1/resources/54906/files/highres001_dicominfo.txt: cannot use a string pattern on a bytes-like object [base.py:_fetch:548,base.py:_verify_download:343,http.py:check_for_auth_failure:209,re.py:search:183]
```
Trying to track it through the downloaders to figure out where exactly to fix it.
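For reference, the failure reduces to this minimal snippet (illustrative only, outside the downloader code):

```python
import re

content = b'<html>some response body</html>'  # responses arrive as bytes

# re.search('login', content) raises on Python 3:
#   TypeError: cannot use a string pattern on a bytes-like object

# either decode the content first, or use a bytes pattern:
re.search('login', content.decode('utf-8'))
re.search(b'login', content)
```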
Not yet sure what the ultimate issue is here and whether it needs to be fixed in datalad rather than in the crawler or this PR, but leaving a trace here in case it takes longer:

Apparently sometimes (I didn't figure out a pattern yet) `BaseDownloader._fetch()` is called with an explicit `decode=False` from within `_get_status()`. I don't know yet how it sometimes manages to not go through `_get_status()`, nor am I clear on why you would want the decoding to be disabled, but that's where I am ATM.
Okay. I think I got it. This particular expression of the decoding challenges is coming from:
```
Author: Dave MacFarlane <dave.macfarlane@mcin.ca>
Date:   Thu Mar 7 15:38:54 2019 -0500

    Do not decode fetched content when retrieving status

    When annexificator is used in a datalad-crawler pipeline, it attempts
    to call get_status on the URL to see if the file has changed. get_status
    attempts to use _fetch, which attempts to decode the content under Python3.
    This is invalid for content that isn't utf-8, and get_status was only
    trying to verify that it wasn't receiving a login or failure page or similar.

    _fetch, confusingly, converts the UnicodeDecodeError into a generic
    DownloadError (despite the fact that the download succeeded, but conversion
    to a string failed.)

    This teaches _fetch a "decode" parameter that toggles whether or not it
    should even attempt to decode. In the case of get_status, the content is
    discarded, so the decode serves no purpose.

    NB. fetch (without the _)'s documentation claims that it doesn't decode,
    but it calls _fetch (which currently always decodes.) This new parameter
    can eventually be used to fix that, but currently trying to fix "fetch"
    results in a json.loads error from a mysterious part in the code if fetch
    is updated to not decode as documented.
```
So, while I agree that `binary_type` might not necessarily mean that it's UTF-8, it still needs to be decoded, since it will be passed on and the regex checking that response content for an error message will then fail with a `TypeError`.

I might be missing something, but I think we should switch back to not having that `decode` parameter and instead use `decode(downloader_session.response.encoding)` rather than a mere `decode()`, which defaults to UTF-8.
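In code, the suggested change would look roughly like this (a sketch only; it assumes `downloader_session.response` exposes a requests-style `.encoding` attribute, as it appears to above):

```python
def decode_with_announced_encoding(raw_content, response):
    """Decode using the encoding the server announced, falling back to UTF-8.

    Sketch of the suggestion above; `response` is assumed to carry a
    requests-style `.encoding` attribute.
    """
    encoding = getattr(response, 'encoding', None) or 'utf-8'
    return raw_content.decode(encoding)
```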
What do you think, @yarikoptic? I'm not entirely sure what issue @driusan was attacking with that commit, so I would like to see whether the two of you see any issue with my approach.
@bpoldrack:

> I'm not entirely sure what issue @driusan was attacking with that commit, so I would like to see whether the two of you see any issue with my approach.
A specific example of the failure is shown here: https://github.com/datalad/datalad/pull/3210#pullrequestreview-212374896
@kyleam : Looking through it.
FWIW: My suggested fix still isn't entirely correct. While it allows me to download quite a few files, it still crashes later on when running into content that doesn't come with any encoding information in the response and doesn't look like it should be decoded (binary data?). This suggests to me that the check for an error message simply doesn't work that way (since a regex on binary content is indeed pointless).

I'm currently thinking we need both: a better condition on when to decode (and with what encoding), and in addition trying to `re.search` for an authentication error only if we actually have a string at that point.
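A rough sketch of that two-part guard (names and the regex are illustrative, not the crawler's actual API):

```python
import re

# illustrative failure pattern; the real one comes from the provider config
AUTH_FAILURE_RE = re.compile('[Ll]ogin|[Aa]uthentication failed')

def looks_like_auth_failure(content, encoding=None):
    """Search for an auth-failure marker only when we actually have text."""
    if isinstance(content, bytes):
        if not encoding:
            # no announced encoding: likely binary payload, nothing to match
            return False
        content = content.decode(encoding, errors='replace')
    return AUTH_FAILURE_RE.search(content) is not None
```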
> I'm currently thinking we need both: a better condition on when to decode (and with what encoding), and in addition trying to `re.search` for an authentication error only if we actually have a string at that point.
I haven't quite (re-)digested the situation, but, yeah, this is likely to need a substantial rework. As touched on in datalad/datalad#3210, fetch() and fetch_() claim to be working with bytes, but some spots like _verify_download() treat the content as a decoded string, so it smells like things need a clearer separation around pre- and post-decoding.
If memory serves (it's been a while), the root of the problem seemed to be that the code was decoding the response in order to run the failure regex defined in the config file against it. Decoding the error page makes sense, because it's usually HTML, but if there is no error and the response is binary data, the decoding doesn't make sense and can throw an exception. The code path that I added the parameter for was only incidentally triggering that code when trying to check whether the file had changed.
@driusan: Thanks! Yes, that's pretty much what I ran into as well (despite your patch). So the takeaway is: the patch was about the same thing, and while it fixed some cases it doesn't fix others (nor does the fix I initially suggested).
Ok. So, several things:

- With my suggested fix (replacing the `decode` parameter with a check whether `downloader_session.response` comes with a filled encoding field, and if so using that to decode), it doesn't fail at this point, but instead later on it tries to shoot that verification regex onto something that appears to be binary data (so - correctly not decoded).

Conclusion: The pipeline itself seems fine. The decoding issue is ... well, not one issue. And it happens on particular responses only. It's neither the kind of file that is downloaded nor the server per se that fails to properly announce an encoding.
@bpoldrack The problem I was trying to attack was the exact same error message you're getting, but from a different pipeline. I didn't add the decode, I just added the `if` statement to be able to conditionally avoid it in code paths where the decoding doesn't make sense, and the `git blame` is probably just pointing you to my commit because of the indentation change.
ok, I think this PR would be more useful and usable if merged than if it keeps lingering here as a PR. Any additional changes could follow in new PRs.
Tests require `DATALAD_TEST_XNAT` to be set (even with a VCR cassette in place, the `datalad_crawler.pipelines.tests.test_xnat` tests take 4-5 hours).

I still can't get `test_nitrc_pipeline` to run, even after a `datalad download-url` to NITRC. I avoid the test error by using `get_test_providers('https://www.nitrc.org/ir/')`.