Closed: cmadjar closed this 3 years ago
Merging #102 (5fb0adf) into master (d4d94c3) will decrease coverage by 1.57%. The diff coverage is 29.57%.
```
@@            Coverage Diff             @@
##           master     #102      +/-   ##
==========================================
- Coverage   81.37%   79.79%   -1.58%
==========================================
  Files          57       60       +3
  Lines        4644     4792     +148
==========================================
+ Hits         3779     3824      +45
- Misses        865      968     +103
```
Impacted Files | Coverage Δ |
---|---|
datalad_crawler/pipelines/loris_bids_export.py | 25.00% <25.00%> (ø) |
datalad_crawler/pipelines/loris_data_releases.py | 31.11% <31.11%> (ø) |
datalad_crawler/pipelines/loris.py | 34.14% <34.14%> (ø) |
datalad_crawler/pipelines/gh.py | 12.82% <0.00%> (+0.32%) :arrow_up: |
datalad_crawler/pipelines/tests/test_openfmri.py | 89.28% <0.00%> (+0.79%) :arrow_up: |
Continue to review the full report at Codecov.
Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update d4d94c3...5fb0adf.
Oooops. Very sorry, I meant to open this PR to the CONP-PCNO fork of datalad-crawler...
I was thinking of sending you an email to ask you if you would be interested in adding those crawlers to your code. Well, now you know my evil plan, haha ;).
Anyway, let me know if you would be interested. If not, no worries, we'll keep those in the CONP-PCNO fork.
Thank you!
Seeing the code and the kaggle logger being used -- is this work incorporating the earlier #67 and #13? (sorry, I missed it originally)
@yarikoptic Thank you! I am definitely happy to improve the code :-).
The code here is indeed based on #13. I did not see the #67 PR. I was told the code in #13 was used for PREVENT-AD, so I reused it for the other LORIS instances of CONP datasets. I will check with @mathdugre to see the difference between the crawlers.
The datasets listed in the descriptions are all open (except the PREVENT-AD registered ones, which are open only to PIs). However, none of the datasets are small, unfortunately... Were you thinking of manual testing or automated testing? I could ask around to see what is possible.
Thank you!
> Were you thinking of manual testing or automated testing? I could ask around to see what is possible.
Automated would be the ultimate goal. If no suitable smallish dataset is out there, I guess there could be some `include` or `exclude` option to limit the crawl to some subset of files to make the test run succinct.
For another LORIS study, I have to write another crawler that would crawl multiple LORIS API endpoints (so multiple URLs). I think that instead of creating one crawler for each API endpoint, I could modify the `loris.py` crawler so that users can provide a comma-separated list of endpoints to be crawled from the LORIS instance.
I am a little bit unfamiliar with the return statement of a pipeline, so I am not sure how to structure it.
Let's say I have:
```python
apibase     # base URL for the LORIS API
endpoint_1  # endpoint number 1 of the API (complete URL = join(apibase, endpoint_1))
endpoint_2  # endpoint number 2 of the API (complete URL = join(apibase, endpoint_2))
lorisapi_1  # LorisAPIExtractor(apibase, annex, endpoint_1) -- extractor for the first URL
lorisapi_2  # LorisAPIExtractor(apibase, annex, endpoint_2) -- extractor for the second URL
```
*each of the provided endpoints returns a dictionary that includes a list of files to crawl, which will be extracted by the LorisAPIExtractor function
After looking at other templates, my first instinct would be something like that:
```python
return [
    [
        crawl_url(join(apibase, endpoint_1)),
        lorisapi_1,
        annex,
        [
            crawl_url(join(apibase, endpoint_2)),
            lorisapi_2,
            annex,
        ],
    ],
    annex.finalize(),
    lorisapi_1.finalize(),
    lorisapi_2.finalize(),
]
But I don't trust my first instinct ;)
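One way to sketch the comma-separated-endpoints idea is to build one sibling sub-pipeline per endpoint rather than nesting them, and finalize once at the end. This is only an illustration of the list structure, not the actual `loris.py` code: `build_loris_pipeline`, `crawl_url`, `make_extractor`, and `annex` are hypothetical stand-ins passed in as callables so the sketch stays self-contained.

```python
from posixpath import join  # URL paths always use forward slashes

def build_loris_pipeline(apibase, endpoints, crawl_url, make_extractor, annex):
    """Return a pipeline with one [crawl, extract, annex] sub-pipeline per
    endpoint in the comma-separated `endpoints` string, then finalize."""
    subpipelines = []
    for endpoint in endpoints.split(","):
        url = join(apibase, endpoint.strip())
        subpipelines.append([
            crawl_url(url),        # fetch the endpoint's file listing
            make_extractor(url),   # extract file entries from the response
            annex,                 # hand each file to the annex node
        ])
    return subpipelines + [annex.finalize()]
```

The advantage over nesting is that each endpoint is crawled independently, so adding an endpoint is just one more element in the list.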
Just a little note to tell you to discard my last comment. I figured it out :).
Ultimately, I think all LORIS PRs will be closed and I will send a new one with improved crawlers. Might just take some time though. Will definitely keep you posted.
My slowness was rewarded! ;-) No rush on my end, but I might come in handy to review an earlier version of the RF.
To be continued on #103
This pulls the pipelines used to generate several of the CONP datasets hosted in LORIS:
- `datalad_crawler/pipelines/loris.py` pipeline was used to crawl:
- `datalad_crawler/pipelines/loris_bids_export.py` pipeline was used to crawl:
- `datalad_crawler/pipelines/loris_data_releases.py` pipeline was used to crawl: