datalad / datalad-crawler

DataLad extension for tracking web resources as datasets
http://datalad.org

LORIS crawlers used to crawl several of the CONP datasets #102

Closed · cmadjar closed this 3 years ago

cmadjar commented 3 years ago

This pulls the pipelines used to generate several of the CONP datasets hosted in LORIS:

codecov[bot] commented 3 years ago

Codecov Report

Merging #102 (5fb0adf) into master (d4d94c3) will decrease coverage by 1.58%. The diff coverage is 29.57%.


@@            Coverage Diff             @@
##           master     #102      +/-   ##
==========================================
- Coverage   81.37%   79.79%   -1.58%     
==========================================
  Files          57       60       +3     
  Lines        4644     4792     +148     
==========================================
+ Hits         3779     3824      +45     
- Misses        865      968     +103     
Impacted Files                                      Coverage Δ
datalad_crawler/pipelines/loris_bids_export.py      25.00% <25.00%> (ø)
datalad_crawler/pipelines/loris_data_releases.py    31.11% <31.11%> (ø)
datalad_crawler/pipelines/loris.py                  34.14% <34.14%> (ø)
datalad_crawler/pipelines/gh.py                     12.82% <0.00%> (+0.32%) ↑
datalad_crawler/pipelines/tests/test_openfmri.py    89.28% <0.00%> (+0.79%) ↑

Continue to review full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data. Last update d4d94c3...5fb0adf.

cmadjar commented 3 years ago

Oooops. Very sorry, I meant to open this PR against the CONP-PCNO fork of datalad-crawler...

I was thinking of sending you an email to ask you if you would be interested in adding those crawlers to your code. Well, now you know my evil plan, haha ;).

Anyway, let me know if you would be interested. If not, no worries, we'll keep those in the CONP-PCNO fork.

Thank you!

yarikoptic commented 3 years ago

Seeing the code and the kaggle logger being used -- is this work incorporating the earlier #67 and #13? (Sorry I missed those originally.)

cmadjar commented 3 years ago

@yarikoptic Thank you! I am definitely happy to improve the code :-).

The code here is indeed based on #13. I did not see the #67 PR. I was told the code in #13 was used for PREVENT-AD, so I reused it for the other LORIS instances hosting CONP datasets. I will check with @mathdugre to see the differences between the crawlers.

The datasets listed in the descriptions are all open (except the PREVENT-AD registered ones, which are open only to PIs). Unfortunately, none of the datasets are small... Were you thinking of manual testing or automated testing? I could ask around to see what is possible.

Thank you!

yarikoptic commented 3 years ago

Were you thinking of manual testing or automated testing? I could ask around to see what is possible.

Automated would be the ultimate goal. If no suitable smallish dataset is out there, I guess there could be some include or exclude option to limit the crawl to a subset of files and keep the test run succinct.
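
Since datalad-crawler nodes are just callables that take a data dict and yield data dicts, such a filter could be a plain node, along the lines of this rough sketch (the 'filename' key and where the node would sit in the LORIS pipeline are assumptions on my side):

import re

def include_only(pattern):
    # Plain pipeline node: keep items whose (assumed) 'filename' field
    # matches the regex, silently drop everything else.
    def node(data):
        if re.search(pattern, data.get('filename', '')):
            yield data
    return node

Dropped in right before the annex node, something like include_only('sub-01.*') should keep a test crawl down to a handful of files.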

cmadjar commented 3 years ago

For another LORIS study, I have to write a crawler that crawls multiple LORIS API endpoints (so multiple URLs). Instead of creating one crawler per API endpoint, I think I could modify the loris.py crawler so that users can provide a comma-separated list of endpoints to crawl from the LORIS instance.

I am a little unfamiliar with the return statement of a pipeline, so I am not sure how to code it.
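
As far as I can tell from the existing templates, a pipeline is just a list of nodes (each a generator over data dicts), with a nested list acting as a sub-pipeline; roughly like this minimal sketch (node names come from the standard templates, the Annexificator arguments and the regex are only illustrative):

from datalad_crawler.nodes.annex import Annexificator
from datalad_crawler.nodes.crawl_url import crawl_url
from datalad_crawler.nodes.matches import a_href_match


def pipeline(url):
    annex = Annexificator(create=False)  # assumed configuration
    return [
        crawl_url(url),                         # fetch pages, yield data dicts
        [a_href_match(r'.*\.nii\.gz'), annex],  # sub-pipeline: match links, annex files
        annex.finalize(),                       # commit whatever was annexed
    ]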

Let's say I have:

apibase     # base URL for the LORIS API
endpoint_1  # endpoint number 1 of the API (to have the complete URL, would need to join apibase and endpoint_1)
endpoint_2  # endpoint number 2 of the API (to have the complete URL, would need to join apibase and endpoint_2)
lorisapi_1  # LorisAPIExtractor(apibase, annex, endpoint_1) -- extractor for the first URL
lorisapi_2  # LorisAPIExtractor(apibase, annex, endpoint_2) -- extractor for the second URL

* each provided endpoint returns a dictionary that includes the list of files to crawl, which will be extracted by the LorisAPIExtractor function

Given that, my first instinct would be something like this:

return [
    [
        crawl_url(join(apibase, endpoint_1)),
        lorisapi_1,
        annex,
        [
            crawl_url(join(apibase, endpoint_2)),
            lorisapi_2,
            annex,
        ],
    ],
    annex.finalize(),
    lorisapi_1.finalize(),
    lorisapi_2.finalize(),
]

But I don't trust my first instinct ;)
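
Or maybe flatter, with one sibling sub-pipeline per endpoint and the comma-separated option from above folded in (a sketch only: the Annexificator arguments are a guess, I am assuming LorisAPIExtractor would live in this PR's loris.py, and the extractor finalize() calls just mirror my guess above):

from os.path import join

from datalad_crawler.nodes.annex import Annexificator
from datalad_crawler.nodes.crawl_url import crawl_url
from datalad_crawler.pipelines.loris import LorisAPIExtractor  # assumed location


def pipeline(apibase, endpoints):
    # 'endpoints' is the hypothetical comma-separated option,
    # e.g. endpoints="projects,candidates"
    annex = Annexificator(create=False)  # assumed configuration
    subpipes = []
    extractors = []
    for endpoint in endpoints.split(','):
        endpoint = endpoint.strip()
        extractor = LorisAPIExtractor(apibase, annex, endpoint)
        extractors.append(extractor)
        # a nested list is a sub-pipeline: crawl one endpoint,
        # extract its file list, and feed the shared annex node
        subpipes.append([
            crawl_url(join(apibase, endpoint)),
            extractor,
            annex,
        ])
    return subpipes + [annex.finalize()] + [e.finalize() for e in extractors]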

cmadjar commented 3 years ago

Just a little note to tell you to discard my last comment. I figured it out :).

Ultimately, I think all LORIS PRs will be closed and I will send a new one with improved crawlers. Might just take some time though. Will definitely keep you posted.

yarikoptic commented 3 years ago

My slowness was rewarded! ;-) No rush on my end, but I might come in handy to review an earlier version of the RF (refactoring).

cmadjar commented 3 years ago

To be continued in #103.