SAFEHR-data / PIXL

PIXL Image eXtraction Laboratory
Apache License 2.0
8 stars 0 forks source link

Batch checking of exported data #430

Closed p-j-smith closed 3 months ago

p-j-smith commented 3 months ago

Description

Fixes #397: Add batch querying of existing images and batch upload of new images

Based on #429 (thanks @stefpiatek!), opened a new PR so @stefpiatek can review

Type of change

Please delete options accordingly to the description.

Suggested Checklist

mxochicale commented 3 months ago

Tested on some real data and I think we might need to make sure that we make the input dataframe distinct on the joining columns. Looks like there is more than one row with the same accession number and mrn, but a different datetime. Does that make sense in terms of what we'd need? Worth adding a test case that represents this

Indeed, example_messages_df generates this kind of message (with different timestamps):

   mrn accession_number  study_date  procedure_occurrence_id    project_name      extract_generated_timestamp
0  mrn              123  2023-01-01                        1  i-am-a-project 2024-07-25 09:41:58.727406+00:00
1  mrn              234  2023-01-01                        1  i-am-a-project 2024-07-25 09:41:58.727419+00:00
2  mrn              345  2023-01-01                        1  i-am-a-project 2024-07-25 09:41:58.727422+00:00 

So, should we test against equal vs different dataframes? It would be nice that you share three lines of anonymised real data to help us to create such test case! Maybe it is just the same as above with the same timestamps?

stefpiatek commented 3 months ago

Ah you'd want something like this in a test. Same MRN and accession number but different timestamp. I think @p-j-smith might be having a look at this so worth chatting

   mrn accession_number  study_date  procedure_occurrence_id    project_name      extract_generated_timestamp
0  mrn              123  2023-01-01                        1  i-am-a-project 2024-07-25 09:41:58.727406+00:00
1  mrn              123  2023-01-01                        1  i-am-a-project 2024-07-25 09:41:58.727419+00:00
codecov[bot] commented 3 months ago

Codecov Report

Attention: Patch coverage is 98.02632% with 3 lines in your changes missing coverage. Please review.

Project coverage is 84.01%. Comparing base (282f64b) to head (78bc732).

Files Patch % Lines
cli/src/pixl_cli/_io.py 95.34% 2 Missing :warning:
cli/src/pixl_cli/_message_processing.py 95.00% 1 Missing :warning:
Additional details and impacted files ```diff @@ Coverage Diff @@ ## main #430 +/- ## ========================================== + Coverage 80.09% 84.01% +3.92% ========================================== Files 75 83 +8 Lines 3245 3528 +283 ========================================== + Hits 2599 2964 +365 + Misses 646 564 -82 ```

:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.

p-j-smith commented 3 months ago

Tested on some real data and I think we might need to make sure that we make the input dataframe distinct on the joining columns. Looks like there is more than one row with the same accession number and mrn, but a different datetime.

Thanks for catching this! We now drop duplicates when loading the file, and we've added a test case for it. We also needed to fix how we were filtering existing images using df.isin - if you're comparing two dataframes then the indices need to match

p-j-smith commented 3 months ago

Ah so nice, went down from 15 minutes to 4 seconds ❤️

Oh nice, that's pretty impressive