Batch checking of exported data

p-j-smith commented 3 months ago

Description

Fixes #397: Add batch querying of existing images and batch upload of new images

Add pixl_cli._io.read_patient_data function that takes either a CSV or parquet files and returns a dataframe of messages
Add pixl_cli._message_processing.messages_from_df to convert the df to a list of messages
Update pixl_cli._message_processing.populate_queue_and_db to take a df of messages and return a list of the messages added
Update pixl_cli._database.filter_exported_or_add_to_db to filter messages in memory using dfs rather than querying db multiple times
Use session.bulk_save_objects(images) to batch insert images
Also add test test_batch_upload to test the batch querying and uploading of data

Based on #429 (thanks @stefpiatek!), opened a new PR so @stefpiatek can review

Type of change

Please delete options accordingly to the description.

[ ] Bug fix (non-breaking change which fixes an issue)
[X] New feature (non-breaking change which adds functionality)
[ ] Breaking change (fix or feature that would cause existing functionality to not work as expected)
[ ] This change requires a documentation update

Suggested Checklist

[x] I have performed a self-review of my own code.
[x] I have made corresponding changes to the documentation.
[x] My changes generate no new warnings.
[x] I have commented my code, particularly in hard-to-understand areas.
[X] I have passed on my local host device. (see further details at the CONTRIBUTING document)
[X] Make sure your branch is up-to-date with main branch. See CONTRIBUTING for a general example to syncronise your branch with the main branch.
[X] I have requested PR review to UCLH-Foundtry/arc-dev
[x] I have adddressed and marked as resolved all the review comments in my PR.
[x] Finally, I have selected squash and merge

mxochicale commented 3 months ago

Tested on some real data and I think we might need to make sure that we make the input dataframe distinct on the joining columns. Looks like there is more than one row with the same accession number and mrn, but a different datetime. Does that make sense in terms of what we'd need? Worth adding a test case that represents this

Indeed, example_messages_df generates this kind of message (with different timestamps):

   mrn accession_number  study_date  procedure_occurrence_id    project_name      extract_generated_timestamp
0  mrn              123  2023-01-01                        1  i-am-a-project 2024-07-25 09:41:58.727406+00:00
1  mrn              234  2023-01-01                        1  i-am-a-project 2024-07-25 09:41:58.727419+00:00
2  mrn              345  2023-01-01                        1  i-am-a-project 2024-07-25 09:41:58.727422+00:00

So, should we test against equal vs different dataframes? It would be nice that you share three lines of anonymised real data to help us to create such test case! Maybe it is just the same as above with the same timestamps?

stefpiatek commented 3 months ago

Ah you'd want something like this in a test. Same MRN and accession number but different timestamp. I think @p-j-smith might be having a look at this so worth chatting

   mrn accession_number  study_date  procedure_occurrence_id    project_name      extract_generated_timestamp
0  mrn              123  2023-01-01                        1  i-am-a-project 2024-07-25 09:41:58.727406+00:00
1  mrn              123  2023-01-01                        1  i-am-a-project 2024-07-25 09:41:58.727419+00:00

codecov[bot] commented 3 months ago

Codecov Report

Attention: Patch coverage is 98.02632% with 3 lines in your changes missing coverage. Please review.

Project coverage is 84.01%. Comparing base (282f64b) to head (78bc732).

Files	Patch %	Lines
cli/src/pixl_cli/_io.py	95.34%	2 Missing :warning:
cli/src/pixl_cli/_message_processing.py	95.00%	1 Missing :warning:

Additional details and impacted files

```diff @@ Coverage Diff @@ ## main #430 +/- ## ========================================== + Coverage 80.09% 84.01% +3.92% ========================================== Files 75 83 +8 Lines 3245 3528 +283 ========================================== + Hits 2599 2964 +365 + Misses 646 564 -82 ```

:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.

p-j-smith commented 3 months ago

Tested on some real data and I think we might need to make sure that we make the input dataframe distinct on the joining columns. Looks like there is more than one row with the same accession number and mrn, but a different datetime.

Thanks for catching this! We now drop duplicates when loading the file, and we've added a test case for it. We also needed to fix how we were filtering existing images using df.isin - if you're comparing two dataframes then the indices need to match

p-j-smith commented 3 months ago

Ah so nice, went down from 15 minutes to 4 seconds ❤️

Oh nice, that's pretty impressive

SAFEHR-data / PIXL