Closed p-j-smith closed 3 months ago
Tested on some real data and I think we might need to make sure that we make the input dataframe distinct on the joining columns. Looks like there is more than one row with the same `accession number` and `mrn`, but a different datetime. Does that make sense in terms of what we'd need? Worth adding a test case that represents this.
Indeed, `example_messages_df` generates this kind of message (with different timestamps):

| | mrn | accession_number | study_date | procedure_occurrence_id | project_name | extract_generated_timestamp |
|---|---|---|---|---|---|---|
| 0 | mrn | 123 | 2023-01-01 | 1 | i-am-a-project | 2024-07-25 09:41:58.727406+00:00 |
| 1 | mrn | 234 | 2023-01-01 | 1 | i-am-a-project | 2024-07-25 09:41:58.727419+00:00 |
| 2 | mrn | 345 | 2023-01-01 | 1 | i-am-a-project | 2024-07-25 09:41:58.727422+00:00 |

So, should we test against equal vs different dataframes? It would be helpful if you could share three lines of anonymised real data so we can create such a test case. Maybe it is just the same as above, but with the same timestamps?
Ah you'd want something like this in a test. Same MRN and accession number but different timestamp. I think @p-j-smith might be having a look at this so worth chatting

| | mrn | accession_number | study_date | procedure_occurrence_id | project_name | extract_generated_timestamp |
|---|---|---|---|---|---|---|
| 0 | mrn | 123 | 2023-01-01 | 1 | i-am-a-project | 2024-07-25 09:41:58.727406+00:00 |
| 1 | mrn | 123 | 2023-01-01 | 1 | i-am-a-project | 2024-07-25 09:41:58.727419+00:00 |

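
A test along these lines could look like the sketch below. It is only an illustration: the helper name `drop_duplicate_messages`, the choice of `mrn` and `accession_number` as the joining columns, and keeping the earliest timestamp are all assumptions, not what the PR necessarily does.

```python
import pandas as pd

# Assumed joining columns; the PR may use a different set.
JOIN_COLUMNS = ["mrn", "accession_number"]

# Rows 0 and 1 share mrn and accession_number but differ in timestamp.
messages_df = pd.DataFrame(
    {
        "mrn": ["mrn", "mrn", "mrn"],
        "accession_number": ["123", "123", "234"],
        "study_date": ["2023-01-01"] * 3,
        "procedure_occurrence_id": [1, 1, 1],
        "project_name": ["i-am-a-project"] * 3,
        "extract_generated_timestamp": pd.to_datetime(
            [
                "2024-07-25 09:41:58.727406+00:00",
                "2024-07-25 09:41:58.727419+00:00",
                "2024-07-25 09:41:58.727422+00:00",
            ],
            utc=True,
        ),
    }
)


def drop_duplicate_messages(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: keep one row per (mrn, accession_number).

    Sorting first makes the surviving row deterministic (the earliest
    timestamp); which row to keep is an assumption of this sketch.
    """
    return (
        df.sort_values("extract_generated_timestamp")
        .drop_duplicates(subset=JOIN_COLUMNS, keep="first")
        .reset_index(drop=True)
    )


deduplicated = drop_duplicate_messages(messages_df)

# The frame is now distinct on the joining columns.
assert not deduplicated.duplicated(subset=JOIN_COLUMNS).any()
```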
Attention: Patch coverage is 98.02632% with 3 lines in your changes missing coverage. Please review.
Project coverage is 84.01%. Comparing base (282f64b) to head (78bc732).
Files | Patch % | Lines |
---|---|---|
cli/src/pixl_cli/_io.py | 95.34% | 2 Missing :warning: |
cli/src/pixl_cli/_message_processing.py | 95.00% | 1 Missing :warning: |
> Tested on some real data and I think we might need to make sure that we make the input dataframe distinct on the joining columns. Looks like there is more than one row with the same `accession number` and `mrn`, but a different datetime.

Thanks for catching this! We now drop duplicates when loading the file, and we've added a test case for it. We also needed to fix how we were filtering existing images using `df.isin`: if you're comparing two dataframes, then the indices need to match.
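
To illustrate the `isin` point: when `DataFrame.isin` is given another dataframe, pandas aligns the two on index and column labels before comparing, so rows with equal values at different index positions are not reported as matches. The sketch below shows the pitfall, followed by one possible alternative (a left merge with `indicator=True`). The column names and the merge-based approach are assumptions for illustration, not necessarily the fix used in this PR.

```python
import pandas as pd

KEYS = ["mrn", "accession_number"]  # assumed key columns

new = pd.DataFrame({"mrn": ["a", "b", "c"], "accession_number": ["1", "2", "3"]})

# Already-exported records: same keys as rows "a" and "c", but stored in a
# different order, so their index labels do not line up with `new`.
existing = pd.DataFrame({"mrn": ["c", "a"], "accession_number": ["3", "1"]})

# Pitfall: isin(DataFrame) compares cell by cell after aligning on index and
# columns, so no row is flagged as already existing.
naive_mask = new.isin(existing).all(axis="columns")

# One alternative: a left merge on the key columns with an indicator column,
# keeping only the rows that were found in `new` alone.
merged = new.merge(
    existing.drop_duplicates(subset=KEYS), on=KEYS, how="left", indicator=True
)
not_yet_exported = merged.loc[merged["_merge"] == "left_only", KEYS].reset_index(
    drop=True
)
```

Here `naive_mask` is `False` for every row even though two of the three records already exist, whereas `not_yet_exported` contains only the `b` row.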
Ah so nice, went down from 15 minutes to 4 seconds ❤️
Oh nice, that's pretty impressive
Description
Fixes #397: Add batch querying of existing images and batch upload of new images
- `pixl_cli._io.read_patient_data` function that takes either a CSV or parquet file and returns a dataframe of messages
- `pixl_cli._message_processing.messages_from_df` to convert the df to a list of messages
- `pixl_cli._message_processing.populate_queue_and_db` to take a df of messages and return a list of the messages added
- `pixl_cli._database.filter_exported_or_add_to_db` to filter messages in memory using dfs rather than querying db multiple times
- `session.bulk_save_objects(images)` to batch insert images
- `test_batch_upload` to test the batch querying and uploading of data

Based on #429 (thanks @stefpiatek!), opened a new PR so @stefpiatek can review
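
For readers unfamiliar with the batch insert mentioned above, here is a minimal sketch of `Session.bulk_save_objects` against an in-memory SQLite database. The `Image` model below is a hypothetical stand-in with made-up columns, not PIXL's actual model, and the engine URL is only for demonstration.

```python
from sqlalchemy import Column, Integer, String, create_engine
from sqlalchemy.orm import Session, declarative_base

Base = declarative_base()


class Image(Base):
    """Hypothetical stand-in for the real image table."""

    __tablename__ = "image"

    id = Column(Integer, primary_key=True)
    mrn = Column(String, nullable=False)
    accession_number = Column(String, nullable=False)


# In-memory database, for illustration only.
engine = create_engine("sqlite:///:memory:")
Base.metadata.create_all(engine)

rows = [("mrn", "123"), ("mrn", "234"), ("mrn", "345")]
images = [Image(mrn=mrn, accession_number=acc) for mrn, acc in rows]

with Session(engine) as session:
    # One batched insert instead of adding and flushing each object in turn.
    session.bulk_save_objects(images)
    session.commit()
    stored = session.query(Image).count()
```

Note that `bulk_save_objects` skips much of the ORM's per-object bookkeeping, which is where the speed-up comes from; by default the objects passed in do not get their primary keys populated afterwards.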