coderxio / sagerx

Open drug data pipelines curated by pharmacists.
https://coderx.io/sagerx
Other
49 stars, 13 forks

DailyMed NDC->Image File - Initial Work #318

Open jrlegrand opened 2 months ago

jrlegrand commented 2 months ago

Resolves #309

Explanation

While #309 isn't 100% solved, I wanted to start here and treat future enhancements as separate issues, so we can start cleaning up long-standing branches.

This PR provides a way to extract one of the five parts of the zipped DailyMed full prescription SPL data. You can change the part number to download any of the five. We need to create an enhancement issue for automating parts 1-5 with sequential tasks. Honestly, the reason I haven't prioritized this is that my hard drive space is horribly low and I can only download one part at a time anyway.
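For the follow-up enhancement, here is a rough sketch of how parts 1-5 could be handled one at a time so that only a single zip sits on disk at any moment. It is not code from this PR: `process_parts`, the `download` and `extract` callables, and the `ZIP_NAME_TEMPLATE` file-name pattern are all hypothetical placeholders, and the real DailyMed file names should be checked before relying on the template.

```python
from pathlib import Path
from typing import Callable, Iterable, List

# Assumed file-name pattern for the five zip parts (hypothetical; verify
# against the actual DailyMed release file names).
ZIP_NAME_TEMPLATE = "dm_spl_release_human_rx_part{part}.zip"


def process_parts(
    download: Callable[[int, Path], None],
    extract: Callable[[Path], None],
    work_dir: str,
    parts: Iterable[int] = range(1, 6),
) -> List[int]:
    """Download, extract, and delete each part in sequence.

    `download(part, zip_path)` and `extract(zip_path)` are supplied by the
    caller; this function only controls ordering and cleanup.
    """
    done = []
    for part in parts:
        zip_path = Path(work_dir) / ZIP_NAME_TEMPLATE.format(part=part)
        download(part, zip_path)
        try:
            extract(zip_path)
        finally:
            # Remove the zip even if extraction fails, to free disk space
            # before the next part is downloaded.
            zip_path.unlink(missing_ok=True)
        done.append(part)
    return done
```

In an orchestrated pipeline, each loop iteration would likely become its own sequential task instead of a plain Python loop.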

This will extract all the files and run XSLT against them to do the following:

  1. Map each SPL to metadata and lists of things contained in the SPL, like image names, NDCs, and components
  2. Try to RegEx match NDCs in image names and compare the matches to the valid NDCs for that SPL
  3. Try to RegEx match NDCs in PRINCIPAL DISPLAY PANEL component sections of the SPL and try to map them to the image files contained within that same component
  4. OCR - not fully built out or tested yet
  5. Barcode scanning - not fully built out or tested yet
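To illustrate step 2, here is a minimal sketch of matching NDC-like digit runs in an image file name and keeping only those that are valid NDCs for the same SPL. It is an illustration under assumptions, not the code in this PR: `NDC_PATTERN`, `strip_hyphens`, and `match_image_to_ndcs` are hypothetical names, the example NDCs and file names are made up, and the sketch assumes the valid NDCs and the image names use the same digit count (no 10-to-11-digit padding).

```python
import re
from typing import Iterable, List

# NDC-like token: labeler (4-5 digits), product (3-4), package (1-2),
# with optional hyphens, not embedded in a longer run of digits.
NDC_PATTERN = re.compile(r"(?<!\d)(\d{4,5})-?(\d{3,4})-?(\d{1,2})(?!\d)")


def strip_hyphens(ndc: str) -> str:
    """Reduce an NDC to its digits so hyphenation differences are ignored."""
    return ndc.replace("-", "")


def match_image_to_ndcs(image_name: str, valid_ndcs: Iterable[str]) -> List[str]:
    """Return the valid NDCs whose digits appear in the image file name."""
    by_digits = {strip_hyphens(ndc): ndc for ndc in valid_ndcs}
    matches = []
    for m in NDC_PATTERN.finditer(image_name):
        candidate = "".join(m.groups())
        ndc = by_digits.get(candidate)
        # Keep only candidates that are valid for this SPL, without duplicates.
        if ndc is not None and ndc not in matches:
            matches.append(ndc)
    return matches


valid = ["0002-3227-30", "0002-3228-30"]
print(match_image_to_ndcs("0002-3227-30-label.jpg", valid))
print(match_image_to_ndcs("carton.jpg", valid))
```

Comparing against the SPL's own NDC list is what filters out digit runs that merely look like NDCs, such as lot numbers or dates in file names.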

Rationale

This gets us to around 50k NDC->image mappings. There is still work to be done to understand the true denominator of label images that are out there, so that we can measure the gap between where we are now and what it would take to get to 100%.

Tests