While #309 isn't 100% solved, I wanted to start here and treat future enhancements as separate issues so we can start cleaning up long-standing branches.
This PR provides a way to extract one of the five zipped parts of the DailyMed full prescription SPL data. You can change the part number to download each of the five. We need to create an enhancement issue for automating parts 1-5 as sequential tasks. Honestly, the reason I haven't prioritized this is that my hard drive space is horribly low and I can only download one part at a time anyway.
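For the follow-up enhancement, here is a minimal sketch of how the part number could be parameterized so that parts 1-5 run one after another. The base URL and file-name pattern are assumptions for illustration and should be checked against the DailyMed download page; `part_url` and `all_part_urls` are hypothetical helpers, not code from this PR.

```python
# Assumed location and naming of the DailyMed prescription SPL parts.
# Verify both against the DailyMed download page before relying on them.
BASE_URL = "https://dailymed-data.nlm.nih.gov/public-release-files"
TOTAL_PARTS = 5


def part_url(part: int) -> str:
    """Build the download URL for a single part (1 through TOTAL_PARTS)."""
    if not 1 <= part <= TOTAL_PARTS:
        raise ValueError(f"part must be between 1 and {TOTAL_PARTS}, got {part}")
    return f"{BASE_URL}/dm_spl_release_human_rx_part{part}.zip"


def all_part_urls() -> list[str]:
    """Return the URLs in order, so a caller can process them sequentially."""
    return [part_url(part) for part in range(1, TOTAL_PARTS + 1)]
```

Processing one URL at a time, and deleting each zip after extraction, would also keep peak disk usage close to the size of a single part.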
This will extract all the files and run XSLT against them to do the following:
Map each SPL to metadata and lists of things contained in the SPL, like image names, NDCs, and components
Try to RegEx match NDCs in image names and compare the matches to the valid NDCs for that SPL
Try to RegEx match NDCs in PRINCIPAL DISPLAY PANEL component sections of the SPL and try to map them to the image files contained within that same component
OCR - not fully built out or tested yet
Barcode scanning - not fully built out or tested yet
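The image-name matching step above can be sketched roughly as follows. This is an illustration under stated assumptions, not the code in this PR: the regular expression, the digit-only normalization, and the helper names `digits_only` and `match_image_to_ndcs` are all hypothetical, and the NDC values in the usage note are made up.

```python
import re

# Assumed candidate pattern: a hyphenated NDC (4-4-2, 5-3-2, or 5-4-1)
# or a bare run of 10 or 11 digits, not embedded in a longer digit run.
NDC_CANDIDATE = re.compile(r"(?<!\d)(\d{4,5}-\d{3,4}-\d{1,2}|\d{10,11})(?!\d)")


def digits_only(ndc: str) -> str:
    """Strip hyphens and any other non-digit characters."""
    return re.sub(r"\D", "", ndc)


def match_image_to_ndcs(image_name: str, valid_ndcs: list[str]) -> list[str]:
    """Return the valid NDCs for an SPL that appear in an image file name.

    Comparison is done on digits only, so differently hyphenated NDCs with
    the same digits are treated as the same value (a known limitation).
    """
    valid_by_digits = {digits_only(ndc): ndc for ndc in valid_ndcs}
    matches = []
    for candidate in NDC_CANDIDATE.findall(image_name):
        ndc = valid_by_digits.get(digits_only(candidate))
        if ndc is not None and ndc not in matches:
            matches.append(ndc)
    return matches
```

For example, `match_image_to_ndcs("label_1234567890.jpg", ["12345-678-90"])` returns `["12345-678-90"]`, while a name with no NDC-like digit run returns an empty list. Checking candidates against the SPL's own NDC list is what keeps arbitrary digit runs in file names from producing false mappings.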
Rationale
This gets us to around 50k NDC->image mappings. There is still work to be done to understand the true denominator of label images that are out there, so that we can measure the gap between where we are now and 100% and work out what it would take to close it.
Resolves #309
Tests