coderxio / sagerx

Open drug data pipelines curated by pharmacists.
https://coderx.io/sagerx

DailyMed XML processing for NDC -> image #309

Closed: jrlegrand closed this issue 1 month ago

jrlegrand commented 4 months ago

Problem Statement

We need to extract an NDC -> image file name linkage from the DailyMed XML.

Criteria for Success

Data mart for NDC -> image

Additional Information

I looked through DailyMed's SPL stylesheet.

Probably the bigger question is how we tackle the final piece of consuming the focused XML sections (gleaned / transformed / compiled using XSLT templates) from each pathway.
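As a rough illustration of that pathway, here is a minimal sketch of applying an XSLT template with lxml to glean a focused XML document out of an SPL file. The stylesheet and the SPL snippet are hypothetical, simplified stand-ins (real SPL uses the HL7 v3 namespace); they are only meant to show the mechanics, not DailyMed's actual files or our actual templates.

```python
# Sketch: transform a (simplified, non-namespaced) SPL-like document into a
# smaller, focused XML document using an XSLT template. Both inputs are
# hypothetical stand-ins for illustration only.
from lxml import etree

XSLT = b"""<?xml version="1.0"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="xml" indent="yes"/>
  <xsl:template match="/document">
    <images>
      <!-- Pull just the image file names out of observationMedia sections -->
      <xsl:for-each select=".//observationMedia/value/reference">
        <image><xsl:value-of select="@value"/></image>
      </xsl:for-each>
    </images>
  </xsl:template>
</xsl:stylesheet>"""

SPL = b"""<document>
  <component><observationMedia>
    <value mediaType="image/jpeg"><reference value="label-front.jpg"/></value>
  </observationMedia></component>
</document>"""

transform = etree.XSLT(etree.fromstring(XSLT))
focused = transform(etree.fromstring(SPL))
print(etree.tostring(focused, pretty_print=True).decode())
```

The focused output document is what a downstream consumer (SQL or pandas) would then pick apart.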

If we need to OCR images, does that mean we need to unzip all the zip files to get the images out? Not sure how much storage space that would take up, but I assume it would be pretty large. Would it make more sense to OCR a hosted image instead of a local one? We could get the DailyMed image URL from the XML and point the OCR tool at that URL instead of a local file. There are also a lot of images that have nothing to do with labels (e.g., chemical structures or administration instruction diagrams) that we don't need to bother unzipping or OCR-ing.
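A sketch of the hosted-URL idea: pull the image references out of a (simplified, non-namespaced) SPL snippet, skip obvious non-label images, and build hosted URLs rather than unzipping anything. Note the `image.cfm?name=...&setid=...` URL pattern is an assumption based on how DailyMed's site appears to serve images, not a documented API, and the keyword filter is a crude hypothetical heuristic.

```python
# Sketch: extract image file names from a simplified SPL-like snippet and
# construct hosted DailyMed URLs instead of unzipping local files.
# ASSUMPTIONS: the image.cfm URL pattern and the keyword-based filter are
# both guesses for illustration, not verified behavior.
import xml.etree.ElementTree as ET

SPL = """<document>
  <setId root="abc-123"/>
  <observationMedia ID="mm1">
    <value mediaType="image/jpeg"><reference value="label-front.jpg"/></value>
  </observationMedia>
  <observationMedia ID="mm2">
    <value mediaType="image/jpeg"><reference value="chem-structure.jpg"/></value>
  </observationMedia>
</document>"""

root = ET.fromstring(SPL)
set_id = root.find("setId").get("root")

# Crude filter: skip file names that look like structures/diagrams, not labels.
SKIP = ("chem", "structure", "diagram", "figure")

urls = []
for ref in root.iter("reference"):
    name = ref.get("value")
    if any(tok in name.lower() for tok in SKIP):
        continue
    urls.append(
        f"https://dailymed.nlm.nih.gov/dailymed/image.cfm?name={name}&setid={set_id}"
    )

print(urls)
```

An OCR tool that accepts URLs (or a small download step) could then consume that list without the full archive ever being unzipped.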

If we leave everything zipped (as we do currently), we could spit out a smaller, more focused XML document that Python/pandas can pick up and parse pretty easily with XPath to create the columns of a dataframe. I am doing the equivalent of this currently in my branch (https://github.com/coderxio/sagerx/tree/jrlegrand/dailymed), but using SQL: the smaller XML document is stored in an xml column in Postgres, and dbt models then use SQL to do essentially what pandas would do to convert it into columns in one or more tables.
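The pandas side of that could look something like the sketch below, using `pandas.read_xml` with an XPath expression. The focused document's shape (`<row><ndc/><image/></row>`) is a hypothetical stand-in for whatever the XSLT step actually emits.

```python
# Sketch: parse a focused XML document into dataframe columns with XPath.
# The element names here are assumptions about the focused document's shape.
import io
import pandas as pd

focused = """<ndc_images>
  <row><ndc>0002-3227-30</ndc><image>label-front.jpg</image></row>
  <row><ndc>0002-3227-30</ndc><image>label-back.jpg</image></row>
</ndc_images>"""

# parser="etree" keeps this stdlib-only; it supports simple XPath like .//row
df = pd.read_xml(io.StringIO(focused), xpath=".//row", parser="etree")
print(df)
```

This is essentially the same shredding the dbt/SQL models do against the xml column in Postgres, just done in memory.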

I don't know what the performance or memory usage limitations would be for either of these options, but I assume it might be better to go the pandas route for memory reasons... not sure. I did run into an error (#238) when originally trying to load ALL SPLs, but things have changed since then, which may make that error moot.
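On the memory question, one mitigation on the Python side would be streaming the XML with `iterparse` instead of loading whole documents, clearing elements as they are consumed. This is a generic sketch (the element name and file layout are assumptions), not a claim about where #238 actually came from.

```python
# Sketch: stream image references out of an XML file with iterparse so memory
# stays flat even for large inputs. Tag names are hypothetical stand-ins.
import pathlib
import tempfile
import xml.etree.ElementTree as ET

def iter_image_refs(xml_path):
    """Yield reference/@value attributes without holding the whole tree."""
    for event, elem in ET.iterparse(xml_path, events=("end",)):
        if elem.tag == "reference":
            yield elem.get("value")
        elem.clear()  # free each element once it has been processed

# Demo on a tiny stand-in file:
sample = (
    "<document><observationMedia><value>"
    "<reference value='label-front.jpg'/>"
    "</value></observationMedia></document>"
)
p = pathlib.Path(tempfile.mkdtemp()) / "spl.xml"
p.write_text(sample)
print(list(iter_image_refs(p)))
```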