Problem Statement
Need to extract a linkage from NDC -> image file name from DailyMed XML.
Criteria for Success
Data mart for NDC -> image
Additional Information
I looked through DailyMed's SPL stylesheet. I think there are some neat tricks we can learn about XML from it, and my main takeaway is that if we can really understand how DailyMed crafts the XML template for their website, that's the closest thing we have to a source of truth.
Specifically for the observationMedia elements, I think we are doing mostly what DailyMed is doing, though there is some specialized handling on their end that may or may not be important (see the sketch below).
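For reference, here's a minimal sketch of that basic extraction with lxml, assuming the standard HL7 v3 SPL namespace and a hypothetical local SPL file name:

```python
from lxml import etree

NS = {"v3": "urn:hl7-org:v3"}

# Parse one SPL document (file name is hypothetical).
tree = etree.parse("spl_document.xml")

# Each observationMedia element carries an ID and a <value><reference value="..."/>
# pointing at an image file name inside the SPL zip.
for media in tree.iterfind(".//v3:observationMedia", NS):
    ref = media.find("v3:value/v3:reference", NS)
    if ref is not None:
        print(media.get("ID"), ref.get("value"))
```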
The bigger question is probably how we tackle the final piece: consuming the focused XML sections (gleaned/transformed/compiled using XSLT templates) from each pathway.
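For what it's worth, applying an XSLT template in Python is straightforward with lxml; a sketch, assuming hypothetical file names for a focused stylesheet and a source SPL document:

```python
from lxml import etree

# The stylesheet here stands in for a "focused" template distilled from
# DailyMed's SPL XSLT, not their actual file.
transform = etree.XSLT(etree.parse("focused_media.xsl"))

# Apply it to a full SPL document to get the smaller, focused XML.
focused = transform(etree.parse("spl_document.xml"))
print(etree.tostring(focused, pretty_print=True).decode())
```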
If we need to OCR images, does that mean we need to unzip all the zip files to get the images out? I'm not sure how much storage space that would take, but I assume it would be pretty large. Would it make more sense to OCR a hosted image instead of a local one? We could get the DailyMed image URL from the XML and point the OCR tool at that URL instead of a local file. There are also a lot of images that have nothing to do with labels (e.g., chemical structures or administration instruction diagrams) that we don't need to bother unzipping and/or OCR-ing. Both options are sketched below.
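To make the tradeoff concrete, here's a sketch of both options, assuming hypothetical archive, member, and URL names (the real image file name and URL would come from the XML):

```python
import io
import zipfile

import pytesseract
import requests
from PIL import Image

# Option 1: read a single image straight out of the zip in memory,
# so we never extract the whole archive to disk.
with zipfile.ZipFile("dm_spl_release.zip") as zf:
    data = zf.read("example-label.jpg")
text = pytesseract.image_to_string(Image.open(io.BytesIO(data)))

# Option 2: OCR a hosted image fetched by URL instead of a local file.
image_url = "https://dailymed.example/images/example-label.jpg"  # placeholder
resp = requests.get(image_url, timeout=30)
resp.raise_for_status()
text = pytesseract.image_to_string(Image.open(io.BytesIO(resp.content)))
```

Note that Option 1 means we don't actually have to bulk-unzip anything; zipfile can read a single member into memory on demand.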
If we leave everything zipped (as we do currently), we could spit out a smaller, more focused XML document that Python/pandas can pick up and parse pretty easily with XPath to create the columns in a dataframe. I am doing the equivalent of this currently in my branch (https://github.com/coderxio/sagerx/tree/jrlegrand/dailymed), but using SQL: the smaller XML document is stored in an xml column in Postgres, and dbt models then use SQL to do essentially what pandas would do, converting the smaller XML document into columns in one or more tables.
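On the pandas side, that parse step could be as simple as the following sketch, where the element name and resulting columns are hypothetical stand-ins for whatever the focused XML actually contains:

```python
import pandas as pd

# Each <media> element becomes one dataframe row; the element name and
# columns are assumptions about the focused XML layout, not the real schema.
df = pd.read_xml("focused_media.xml", xpath="//media")
print(df.head())  # e.g. columns like ndc, image_file
```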
Using pandas would mean these tables are materialized.
Using dbt means we can decide whether we want them materialized in the sagerx_lake schema (this might be a weird use of dbt; maybe they would end up as materialized staging tables in sagerx_dev) or kept as normal staging views in sagerx_dev.
I don't know what the performance or memory limitations of either option would be, but I assume it might be better to go the pandas route for memory reasons; I'm not sure. I did run into an error (#238) when originally trying to load ALL SPLs, but things have changed since then, which may make that error moot.