coderxio / sagerx

Open drug data pipelines curated by pharmacists.
https://coderx.io/sagerx
Other
41 stars 11 forks source link

Duplicate excipient rows #296

Closed jrlegrand closed 2 days ago

jrlegrand commented 5 days ago

Problem Statement

MTHSPL substances likely have duplicate names per substance RXCUI - look into combining them. Same thing for products - examples are lip balms with different flavors.

This results in duplicate rows in the mart.

Criteria for Success

I think a query similar to this should result in one row count per row.

select ndc9 , product_rxcui , product_name , inactive_ingredient_unii , inactive_ingredient_unii_display_name , inactive_ingredient_rxcui , count() from sagerx_dev.int_fda_unii_mthspl_products_to_unii_inactive_ingredients group by ndc9 , product_rxcui , product_name , inactive_ingredient_unii , inactive_ingredient_unii_display_name , inactive_ingredient_rxcui order by count() desc

Additional Information

image

PS - not specific to this issue, but it seems like FDA UNII codes get mapped to one another... might need to accomodate for this.

image

jrlegrand commented 5 days ago

Ran this code on my branch and got one count per row.

select
ndc9
, product_rxcui
, product_name
, inactive_ingredient_unii
, fda_unii_display_name
, inactive_ingredient_rxcui
, count(*)
from sagerx_dev.products_to_inactive_ingredients
group by
ndc9
, product_rxcui
, product_name
, inactive_ingredient_unii
, fda_unii_display_name
, inactive_ingredient_rxcui
order by count(*) desc