ebmdatalab / open-nhs-hospital-use-data

For analysis of https://opendata.nhsbsa.net/dataset/secondary-care-medicines-data

Validating SCMD + DDD data quality with profiling tools #19

Open robinyjpark opened 2 years ago

robinyjpark commented 2 years ago

I started exploring the merged SCMD and DDD data to track the completeness of each field, flag products with a missing DDD, and count how many hospitals appear in the data each month. The notebook can be viewed here.

I used ProfileReport from pandas_profiling to automatically generate summaries of each column to check for completeness. I chose this tool as it seems to allow automatic generation of expectations that can be fed into Great Expectations to test whether the field values are sensible (documentation; more investigation to be done).

The notebook also contains a distinct list of products with missing DDD (6,756 products), and a table and interactive plot displaying the number of hospitals per month.
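
For anyone who wants to reproduce the three checks without opening the notebook, here is a minimal pandas sketch. It is not the notebook's code: the toy rows and the column names (`year_month`, `ods_code`, `vmp_snomed_code`, `vmp_product_name`, `ddd`) are assumptions standing in for the real merged table.

```python
import pandas as pd

# Toy stand-in for the merged SCMD + DDD table.
# Column names and values are illustrative assumptions, not the real schema.
merged = pd.DataFrame(
    {
        "year_month": ["2022-01", "2022-01", "2022-01", "2022-02", "2022-02"],
        "ods_code": ["RAA", "RBB", "RAA", "RAA", "RAA"],
        "vmp_snomed_code": ["111", "111", "222", "222", "333"],
        "vmp_product_name": ["Drug A", "Drug A", "Drug B", "Drug B", "Drug C"],
        "ddd": [1.5, 1.5, None, None, None],
    }
)

# 1. Field completeness: share of non-null values per column (1.0 = fully populated).
completeness = merged.notna().mean()

# 2. Distinct products appearing on rows with no DDD value.
missing_ddd = merged.loc[
    merged["ddd"].isna(), ["vmp_snomed_code", "vmp_product_name"]
].drop_duplicates()

# 3. Number of distinct hospitals (ODS codes) reporting in each month.
hospitals_per_month = merged.groupby("year_month")["ods_code"].nunique()

print(completeness)
print(missing_ddd)
print(hospitals_per_month)
```

On the real data, `len(missing_ddd)` would be the figure to compare against the count of products with missing DDD reported above.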

Jongmassey commented 2 years ago

Nice work.

A deeper dive into some of these missing DDDs has shown some curious examples...

Only 2 of 14 injectable Vancomycin VMPs (identified by ATC == J01XA01; see the WHOCC page for this code) have a DDD.

The ones with a DDD figure are

And those without are

In the DDD table, only the former two have a BNF code (5010700). Not sure if this is related.

The missing DDD for the latter set feels like a gap in the TRUD data. Thoughts @brianmackenna @richiecroker @orlamac?
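
A rough sketch of how this kind of coverage check could be run against the DDD table, in case it is useful for other ATC codes. The column names (`vmp_id`, `atc`, `bnf_code`, `ddd`) and the rows are made up for illustration; only the ATC code J01XA01 and BNF code 5010700 come from the discussion above.

```python
import pandas as pd

# Toy DDD table. Column names and rows are hypothetical; "X00XX00" is a
# placeholder for any other ATC code.
ddd_table = pd.DataFrame(
    {
        "vmp_id": ["v1", "v2", "v3", "v4", "v5"],
        "atc": ["J01XA01", "J01XA01", "J01XA01", "J01XA01", "X00XX00"],
        "bnf_code": ["5010700", None, None, "5010700", None],
        "ddd": [2.0, None, None, 2.0, 1.0],
    }
)

# Restrict to the ATC code of interest.
vanc = ddd_table[ddd_table["atc"] == "J01XA01"]

# How many of these VMPs carry a DDD value?
n_total = vanc["vmp_id"].nunique()
n_with_ddd = int(vanc["ddd"].notna().sum())
print(f"{n_with_ddd} of {n_total} VMPs have a DDD")

# Cross-tabulate DDD presence against BNF code presence to see whether
# the two gaps coincide.
has_ddd = vanc["ddd"].notna().map({True: "has DDD", False: "no DDD"})
has_bnf = vanc["bnf_code"].notna().map({True: "has BNF code", False: "no BNF code"})
overlap = pd.crosstab(has_ddd, has_bnf)
print(overlap)
```

If the off-diagonal cells of `overlap` are empty on the real table, that would support the idea that the missing DDDs and missing BNF codes are related, though it would not show which causes which.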

Jongmassey commented 2 years ago

Further to my previous comment, I've done a slightly deeper dive into this issue along with a nice UpSet plot of the column population rates in the DDD dataset: https://github.com/ebmdatalab/open-nhs-hospital-use-data/blob/ddd_data_quality/notebooks/antibiotics/ddd_quality.ipynb
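
For context, an UpSet plot of column population rates is built from counts of which combinations of columns are populated on each row. A minimal sketch of that tabulation in plain pandas follows; it is not the linked notebook's code, and the toy columns (`atc`, `bnf_code`, `ddd`) and rows are assumptions.

```python
import pandas as pd

# Toy DDD dataset; column names and values are illustrative only.
ddd_table = pd.DataFrame(
    {
        "atc": ["J01XA01", "J01XA01", "J01XA01", "J01XA01", None],
        "bnf_code": ["5010700", None, None, None, None],
        "ddd": [2.0, None, None, 1.0, None],
    }
)

# True where a cell is populated, False where it is null.
populated = ddd_table.notna()


def pattern_label(row: pd.Series) -> str:
    """Name the set of populated columns on one row, e.g. 'atc + ddd'."""
    names = row.index[row.to_numpy()]
    return " + ".join(names) or "(none)"


# Count rows per population pattern; each pattern is one bar of an UpSet plot.
patterns = populated.apply(pattern_label, axis=1).value_counts()
print(patterns)
```

Each entry in `patterns` corresponds to one intersection in the plot, so the table can be used to sanity-check the figure or to pull out the rows behind a surprising bar.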