ebmdatalab / open-nhs-hospital-use-data

For analysis of https://opendata.nhsbsa.net/dataset/secondary-care-medicines-data

Summary of Data Issues #34

Open HelenCEBM opened 2 years ago

HelenCEBM commented 2 years ago

Contents

Data Description

Secondary care medicines data (SCMD) contains processed pharmacy stock control data in Dictionary of Medicines and Devices (dm+d) standardised format from all NHS Acute, Teaching, Specialist, Mental Health and Community Trusts in England.

More information available in the guidance document

Important notes / Known issues

Quoted from guidance

Relationship to other data sets

Data structure

At source: (images not reproduced)

In BigQuery:

| Field name | Type | Mode |
|---|---|---|
| year_month | DATE | NULLABLE |
| ods_code | STRING | NULLABLE |
| vmp_snomed_code | STRING | NULLABLE |
| vmp_product_name | STRING | NULLABLE |
| unit_of_measure_identifier | STRING | NULLABLE |
| unit_of_measure_name | STRING | NULLABLE |
| total_quanity_in_vmp_unit | FLOAT | NULLABLE |

Data source / consistency

Completeness and Range of Data

https://github.com/ebmdatalab/open-nhs-hospital-use-data/issues/19 -> https://nbviewer.org/github/ebmdatalab/open-nhs-hospital-use-data/blob/data-quality-exploration/notebooks/data_quality/scmd_data_quality_explore.html

Unexpected values

How to Filter the Data (WIP)

How to Summarise the Data

Joins required and difficulties encountered

| Data field | Data to join | Issues |
|---|---|---|
| ods_code | geographic area (STP/region etc) | Need to manually assign any old codes to the appropriate area (no routine source of these codes can be found). There is a mapping file available alongside the data which indicates some trust mergers but not closures. Does not map to wider geographic areas, but could be useful for old trust names. |
| vmp_snomed_code | DMD ingredients | Causes duplication as there may be multiple ingredients per product |
| vmp_snomed_code | DMD route of administration | Causes duplication as there may be multiple routes per product (e.g. an injectable could be subcutaneous, intramuscular and IV). However, grouping to broad categories (e.g. injectable, oral, topical, other) should normally remove most duplication. |
| vmp_snomed_code | DDDs | Duplication due to ingredient specificity; route specificity; DDD units of measurement need to match those in SCMD |
| vmp_snomed_code | drug categorisation: BNF paragraphs | Not all hospital medicines have a BNF code. For products with BNF codes, first map ingredients to their BNF paragraph and then apply to products without BNF codes. Causes duplication. |
| vmp_snomed_code | drug categorisation: WHO AWARE (antibiotics) | Mapping lookup text to VMPs; Duplication / route specificity (some antibiotics are classified differently by route e.g. topical vs IV); completeness |

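
The duplication described for the route-of-administration join can be sketched with a small in-memory example. This is only an illustration under stated assumptions: the table names `scmd` and `vmp_route`, the codes, the route labels and the quantities are all made up, and SQLite stands in for BigQuery. It shows a naive join inflating a total, and grouping routes into broad categories before joining removing the inflation.

```python
import sqlite3

# Hypothetical tables and values, for illustration only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE scmd (vmp_snomed_code TEXT, total_quanity_in_vmp_unit REAL);
    CREATE TABLE vmp_route (vmp_snomed_code TEXT, route TEXT);
    INSERT INTO scmd VALUES ('111', 10.0), ('222', 5.0);
    INSERT INTO vmp_route VALUES
        ('111', 'subcutaneous'),
        ('111', 'intramuscular'),
        ('111', 'intravenous'),
        ('222', 'oral');
""")

# Naive join: product '111' has three routes, so its quantity is counted three times.
naive_total = conn.execute("""
    SELECT SUM(s.total_quanity_in_vmp_unit)
    FROM scmd AS s
    JOIN vmp_route AS r ON s.vmp_snomed_code = r.vmp_snomed_code
""").fetchone()[0]

# Collapse routes into broad categories first, keeping one row per
# (product, category), then join.
grouped_total = conn.execute("""
    WITH broad AS (
        SELECT DISTINCT
            vmp_snomed_code,
            CASE
                WHEN route IN ('subcutaneous', 'intramuscular', 'intravenous')
                    THEN 'injectable'
                WHEN route = 'oral' THEN 'oral'
                ELSE 'other'
            END AS route_group
        FROM vmp_route
    )
    SELECT SUM(s.total_quanity_in_vmp_unit)
    FROM scmd AS s
    JOIN broad AS b ON s.vmp_snomed_code = b.vmp_snomed_code
""").fetchone()[0]

print(naive_total, grouped_total)
```

A product whose routes fall into more than one broad category would still appear more than once, which is consistent with the note above that grouping removes most, not all, duplication.
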
Jongmassey commented 2 years ago

re: Data structure, the fields defined at source as INTEGER have been imported as STRING in BigQuery. Where these are ID fields (e.g. snomed codes) this causes some minor difficulties joining to other dm+d tables where these ID fields have been correctly typed as integer.

HelenCEBM commented 2 years ago

@inglesp on Jon's point above, were the dtypes manually assigned or automatically detected? Do the files actually have other dtypes than those stated? (E.g. the year-month field is stated to be an integer but of the form YYYY-MM which doesn't look like an integer to me)

inglesp commented 2 years ago

I'm about to log off for the weekend. Could you and @Jongmassey (maybe @ghickman could help, as original author?) see if your question's answered by the code?

inglesp commented 2 years ago

> re: Data structure, the fields defined at source as INTEGER have been imported as STRING in BigQuery. Where these are ID fields (e.g. snomed codes) this causes some minor difficulties joining to other dm+d tables where these ID fields have been correctly typed as integer.

SNOMED codes might be numeric, but saving them as integers causes pain. (Especially with Excel, although that's not relevant here.)

I'd rather store them everywhere as strings (and take a performance hit?) but doing that's probably a big chunk of work.
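
One way numeric storage can cause pain is when a tool holds numbers as double-precision floats, which cannot represent every long integer exactly. The sketch below is a hedged illustration of that general effect, not a description of any specific tool in this pipeline; the function name `round_trips_through_float` and both example codes are made up.

```python
def round_trips_through_float(code: str) -> bool:
    """Return True if the numeric code survives conversion to a 64-bit float and back."""
    as_int = int(code)
    return int(float(as_int)) == as_int


# A made-up 17-digit identifier: too long to be represented exactly as a double.
long_code = "12345678901234567"
# A made-up 9-digit identifier: short enough to survive.
short_code = "322236009"

print(round_trips_through_float(long_code))
print(round_trips_through_float(short_code))
```

Kept as a string, the identifier is never interpreted numerically, so this kind of silent change cannot happen.
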

Jongmassey commented 2 years ago

It seems like it'd be a fairly trivial change to the schema definition here https://github.com/ebmdatalab/openprescribing/blob/b78b21c5cd68500ca7d0f445bcaf7c90b212e1ff/openprescribing/pipeline/management/commands/import_scmd.py#L10-L18

I know R has trouble with 64-bit ints but I see that as an R problem rather than a problem with the database schema! There is a setting within R's db connection utility to auto-cast bigints to strings which resolves this.

I'd be inclined to change the schema to be consistent with the rest of the dm+d tables, to avoid having to join on foo.bar = cast(baz.qux as string) etc
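
The type mismatch behind that cast can be sketched outside the database. This is a minimal Python analogue under stated assumptions, not the BigQuery behaviour itself: the rows, product names and the helper `join_names` are hypothetical, with the SCMD side holding the code as a string and the lookup side keying it as an integer.

```python
# Hypothetical data: SCMD stores the code as a string,
# while the dm+d-style lookup keys it as an integer.
scmd_rows = [
    {"vmp_snomed_code": "111", "total_quanity_in_vmp_unit": 10.0},
    {"vmp_snomed_code": "222", "total_quanity_in_vmp_unit": 5.0},
]
vmp_names = {111: "Product A", 222: "Product B"}


def join_names(rows, names, cast=None):
    """Look up a product name for each row, optionally casting the key first."""
    joined = []
    for row in rows:
        key = row["vmp_snomed_code"]
        if cast is not None:
            key = cast(key)
        joined.append((row["vmp_snomed_code"], names.get(key)))
    return joined


# Without a cast the string key never equals the integer key, so nothing matches.
uncast = join_names(scmd_rows, vmp_names)
# Casting on every join works, but has to be remembered each time.
cast_to_int = join_names(scmd_rows, vmp_names, cast=int)

print(uncast)
print(cast_to_int)
```

Storing the key with the same type on both sides removes the need for the per-join cast, whichever type is chosen.
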

inglesp commented 2 years ago

> I'd be inclined to change the schema to be consistent with the rest of the dm+d tables, to avoid having to join on foo.bar = cast(baz.qux as string) etc

Yeah, fair enough, the casting isn't very nice.

(I still think that using numerical data types to store non-numeric data is asking for trouble!)