Summary of Data Issues - Githubissues

HelenCEBM commented 2 years ago

Data Description

Secondary care medicines data (SCMD) contains processed pharmacy stock control data in Dictionary of Medicines and Devices (dm+d) standardised format from all NHS Acute, Teaching, Specialist, Mental Health and Community Trusts in England.

More information available in the guidance document

Important notes / Known issues

Quoted from guidance

Timeliness

Data is to be provided approximately 1 month in arrears of the previous closed month to allow for complete mapping of the data by Rx-info and the effects of backtracking to propagate through the data before publication.

The primary sources of this data are loaded from hospitals daily in most cases but secondary sources appear monthly and in arrears of up to 6 weeks.

We plan to provide a complete annual refresh of the data two months after the close of a financial year, planned for the end May, which will then be the fixed data set accounting for backtracking. 45 per cent of the data sources from which this extract is based are subject to backtracking. Data backtracking is at its greatest in the 3 months prior to current month and can affect a variable amount of data per month based on Trust type. On this basis data should be treated as provisional until the previous year refresh is provided.
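
The refresh rule quoted above can be turned into a rough check for whether a given month should still be treated as provisional. This is a minimal sketch: it assumes the financial year runs April to March and that the refresh lands on 31 May as planned; the function names are illustrative and the actual publication date should be confirmed against the NHSBSA site.

```python
from datetime import date


def financial_year_end(month: date) -> date:
    """Return the 31 March that closes the financial year containing `month`."""
    end_year = month.year if month.month <= 3 else month.year + 1
    return date(end_year, 3, 31)


def is_provisional(month: date, today: date) -> bool:
    """True if the annual refresh covering `month` is not yet expected to exist."""
    fy_end = financial_year_end(month)
    # Assumption: the refresh is published at the end of May after the year closes.
    refresh_due = date(fy_end.year, 5, 31)
    return today <= refresh_due
```

For example, `is_provisional(date(2021, 1, 1), date(2021, 6, 1))` is false under these assumptions, while data for April 2021 would still be provisional on the same day.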

Completeness

The following Trusts are currently not represented in this data set as there have been technical challenges extracting from their specific pharmacy systems:

• UNIVERSITY COLLEGE LONDON HOSPITALS NHS FOUNDATION TRUST (RRV)
• GREAT ORMOND STREET HOSPITAL FOR CHILDREN NHS FOUNDATION TRUST (RP4)

Data for a small number of drugs is not included for ROYAL BROMPTON & HAREFIELD NHS FOUNDATION TRUST (RT3) due to historic Non-Disclosure Agreements.

Exclusions

Where specific medicines [or other items] do not have a dm+d code they cannot be standardised across all organisations and therefore do not appear in this data set. At the current time 5.4% of all hospital pharmacy issued medicine items cannot be mapped to dm+d due to not existing as a dm+d concept. Work is underway with the dm+d editorial team to increase coverage for hospital medicines.

Specialties excluded from data:

• Breakages/Damages, Disposal, Expired Stock, Stock Adjustments – not issued to patients
• General Sales – non NHS use
• GP Prescriptions – accounted for in other data sources
• Private Patients – Non NHS spend
• Internal Stock Transfers – prevention of double counting issues data

Transactions flagged as outliers: During data quality control, some data points are identified as incorrect when supplied from the publishing hospital. These errors are few but can be material in size and are internally highlighted as erroneous and excluded in this release.

Zero activity: The total quantity of medicines issued can on occasion be found in the source data as zero; these records are excluded from this release. They occur as an artifact of the backtracking process.

Other issues

In some cases trusts may merge but continue to have separate pharmacy systems for several years. In such cases their Rx-Info data is collected and reported based on their historic ODS code. We have provided a separate Excel file that tries to identify the ODS codes and the mapping to current ODS codes, together with why the code has not been updated.

Relationship to other data sets

There's an additional SCMD file containing indicative costs, but only from 2021-05 onwards. We assume the data is otherwise the same but have not checked.
Hospital FP10(HP) forms dispensed in the community are published separately, as is primary care prescribing. It may be important to consider all data together where services may be divided between settings in different areas.

Data structure

At source:

In BigQuery:

Field name	Type	Mode
year_month	DATE	NULLABLE
ods_code	STRING	NULLABLE
vmp_snomed_code	STRING	NULLABLE
vmp_product_name	STRING	NULLABLE
unit_of_measure_identifier	STRING	NULLABLE
unit_of_measure_name	STRING	NULLABLE
total_quanity_in_vmp_unit	FLOAT	NULLABLE

Data source / consistency

Data is (so far) always in the same place and easy to access: https://opendata.nhsbsa.net/dataset/secondary-care-medicines-data
Headers/formats are (so far) consistent, except that some of the data files use YYYYMM and others use YYYY-MM.
Typo in raw data header total_**quanity**_in_vmp_unit
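
Because the month appears as either YYYYMM or YYYY-MM depending on the file, it may help to normalise it on load. A minimal sketch, assuming the value arrives as a string; `parse_year_month` is an illustrative helper, not part of any existing import code:

```python
from datetime import date


def parse_year_month(value: str) -> date:
    """Parse 'YYYYMM' or 'YYYY-MM' into the first day of that month."""
    digits = value.strip().replace("-", "")
    if len(digits) != 6 or not digits.isdecimal():
        raise ValueError(f"unrecognised year_month value: {value!r}")
    # date() itself rejects impossible months such as 13.
    return date(int(digits[:4]), int(digits[4:]), 1)
```

Both `parse_year_month("202105")` and `parse_year_month("2021-05")` give the same date, so files in either format can be appended safely.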

Completeness and Range of Data

https://github.com/ebmdatalab/open-nhs-hospital-use-data/issues/19 -> https://nbviewer.org/github/ebmdatalab/open-nhs-hospital-use-data/blob/data-quality-exploration/notebooks/data_quality/scmd_data_quality_explore.html

All fields are complete (but note Completeness and Exclusions listed above)
Number of records is consistent across months (there was a small drop at the start of the pandemic which may be expected and this gradually recovered)
total_quanity_in_vmp_unit contains some negative values (these arise when stock is returned) and three zeros (although zeros are supposed to be excluded). From guidance:

It is technically possible that a Trust can show a negative use of a medicine where supply made in a previous month has been returned in a subsequent one.
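
A quick way to re-check this on a fresh extract is to count the sign of each quantity. This is a minimal sketch that assumes rows are available as dictionaries keyed by the BigQuery field names; the function name is illustrative.

```python
def summarise_quantity_signs(rows):
    """Count negative, zero and positive quantities in an iterable of row dicts."""
    counts = {"negative": 0, "zero": 0, "positive": 0}
    for row in rows:
        # The field name keeps the misspelling used in the raw data header.
        qty = row["total_quanity_in_vmp_unit"]
        if qty < 0:
            counts["negative"] += 1
        elif qty == 0:
            counts["zero"] += 1
        else:
            counts["positive"] += 1
    return counts
```

A non-zero "zero" count would indicate records that the guidance says should have been excluded.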

Unexpected values

ods_code - contains old values (e.g. trusts which have closed/merged). See Known issues above.
vmp_snomed_code - contains old values (drugs which now have new codes)
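
One way to handle old ods_code values when aggregating is to follow a successor mapping and list anything that still does not resolve. The sketch below is illustrative only: `successor_of` and `current_codes` are hypothetical inputs that would have to be built by hand (for example from the mapping file published alongside the data), and the same pattern could be applied to superseded VMP codes.

```python
def remap_ods_codes(rows, successor_of, current_codes):
    """Replace historic ODS codes with their successors.

    Returns the remapped rows and the set of codes that could not be resolved.
    """
    remapped, unresolved = [], set()
    for row in rows:
        code = row["ods_code"]
        seen = set()
        # Follow the chain in case a trust merged more than once; guard against cycles.
        while code in successor_of and code not in seen:
            seen.add(code)
            code = successor_of[code]
        if code not in current_codes:
            unresolved.add(code)
        remapped.append({**row, "ods_code": code})
    return remapped, unresolved
```

Codes left in `unresolved` still need manual assignment, for example trusts that closed rather than merged.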

How to Filter the Data (WIP)

VMP codelists
Trust IDs

How to Summarise the Data

The total_quanity_in_vmp_unit field cannot be summed across drugs because the units of measurement vary; grouping by unit is not usually meaningful because e.g. for those measured by volume the concentration will vary. Quantities could be converted to mg to calculate totals but again this is not necessarily helpful as even for similar drugs the dosage given may vary widely.
Unless looking at a small handful of drugs it can be useful to map drugs to categories, but there is no clear method for this.
Organisational/geographic summary: prescribing of drug X can be shown as a proportion of
Population denominators: Populations do not really exist at trust level but STP/regional populations could be used, with caveats (e.g. specialist centres may often treat certain conditions/groups of patients from out of area)
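
Given that quantities are only comparable within a single product and unit, a safe default is to total per month, product and unit rather than across drugs. A minimal sketch, assuming rows are dictionaries keyed by the BigQuery field names; the function name is illustrative.

```python
from collections import defaultdict


def total_by_product_and_unit(rows):
    """Sum quantities per (year_month, vmp_snomed_code, unit_of_measure_identifier)."""
    totals = defaultdict(float)
    for row in rows:
        key = (
            row["year_month"],
            row["vmp_snomed_code"],
            row["unit_of_measure_identifier"],
        )
        # Negative quantities (returned stock) are deliberately netted off.
        totals[key] += row["total_quanity_in_vmp_unit"]
    return dict(totals)
```

Any further roll-up (to ingredient, category or DDD) then has to deal explicitly with unit conversion rather than hiding it in a sum.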

Joins required and difficulties encountered

Data field	Data to join	Issues
ods_code	geographic area (STP/region etc)	Need to manually assign any old codes to the appropriate area (no routine source of these codes can be found). There is a mapping file available alongside the data which indicates some trust mergers but not closures. Does not map to wider geographic areas, but could be useful for old trust names.
vmp_snomed_code	DMD ingredients	Causes duplication as there may be multiple ingredients per product
vmp_snomed_code	DMD route of administration	Causes duplication as there may be multiple routes per product (e.g. an injectable could be subcutaneous, intramuscular and IV). However, grouping to broad categories (e.g. injectable, oral, topical, other) should normally remove most duplication.
vmp_snomed_code	DDDs	Duplication due to ingredient specificity; route specificity; DDD units of measurement need to match those in SCMD
vmp_snomed_code	drug categorisation: BNF paragraphs	Not all hospital medicines have a BNF code. For products with BNF codes, first map ingredients to their BNF paragraph and then apply to products without BNF codes. Causes duplication.
vmp_snomed_code	drug categorisation: WHO AWARE (antibiotics)	Mapping lookup text to VMPs; Duplication / route specificity (some antibiotics are classified differently by route e.g. topical vs IV); completeness
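
The duplication described for the route join can be avoided by collapsing routes to broad categories first and attaching a label only when it is unambiguous, so the row count never grows. This is a sketch under stated assumptions: the `BROAD_CATEGORY` grouping and the route names in it are hypothetical and would need replacing with the real dm+d route list.

```python
# Hypothetical grouping of routes into broad categories.
BROAD_CATEGORY = {
    "subcutaneous": "injectable",
    "intramuscular": "injectable",
    "intravenous": "injectable",
    "oral": "oral",
    "cutaneous": "topical",
}


def broad_routes_by_vmp(vmp_routes):
    """Map each VMP code to the set of broad categories its routes fall into."""
    result = {}
    for vmp, route in vmp_routes:
        result.setdefault(vmp, set()).add(BROAD_CATEGORY.get(route, "other"))
    return result


def attach_route_category(rows, routes_by_vmp):
    """Add one route_category per row without duplicating rows."""
    out = []
    for row in rows:
        categories = routes_by_vmp.get(row["vmp_snomed_code"], set())
        if len(categories) == 1:
            label = next(iter(categories))
        elif categories:
            label = "multiple"  # still ambiguous after grouping
        else:
            label = "unknown"  # no route information found
        out.append({**row, "route_category": label})
    return out
```

Rows labelled "multiple" are the ones that would have been double counted by a plain join, so their share gives a rough measure of how much ambiguity remains.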

Jongmassey commented 2 years ago

re: Data structure, the fields defined at source as INTEGER have been imported as STRING in BigQuery. Where these are ID fields (e.g. snomed codes) this causes some minor difficulties joining to other dm+d tables where these ID fields have been correctly typed as integer.

HelenCEBM commented 2 years ago

@inglesp on Jon's point above, were the dtypes manually assigned or automatically detected? Do the files actually have other dtypes than those stated? (E.g. the year-month field is stated to be an integer but of the form YYYY-MM which doesn't look like an integer to me)

inglesp commented 2 years ago

I'm about to log off for the weekend. Could you and @Jongmassey (maybe @ghickman could help, as original author?) see if your question's answered by the code?

inglesp commented 2 years ago

re: Data structure, the fields defined at source as INTEGER have been imported as STRING in BigQuery. Where these are ID fields (e.g. snomed codes) this causes some minor difficulties joining to other dm+d tables where these ID fields have been correctly typed as integer.

SNOMED codes might be numeric, but saving them as integers causes pain. (Especially with Excel, although that's not relevant here.)

I'd rather store them everywhere as strings (and take a performance hit?) but doing that's probably a big chunk of work.
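
A small illustration of the kind of pain meant here: a long identifier fits in a 64-bit integer, but is silently altered as soon as a tool converts it to a double-precision float. The 17-digit value below is arbitrary and not a real SNOMED code.

```python
def survives_float_round_trip(code: int) -> bool:
    """True if converting the identifier to a float and back leaves it unchanged."""
    return int(float(code)) == code


# Arbitrary 17-digit example value (not a real SNOMED code).
long_code = 12345678901234567

# Fits comfortably in a signed 64-bit integer...
fits_in_int64 = long_code < 2**63
# ...but exceeds 2**53, beyond which not every integer is representable as a float.
exact_as_float = survives_float_round_trip(long_code)
# Stored as a string, the identifier is always preserved character for character.
exact_as_string = str(long_code) == "12345678901234567"
```

Here `fits_in_int64` and `exact_as_string` are true while `exact_as_float` is false, which is why integer storage is fine in the database itself but fragile in tools that coerce to floats.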

Jongmassey commented 2 years ago

It seems like it'd be a fairly trivial change to the schema definition here https://github.com/ebmdatalab/openprescribing/blob/b78b21c5cd68500ca7d0f445bcaf7c90b212e1ff/openprescribing/pipeline/management/commands/import_scmd.py#L10-L18

I know R has trouble with 64 bit ints but I see that as an R problem rather than a problem with the database schema! There is a setting within R's db connection utility to auto-cast bigints to strings which resolves this.

I'd be inclined to change the schema to be consistent with the rest of the dm+d tables, to avoid having to join on foo.bar = cast(baz.qux as string) etc
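
Until the schema is changed, the same cast is needed wherever the join is done outside SQL. A minimal Python sketch of the equivalent, assuming SCMD rows carry string codes and dm+d rows carry integer ids; the dm+d field names `id` and `name` are hypothetical.

```python
def join_on_code(scmd_rows, dmd_rows):
    """Inner-join SCMD rows to dm+d rows, casting the integer dm+d id to a string."""
    dmd_by_id = {str(row["id"]): row for row in dmd_rows}
    return [
        {**row, "dmd_name": dmd_by_id[row["vmp_snomed_code"]]["name"]}
        for row in scmd_rows
        if row["vmp_snomed_code"] in dmd_by_id
    ]
```

Without the `str(...)` cast the lookup finds nothing, because the string "123" and the integer 123 are different keys.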

inglesp commented 2 years ago

I'd be inclined to change the schema to be consistent with the rest of the dm+d tables, to avoid having to join on foo.bar = cast(baz.qux as string) etc

Yeah, fair enough, the casting isn't very nice.

(I still think that using numerical data types to store non-numeric data is asking for trouble!)

ebmdatalab / open-nhs-hospital-use-data

Summary of Data Issues #34
