Open HelenCEBM opened 2 years ago
re: Data structure, the fields defined at source as INTEGER have been imported as STRING in BigQuery. Where these are ID fields (e.g. snomed codes) this causes some minor difficulties joining to other dm+d tables where these ID fields have been correctly typed as integer.
@inglesp on Jon's point above, were the dtypes manually assigned or automatically detected? Do the files actually have other dtypes than those stated? (E.g. the year-month field is stated to be an integer but of the form YYYY-MM which doesn't look like an integer to me)
I'm about to log off for the weekend. Could you and @Jongmassey (maybe @ghickman's could help, as original author?) see if your question's answered by the code?
re: Data structure, the fields defined at source as INTEGER have been imported as STRING in BigQuery. Where these are ID fields (e.g. snomed codes) this causes some minor difficulties joining to other dm+d tables where these ID fields have been correctly typed as integer.
SNOMED codes might be numeric, but saving them as integers causes pain. (Especially with Excel, although that's not relevant here.)
I'd rather store them everywhere as strings (and take a performance hit?) but doing that's probably a big chunk of work.
It seems like it'd be a fairly trivial change to the schema definition here https://github.com/ebmdatalab/openprescribing/blob/b78b21c5cd68500ca7d0f445bcaf7c90b212e1ff/openprescribing/pipeline/management/commands/import_scmd.py#L10-L18
I know R has trouble with 64 bit ints but I see that as an R problem rather than a problem with the database schema! There is a setting within R's db connection utility to auto-cast bigints to strings which resolves this.
I'd be inclined to change to schema to be consistent with the rest of the dm+d tables for avoidance of having to join on foo.bar = cast(baz.qux as string)
etc
I'd be inclined to change to schema to be consistent with the rest of the dm+d tables for avoidance of having to join on
foo.bar = cast(baz.qux as string)
etc
Yeah, fair enough, the casting isn't very nice.
(I still think that using numerical data types to store non-numeric data is asking for trouble!)
Contents
Data Description
Secondary care medicines data (SCMD) contains processed pharmacy stock control data in Dictionary of Medicines and Devices (dm+d) standardised format from all NHS Acute, Teaching, Specialist, Mental Health and Community Trusts in England.
More information available in the guidance document
Important notes / Known issues
Quoted from guidance
Timeliness
Completeness
Exclusions
Other issues
Relationship to other data sets
Data structure
At source:
Data source / consistency
Data is (so far) always in same place and easy to access: https://opendata.nhsbsa.net/dataset/secondary-care-medicines-data
Headers/formats are (so far) consistent except some of the data files use
YYYYMM
, and others useYYYY-MM
.Typo in raw data header
total_**quanity**_in_vmp_unit
Completeness and Range of Data
https://github.com/ebmdatalab/open-nhs-hospital-use-data/issues/19 -> https://nbviewer.org/github/ebmdatalab/open-nhs-hospital-use-data/blob/data-quality-exploration/notebooks/data_quality/scmd_data_quality_explore.html
total_quanity_in_vmp_unit
contains some negative values (this is when stock is returned, see above) and three zeros (however zeros are supposed to be excluded). From guidance:Unexpected values
How to Filter the Data (WIP)
How to Summarise the Data
Joins required and difficulties encountered