Null values currently fail validation for EmergencyCareEpisodeSchema due to int64 type

LTHTR-DST / hdruk_avoidable_admissions

HDRUK Data Science Collaboration on Avoidable Admissions in the NHS.

https://lthtr-dst.github.io/hdruk_avoidable_admissions/

MIT License

Null values currently fail validation for EmergencyCareEpisodeSchema due to int64 type #44

Open georgm8 opened 1 year ago

georgm8 commented 1 year ago

The following columns are currently failing validation because they contain null values but are set to dtype=np.int64 in the EmergencyCareEpisodeSchema:

edcomorb_[0-9]{2}$
eddiag_[0-9]{2}$
eddiagqual_[0-9]{2}$
edentryseq_[0-9]{2}$
edinvest_[0-9]{2}$
edtreat_[0-9]{2}$

I suggest changing these to pd.Int64Dtype() to allow null values.

They could be changed to a float type instead, but then you get the awkward situation where pandas adds a decimal point onto the end of the SNOMED code.
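To make the trade-off between the three dtypes concrete, here is a minimal sketch in plain pandas. It does not touch the project's schema; the two integer values are the SNOMED codes quoted elsewhere in this thread, and the variable names are made up for illustration.

```python
import numpy as np
import pandas as pd

# Two codes quoted in this thread, with None standing in for a missing value.
raw = [1088291000000101, None, 183964008]

# float64 can hold the gap as NaN, but each code now renders with a trailing ".0".
as_float = pd.Series(raw, dtype="float64")
float_text = str(float(as_float.iloc[0]))

# np.int64 has no representation for a missing value, so the cast raises.
try:
    as_float.astype(np.int64)
    int64_cast_failed = False
except ValueError:
    int64_cast_failed = True

# The nullable extension dtype keeps integers as integers and stores the gap as pd.NA.
as_nullable = pd.Series(raw, dtype=pd.Int64Dtype())

print(float_text)
print(int64_cast_failed)
print(as_nullable)
```

Note that pd.Int64Dtype() is an extension dtype, so downstream code that expects a plain NumPy integer array may need checking if this route is taken.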

georgm8 commented 1 year ago

Having had a better look at feature_maps.py, I think this might be better managed by doing a fillna(0) on the relevant columns!

I was wondering if I could clarify a few other things, however, @vvcb:

1) Missing values have already been accounted for in edinvest_[0-9]{2}$, with the SNOMED code for this specified as 1088291000000101

2) However, missing values have not been accounted for in edtreat_[0-9]{2}$. The HDRUK document specifies the SNOMED code for this as 183964008, so this could easily be added into feature_maps.py (happy to do this)

3) Only the eddiag_01 column is required for analysis and presumably if this value is missing we should discard the row from the dataset (same applies to eddiagqual_01)

4) There is currently no pipeline to manage edcomorb_[0-9]{2}$ - However, I could pull this code in from admitted_care_features.py

5) Although edentryseq is specified in the Regional Data Specification, it doesn't appear to be used in the analysis

vvcb commented 1 year ago

Missing SNOMED codes should be replaced with 0. This avoids pandas NaN issues (I have included a link in the documentation). Is PR #46 still necessary if this is already done?

If there is a specific code for missing values, then this should be included in feature_maps. It would be great if you are happy to do this.
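As a rough sketch of the fillna(0) approach, the snippet below fills missing values with 0 in every column matching the patterns from the original report and then casts those columns back to np.int64. The helper name fill_missing_codes and the example frame are hypothetical and are not taken from the repository; the value 123456 is a placeholder, not a real SNOMED code.

```python
import re

import numpy as np
import pandas as pd

# Column patterns as listed in the original report.
SNOMED_COLUMN_PATTERNS = [
    r"edcomorb_[0-9]{2}$",
    r"eddiag_[0-9]{2}$",
    r"eddiagqual_[0-9]{2}$",
    r"edentryseq_[0-9]{2}$",
    r"edinvest_[0-9]{2}$",
    r"edtreat_[0-9]{2}$",
]


def fill_missing_codes(df, patterns=SNOMED_COLUMN_PATTERNS, fill_value=0):
    """Hypothetical helper: replace nulls with fill_value and cast to int64."""
    combined = re.compile("|".join(f"(?:{p})" for p in patterns))
    cols = [c for c in df.columns if combined.match(c)]
    out = df.copy()
    if not cols:
        return out
    out[cols] = out[cols].fillna(fill_value)
    # With no nulls left, the plain NumPy integer dtype is safe again.
    return out.astype({c: np.int64 for c in cols})


example = pd.DataFrame(
    {
        "eddiag_01": [123456, np.nan],  # placeholder value, not a real code
        "edinvest_01": [np.nan, 1088291000000101],
        "edtreat_01": [np.nan, 183964008],
        "unrelated": [1.5, np.nan],  # does not match any pattern, left alone
    }
)
filled = fill_missing_codes(example)
print(filled.dtypes)
print(filled)
```

Columns that do not match any pattern keep their nulls, so the fill stays limited to the SNOMED code columns.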

vvcb commented 1 year ago

Re 3) Not sure if you would discard the entire row unless this column is necessary for the inclusion criteria, in which case it should not be missing. @quindavies, thoughts?
Re 4) Yes please.
Re 5) @quindavies? I haven't looked at the ED spec in detail and am happy to be guided by Quin and others on this. Eyeballs deep in TRE and OMOP work.

quindavies commented 1 year ago

Re 3) I don't recall the presence of a diagnosis being part of the inclusion criteria. However, the analysis tables are all summarised by ACSC flag, which is based on diagnosis? We can't assume that the absence of a diagnosis means that the attendance wasn't ambulatory related? What percentage of records does this affect?
Re 5) No, I can't see it being used either 😄, in this or the winter pressures work

georgm8 commented 1 year ago

Yes, I believe that is correct. The analysis tables are grouped by 'ACSC' and 'Non-ACSC', which are both derived from the diagnosis. So I think the options would be:

a) Treat absence of diagnosis as 'Non-ACSC'
b) Discard the row, as we don't have a 'No Diagnosis' category and therefore can't include these patients in the analysis

We have about 10% of patients where there is no diagnosis assigned within the emergency care dataset
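For anyone wanting to check the proportion on their own extract and compare the two options, something along these lines might help. It is only a sketch: HYPOTHETICAL_ACSC_CODES and the values in eddiag_01 are placeholders rather than real SNOMED codes, and the project's actual ACSC flag is derived in its feature code, not like this.

```python
import numpy as np
import pandas as pd

# Placeholder values only; the real ACSC code list lives in the project.
HYPOTHETICAL_ACSC_CODES = [111, 222]

df = pd.DataFrame({"eddiag_01": [111, np.nan, 333, 222, np.nan]})

# Share of attendances with no primary diagnosis recorded.
missing = df["eddiag_01"].isna()
pct_missing = 100 * missing.sum() / len(df)

# isin() returns False for NaN, so missing diagnoses fall into 'Non-ACSC'.
group = np.where(df["eddiag_01"].isin(HYPOTHETICAL_ACSC_CODES), "ACSC", "Non-ACSC")

# Option a: keep every row and treat a missing diagnosis as 'Non-ACSC'.
option_a = df.assign(group=group)

# Option b: discard rows with no diagnosis before grouping.
option_b = option_a.loc[~missing]

print(pct_missing)
print(option_a["group"].value_counts())
print(option_b["group"].value_counts())
```

Comparing the group counts under both options shows how much the 'Non-ACSC' group changes depending on the choice.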

vvcb commented 1 year ago

Ah...I see it now 😊. Option b may be the correct one, but it is worth checking with the lead team regarding how they want this handled. 10% is a sizable proportion to be discarding.