Open georgm8 opened 1 year ago
Having had a better look at feature_maps.py, I think this might be better managed by doing a fillna(0) on the relevant columns!
I was wondering if I could clarify a few other things, however, @vvcb:
1) Missing values have already been accounted for in edinvest_[0-9]{2}$, with the SNOMED code specified for this as 1088291000000101
2) However, missing values have not been accounted for in edtreat_[0-9]{2}$
- the HDRUK document specifies the SNOMED code for this as 183964008, so this could easily be added into feature_maps.py (happy to do this)
3) Only the eddiag_01 column is required for analysis, and presumably if this value is missing we should discard the row from the dataset (same applies to eddiagqual_01)
4) There is currently no pipeline to manage edcomorb_[0-9]{2}$
- However, I could pull this code in from admitted_care_features.py
5) Although edentryseq is specified in the Regional Data Specification, it doesn't appear to be used in the analysis
Missing SNOMED codes should be replaced with 0. This avoids pandas NaN issues (I have included a link in the documentation). Is PR #46 still necessary if this is already done?
If there is a specific code for missing values, then this should be included in feature_maps. It would be great if you are happy to do this.
Not sure if you would discard the entire row unless this column is necessary for the inclusion criteria...in which case it should not be missing. @quindavies , thoughts?
Yes please.
@quindavies ? I haven't looked at the ED spec in detail and am happy to be guided by Quin and others on this. Eyeballs deep in TRE and OMOP work.
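For reference, a minimal sketch of the fillna(0) approach discussed above. It assumes the data is held in a pandas DataFrame and that the relevant columns can be picked out by the regex patterns quoted in this thread; the function name and the pattern list are hypothetical and are not taken from feature_maps.py.

```python
import re

import pandas as pd

# Hypothetical pattern list, based on the column names quoted in this thread.
SNOMED_PATTERNS = [r"edinvest_[0-9]{2}$", r"edtreat_[0-9]{2}$"]


def fill_missing_snomed(df, patterns=SNOMED_PATTERNS, fill_value=0):
    """Return a copy of df with nulls in matching columns replaced by fill_value."""
    out = df.copy()
    for col in out.columns:
        if any(re.match(pattern, col) for pattern in patterns):
            # After filling there are no nulls left, so a plain int64 cast is safe.
            out[col] = out[col].fillna(fill_value).astype("int64")
    return out


example = pd.DataFrame(
    {
        "edinvest_01": [1088291000000101, None],
        "edtreat_01": [None, 183964008],
        "eddiag_01": [None, 123],  # not matched by the patterns, left untouched
    }
)
filled = fill_missing_snomed(example)
```

Columns that do not match a pattern (eddiag_01 here) keep their nulls, so how to treat a missing diagnosis remains a separate decision.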
a) Treat absence of diagnosis as 'Non-ACSC'
b) Discard the row, as we don't have a 'No Diagnosis' category and therefore can't include these patients in the analysis
We have about 10% of patients where there is no diagnosis assigned within the emergency care dataset
Ah...I see it now 😊. Option b may be the correct one, but it is worth checking with the lead team regarding how they want this handled. 10% is a sizable proportion to be discarding.
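If option b is chosen, it may help to report how much data is being dropped at the same time. A rough sketch, assuming a pandas DataFrame in which a missing diagnosis is stored as a null in eddiag_01; the function name is hypothetical.

```python
import pandas as pd


def drop_missing_diagnosis(df, col="eddiag_01"):
    """Drop rows with no primary diagnosis and report the proportion dropped."""
    missing = df[col].isna()
    proportion_dropped = float(missing.mean())
    return df.loc[~missing].copy(), proportion_dropped


# Toy data only, to show the shape of the result.
episodes = pd.DataFrame({"eddiag_01": [123, None, 456, 789]})
kept, proportion_dropped = drop_missing_diagnosis(episodes)
```

Returning the proportion alongside the filtered data makes it easy to log or check it against the roughly 10% figure mentioned above before the rows are discarded.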
The following columns are currently failing validation because they contain null values but are set to dtype=np.int64 in the EmergencyCareEpisodeSchema:
edcomorb_[0-9]{2}$
eddiag_[0-9]{2}$
eddiagqual_[0-9]{2}$
edentryseq_[0-9]{2}$
edinvest_[0-9]{2}$
edtreat_[0-9]{2}$
Suggest changing to pd.Int64Dtype() to allow null values.
Could be changed to float type, but you get this awkward situation where pandas adds a decimal point onto the end of the SNOMED code.
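A small illustration of the three dtype options, using only pandas and NumPy and the two SNOMED codes quoted earlier in this thread. It is a sketch of the behaviour, not of the schema itself.

```python
import numpy as np
import pandas as pd

codes = [1088291000000101, None, 183964008]

# float64 accepts the null, but every code is then shown with a trailing ".0".
as_float = pd.Series(codes, dtype="float64")
float_text = str(as_float.iloc[2])

# The nullable integer dtype keeps the codes as integers and the null as <NA>.
as_nullable = pd.Series(codes, dtype=pd.Int64Dtype())
nullable_text = str(as_nullable.iloc[2])

# A plain NumPy int64 column cannot represent the null at all.
try:
    as_float.astype(np.int64)
    plain_int64_accepts_null = True
except ValueError:
    plain_int64_accepts_null = False
```

Beyond the cosmetic decimal point, float64 only represents integers exactly up to 2**53, so the nullable integer dtype is also the safer choice for long identifiers.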