LTHTR-DST / hdruk_avoidable_admissions

HDRUK Data Science Collaboration on Avoidable Admissions in the NHS.
https://lthtr-dst.github.io/hdruk_avoidable_admissions/
MIT License

Validation fails if all values are NaN for str type columns #31

Closed georgm8 closed 1 year ago

georgm8 commented 1 year ago

We don't have any values for 'dismeth', so the entire column contains NaN and pandas infers its dtype as float64, which fails the schema's str check. Converting the column to str does not help: the NaN values become the string 'nan', which is not among the accepted values, so validation still fails.

good, bad = validate_dataframe(df, AdmittedCareEpisodeSchema)
# dismeth dtype('str')     [float64] 

df['dismeth'] = df['dismeth'].astype(str)
good, bad = validate_dataframe(df, AdmittedCareEpisodeSchema)
# dismeth  isin({'9', '2', '4', '1', '3', '8', '5'})    [nan]

vvcb commented 1 year ago

Thanks for reporting, @georgm8. This is a duplicate of #19. For unknown or missing values, use '9' for dismeth and '99' for admisorc, as defined in the NHS data model.
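
A minimal sketch of that suggestion, assuming the frame is built in pandas and that 'dismeth' is entirely missing. The sample frame and the ACCEPTED_DISMETH name are illustrative and not part of the package; the accepted codes are copied from the validation message above. The idea is to fill the missing values with the "unknown" code before casting to str, so the literal string 'nan' never appears.

```python
import numpy as np
import pandas as pd

# Accepted codes, copied from the validation error message in this issue.
ACCEPTED_DISMETH = {"1", "2", "3", "4", "5", "8", "9"}

# Hypothetical frame: 'dismeth' has no recorded values, so pandas infers float64.
df = pd.DataFrame({"dismeth": [np.nan, np.nan, np.nan]})
inferred_dtype = df["dismeth"].dtype

# Cast to object first so a string can be filled in without a dtype clash,
# replace missing values with the "unknown" code '9', then cast to str.
df["dismeth"] = df["dismeth"].astype(object).fillna("9").astype(str)

# Every value is now one of the accepted codes.
all_accepted = df["dismeth"].isin(ACCEPTED_DISMETH).all()
```

The same pattern would apply to admisorc with '99'. After this step the frame can be passed to validate_dataframe as in the original report. If a column holds a mix of real codes and NaN, it is likely safer to read it as strings at load time, since casting floats to str would turn 1.0 into '1.0' rather than '1'.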