LTHTR-DST / hdruk_avoidable_admissions

HDRUK Data Science Collaboration on Avoidable Admissions in the NHS.
https://lthtr-dst.github.io/hdruk_avoidable_admissions/
MIT License

replace_values() throws TypeError: Invalid value 'ERROR:Unmapped - Not In Refset' for dtype Int64 #45

Closed georgm8 closed 1 year ago

georgm8 commented 1 year ago

replace_values() throws TypeError: Invalid value 'ERROR:Unmapped - Not In Refset' for dtype Int64. The function tries to replace unmapped values in a column with the string held in the variable other, which fails when the pandas Series does not hold strings (here, dtype Int64).

Suggested quick fix: cast the Series to str and convert the keys of the replacements dictionary to strings as well.
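A minimal sketch of how the reported error might be reproduced, using the same `where`/`isin` pattern on a plain `int64` Series and on a nullable `Int64` Series. The codes and labels in `replacements` and the helper `fill_unmapped` are hypothetical, and whether the nullable case raises may depend on the installed pandas version, so the helper returns the exception instead of assuming it.

```python
import pandas as pd

OTHER = "ERROR:Unmapped - Not In Refset"
replacements = {1: "category_a", 2: "category_b"}  # hypothetical codes and labels


def fill_unmapped(data: pd.Series):
    """Return the Series with unmapped codes set to OTHER, or the TypeError raised."""
    try:
        return data.where(data.isin(list(replacements)), OTHER)
    except TypeError as exc:
        return exc


# NumPy-backed int64: the result is upcast to object so it can hold the string.
plain = fill_unmapped(pd.Series([1, 2, 99], dtype="int64"))

# Nullable Int64 extension dtype: the case reported in this issue.
nullable = fill_unmapped(pd.Series([1, 2, 99], dtype="Int64"))

print(plain.tolist())
print(repr(nullable))
```

If the second call returns a `TypeError`, the dtype of the input column is the likely cause, not the contents of the replacement map.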

# emergency_care_features.py
import pandas as pd


def replace_values(
    data: pd.Series, replacements: dict, other: str = "ERROR:Unmapped - Not In Refset"
) -> pd.Series:
    # If a value is in replacements, keep it; otherwise use `other`.
    # Then use replacements to assign the mapped categories.

    # Convert the replacement keys and the data to str so that `other`
    # (a string) can be written into the Series regardless of its dtype.
    replacements_str = {str(k): v for k, v in replacements.items()}
    data = data.astype(str)

    # Previously:
    # data.where(data.isin(replacements), other).replace(replacements).astype(str)
    data_cat = (
        data.where(data.isin(replacements_str), other).replace(replacements_str).astype(str)
    )

    return data_cat

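
As a rough illustration of how the proposed fix behaves, the sketch below repeats its logic and applies it to both dtypes. The codes and labels are hypothetical, not taken from feature_maps.py. One side effect to be aware of: after the cast to str, a missing value in a nullable column no longer matches any key, so it is mapped to `other` as well.

```python
import pandas as pd


def replace_values(data, replacements, other="ERROR:Unmapped - Not In Refset"):
    # Same logic as the proposed fix: compare keys and data as strings.
    replacements_str = {str(k): v for k, v in replacements.items()}
    data = data.astype(str)
    return (
        data.where(data.isin(list(replacements_str)), other)
        .replace(replacements_str)
        .astype(str)
    )


replacements = {1: "category_a", 2: "category_b"}  # hypothetical map

from_int64 = replace_values(pd.Series([1, 2, 99], dtype="int64"), replacements)
from_nullable = replace_values(pd.Series([1, 2, 99, pd.NA], dtype="Int64"), replacements)

print(from_int64.tolist())
print(from_nullable.tolist())
```
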
georgm8 commented 1 year ago

Pull request #46

vvcb commented 1 year ago

Thanks for reporting this, @georgm8. The data expected in this column are SNOMED codes, which are integers rather than strings. feature_maps.py generates the map between the SNOMED codes (as integers) and the string categories.

Not sure why that error appears. Looking at this on my phone at the moment. Will check this evening and merge.

Regarding the SNOMED codes for missing data, I agree that they should go in feature_maps along with 0, which I think is already in there.

vvcb commented 1 year ago

@georgm8, I have rerun v0.3.1 on the LTH data and don't get this error. The following is the truncated output of good.dtypes after the first validation. There shouldn't really be any Int64 dtypes unless you are coercing columns into this dtype in a previous step. Is it possible that it was introduced to allow NaN in SNOMED columns, instead of assigning 0 or one of the allowed values for missing or unknown values?

Can you please check and close this issue if this explains it?

Also please see https://github.com/pandas-dev/pandas/issues/45729.

column                   dtype
patient_id               int64
visit_id                 int64
townsend_score_quintile  int64
gender                   object
activage                 int64
ethnos                   object
accommodationstatus      int64
procodet                 object
edsitecode               object
eddepttype               object
edarrivalmode            int64
edattendcat              object
edattendsource           int64
edarrivaldatetime        datetime64[ns, UTC]
edwaittime               float64
edacuity                 int64
edchiefcomplaint         int64
edcomorb_01              int64
eddiag_NN                int64
edentryseq_NN            int64
eddiagqual_NN            int64
edinvest_NN              int64
edtreat_NN               int64
timeined                 float64
disstatus                int64
edattenddispatch         int64
edrefservice             int64

georgm8 commented 1 year ago

Thanks - you're absolutely right - I forgot to remove the Nullable Integer data type I was testing out earlier. No error with int64 data types. Closing.
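
For anyone hitting the same error, a quick check along these lines may help find leftover nullable integer columns before validation. This is only a sketch: the DataFrame contents are made up, and filling missing codes with 0 is an assumption based on the discussion above, not a confirmed rule of the pipeline.

```python
import pandas as pd
from pandas.api.types import is_extension_array_dtype, is_integer_dtype

# Hypothetical frame: one column was left as nullable Int64 by an earlier step.
df = pd.DataFrame(
    {
        "edarrivalmode": pd.array([1, 2, None], dtype="Int64"),
        "edacuity": [1, 2, 3],
    }
)

# Nullable integer columns are extension dtypes that are also integer dtypes.
nullable_cols = [
    col
    for col, dtype in df.dtypes.items()
    if is_extension_array_dtype(dtype) and is_integer_dtype(dtype)
]
print(nullable_cols)

# Assumption: 0 is an acceptable code for missing values in these columns.
for col in nullable_cols:
    df[col] = df[col].fillna(0).astype("int64")

print(df.dtypes)
```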