replace_values() throws TypeError: Invalid value 'ERROR:Unmapped - Not In Refset' for dtype Int64

georgm8 commented 1 year ago

The replace_values() function throws TypeError: Invalid value 'ERROR:Unmapped - Not In Refset' for dtype Int64. It tries to replace unmapped values in a column with the string held in the variable other, which fails when the Pandas Series does not have a string dtype.

A suggested quick fix is to cast the Series to str and to convert the dictionary keys to strings as well.

# emergency_care_features.py
import pandas as pd


def replace_values(
    data: pd.Series, replacements: dict, other: str = "ERROR:Unmapped - Not In Refset"
) -> pd.Series:
    # Keep values that are keys in `replacements`; set all others to `other`.
    # Then use `replacements` to assign the categories.

    # Cast the dictionary keys and the Series to str so that the string `other`
    # can be written into the column whatever its original dtype.
    replacements_str = {str(k): v for k, v in replacements.items()}
    data = data.astype(str)

    data_cat = (
        # Previously: data.where(data.isin(replacements), other).replace(replacements).astype(str)
        data.where(data.isin(replacements_str), other).replace(replacements_str).astype(str)
    )

    return data_cat
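
The key conversion in the patch matters because, once the Series is cast to str, the values being looked up are strings and no longer match integer keys. A minimal sketch of that mismatch in plain Python (the codes and labels here are hypothetical, not taken from feature_maps.py):

```python
# Hypothetical mapping from integer codes to category labels.
replacements = {1: "category_a", 2: "category_b"}

# After data.astype(str), each value being looked up is a string.
value = str(1)

# The string "1" does not match the integer key 1.
found_with_int_keys = value in replacements

# Converting the keys to strings, as the patch does, restores the match.
replacements_str = {str(k): v for k, v in replacements.items()}
found_with_str_keys = value in replacements_str

print(found_with_int_keys, found_with_str_keys)
```

Without the key conversion, every value would fall through to `other` after the cast.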

georgm8 commented 1 year ago

Pull request #46

vvcb commented 1 year ago

Thanks for reporting this, @georgm8. The data expected in this column are SNOMED codes, which are integers rather than strings. feature_maps.py generates the map between the SNOMED codes as integers and string categories.

Not sure why that error appears. Looking at this on my phone at the moment. Will check this evening and merge.

Regarding the SNOMED codes for missing data, I agree that they should go in feature_maps along with 0, which I think is already in there.

vvcb commented 1 year ago

@georgm8, I have rerun v0.3.1 on the LTH data and don't get this error. The following is the truncated output of good.dtypes after the first validation. There shouldn't really be any Int64 dtypes unless you are coercing columns to that dtype in a previous step. Is it possible that it was introduced to allow NaN in SNOMED columns, instead of assigning 0 or one of the allowed values for missing or unknown data?

Can you please check and close this issue if this explains it?

Also please see https://github.com/pandas-dev/pandas/issues/45729.

column	dtype
patient_id	int64
visit_id	int64
townsend_score_quintile	int64
gender	object
activage	int64
ethnos	object
accommodationstatus	int64
procodet	object
edsitecode	object
eddepttype	object
edarrivalmode	int64
edattendcat	object
edattendsource	int64
edarrivaldatetime	datetime64[ns, UTC]
edwaittime	float64
edacuity	int64
edchiefcomplaint	int64
edcomorb_01	int64
eddiag_NN	int64
edentryseq_NN	int64
eddiagqual_NN	int64
edinvest_NN	int64
edtreat_NN	int64
timeined	float64
disstatus	int64
edattenddispatch	int64
edrefservice	int64
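
One way to check for columns that were coerced to the nullable dtype, and to put them back to plain int64, might look like the sketch below. The DataFrame is made up for illustration, and using 0 as the placeholder for missing values is an assumption; substitute whichever allowed code applies to the column.

```python
import pandas as pd

# Hypothetical frame: one column was coerced to nullable Int64 upstream.
df = pd.DataFrame(
    {
        "edacuity": pd.array([1, None, 3], dtype="Int64"),
        "edarrivalmode": [10, 20, 30],
    }
)

# Find the columns whose dtype is the nullable Int64 extension type.
nullable_cols = [
    col for col, dtype in df.dtypes.items() if isinstance(dtype, pd.Int64Dtype)
]

# Assumption: 0 stands for "missing" in these columns.
for col in nullable_cols:
    df[col] = df[col].fillna(0).astype("int64")

print(nullable_cols)
print(df.dtypes)
```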

georgm8 commented 1 year ago

Thanks - you're absolutely right - I forgot to remove the Nullable Integer data type I was testing out earlier. No error with int64 data types. Closing.
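
A minimal sketch of the difference, using the original one-line body of replace_values() and made-up codes. With the pandas versions in use in this thread, the nullable Int64 Series is expected to raise the TypeError from the title, while the int64 Series is upcast and mapped normally. Other pandas versions may behave differently, so the sketch only records whether an error occurred.

```python
import pandas as pd

OTHER = "ERROR:Unmapped - Not In Refset"

# Hypothetical integer codes mapped to category labels.
replacements = {1: "a", 2: "b"}


def replace_values_original(data, replacements, other=OTHER):
    # Same expression as the original (pre-patch) function body.
    return data.where(data.isin(replacements), other).replace(replacements).astype(str)


plain = pd.Series([1, 2, 3], dtype="int64")
nullable = pd.Series([1, 2, 3], dtype="Int64")

# NumPy-backed int64: the unmapped value 3 is replaced by the string.
result_plain = replace_values_original(plain, replacements)

# Nullable Int64: writing a string into the masked array may raise TypeError.
try:
    replace_values_original(nullable, replacements)
    nullable_error = None
except TypeError as exc:
    nullable_error = exc

print(result_plain.tolist())
print(repr(nullable_error))
```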

LTHTR-DST / hdruk_avoidable_admissions

replace_values() throws TypeError: Invalid value 'ERROR:Unmapped - Not In Refset' for dtype Int64 #45