bojankikumar / Scalpel


Check and remove potential invalid entries #1

Open godfrey-scalpel opened 3 years ago

godfrey-scalpel commented 3 years ago

Quick comments:

You should first check for and remove invalid entries for each feature individually. For example, it should be obvious that "age" and "blood loss (ml)" cannot have negative values, and an age of 250 is almost certainly invalid as well. Replace such values with NaN (and impute them later) before moving on to further EDA such as examining correlations between features

You may also check whether the timestamp and duration entries are all valid, i.e. are there entries not in the correct format, e.g. "30:70:90"?
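As a sketch of what such a format check might look like (the helper name and the HH:MM:SS assumption are mine, not from the project):

```python
import re

# Hypothetical helper: validate "HH:MM:SS"-style duration strings, where
# minutes and seconds must be 00-59, so an entry like "30:70:90" is rejected.
DURATION_RE = re.compile(r"^\d{1,2}:[0-5]\d:[0-5]\d$")

def is_valid_duration(entry) -> bool:
    """Return True if the entry matches H:MM:SS / HH:MM:SS with sane minute and second fields."""
    return bool(DURATION_RE.match(str(entry)))
```

You could then apply this across the duration column and replace any entries that fail the check with NaN, the same way as for the numeric features.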

godfrey-scalpel commented 3 years ago

Instead of manually checking and replacing invalid entries one by one, like data['age'].replace('-1',np.nan,inplace=True), try writing your own functions (or classes) that automate these checks and "imputation" and can be reused, e.g.

import numpy as np

def check_and_remove_invalid_numbers(
        value: float,
        min_allowed: float,
        max_allowed: float
) -> float:
    """
    A simple function that checks whether an entry is within the accepted
    range and replaces it with NaN if invalid.
    """
    try:
        if min_allowed <= value <= max_allowed:
            return value
        return np.nan
    except TypeError:
        # Non-numeric entries are passed through unchanged.
        return value

# Apply the function to numerical feature columns and assign the result back, e.g.
data['age'] = data['age'].apply(lambda x: check_and_remove_invalid_numbers(x, 0, 200))
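If the column is already numeric, the same range check can also be done in one vectorised pandas step. A minimal sketch, with an illustrative DataFrame and bounds of my own choosing:

```python
import pandas as pd

# Hypothetical data for illustration only.
data = pd.DataFrame({'age': [25, -1, 250, 60]})

# Series.where keeps values where the condition holds and fills the rest
# with NaN, so out-of-range ages become missing in a single step.
data['age'] = data['age'].where(data['age'].between(0, 200))
```

This avoids a Python-level loop over every row, which matters once the dataset grows.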

This is a simple illustration of how data cleaning and preprocessing typically work in ML projects. The repo I shared with you earlier has some more examples of these

godfrey-scalpel commented 3 years ago

There is an entry T200 in the ICD10 column, which is potentially invalid. You could write a regex check to verify that all ICD10 or OPCS codes are of the correct format, i.e. [A-Z]\d{2}
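A minimal sketch of such a regex check with pandas (the example codes are made up, and real ICD-10 subcategory codes like A01.1 would need a looser pattern than the strict three-character one used here):

```python
import pandas as pd

# Hypothetical ICD10 column; real data would come from the project's dataset.
codes = pd.Series(['A01', 'T20', 'T200', 'Z99'])

# One uppercase letter followed by exactly two digits; the $ anchor makes
# over-long entries like 'T200' fail the match.
valid_mask = codes.str.match(r'^[A-Z]\d{2}$')
invalid_codes = codes[~valid_mask]
```

Flagging the non-matching entries first, rather than silently dropping them, lets you eyeball whether they are typos or genuinely longer subcategory codes.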

You might also try finding references online for the lists of valid codes (or in the repo I shared earlier)