godfrey-scalpel opened 3 years ago
Instead of manually checking and replacing invalid entries like this: `data['age'].replace('-1', np.nan, inplace=True)`, try writing your own functions (or classes) to automate these checks and "imputation" so they can be reused, e.g.
```python
import numpy as np


def check_and_remove_invalid_numbers(
    value: float,
    min_allowed: float,
    max_allowed: float
) -> float:
    """
    A simple function that checks whether an entry is within the accepted
    range, replacing it with NaN if invalid.
    """
    try:
        if max_allowed >= value >= min_allowed:
            return value
        return np.nan
    except TypeError:
        # Non-numeric entries (e.g. strings, None) are returned unchanged.
        return value


# Apply the function to numerical feature columns, e.g.
data['age'] = data['age'].apply(lambda x: check_and_remove_invalid_numbers(x, 0, 200))
```
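Note that the `try/except TypeError` means non-numeric entries (e.g. strings or `None`) pass through unchanged, so they need a separate check, and the result of `apply` has to be assigned back since it does not modify the column in place.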
This is a simple illustration of how data cleaning or preprocessing typically works in ML projects. The repo I shared with you earlier has some more examples of these.
There is an entry `T200` in the ICD10 column, which is potentially invalid. You could write a regex function to check whether all ICD10 or OPCS codes are of the correct format, i.e. `[A-Z]\d{2}`.
You might also try to find references online to get lists of valid codes (or see the repo I shared earlier).
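As a rough sketch of that regex check, assuming the column is named `ICD10` (real ICD-10 codes may carry extra characters beyond the first three, so treat this as a first-pass format check, not a full validator):

```python
import re

# Basic pattern from above: one uppercase letter followed by two digits.
ICD10_PATTERN = re.compile(r'^[A-Z]\d{2}$')

def is_valid_icd10(code) -> bool:
    """Return True if `code` matches the basic letter-plus-two-digits format."""
    return isinstance(code, str) and bool(ICD10_PATTERN.match(code))

# Flag entries that fail the check, e.g. "T200" would be caught here.
invalid_mask = ~data['ICD10'].apply(is_valid_icd10)
print(data.loc[invalid_mask, 'ICD10'])
```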
Quick comments:
You should first check and remove any invalid entries for each feature individually, e.g. it should be obvious that "age" and "blood loss (ml)" cannot have negative values, and an age of 250 is likely to be invalid as well. Replace such entries with NaN (then impute them later) before moving on to further EDA such as examining correlations between features.
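As a sketch of that first pass, reusing the function defined above (the column names and bounds here are placeholders; pick limits that make sense for your data):

```python
# Hypothetical per-column valid ranges; adjust to your dataset.
valid_ranges = {
    'age': (0, 120),
    'blood loss (ml)': (0, 5000),
}

# Replace out-of-range values with NaN using check_and_remove_invalid_numbers.
for col, (lo, hi) in valid_ranges.items():
    data[col] = data[col].apply(
        lambda x: check_and_remove_invalid_numbers(x, lo, hi)
    )
```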
You may also check whether the timestamp and duration entries are all valid, i.e. are there entries not in the correct format, e.g. "30:70:90"?
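One way to surface malformed entries (the column names `timestamp` and `duration` are assumptions here) is to let pandas attempt the parse and flag whatever fails, plus a strict regex for `HH:MM:SS`-style durations:

```python
import pandas as pd

# Timestamps: anything pd.to_datetime cannot parse becomes NaT,
# so comparing against the original column flags malformed entries.
parsed = pd.to_datetime(data['timestamp'], errors='coerce')
print(data.loc[parsed.isna() & data['timestamp'].notna(), 'timestamp'])

# Durations: a strict pattern rejects out-of-range components such as
# "30:70:90" (70 minutes, 90 seconds). NaN values become the string
# "nan" after astype(str) and are flagged as well.
duration_ok = data['duration'].astype(str).str.match(r'^\d{1,2}:[0-5]\d:[0-5]\d$')
print(data.loc[~duration_ok, 'duration'])
```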