On cleaning the data

The data you receive in projects like this is often imperfect, and some cleaning needs to be done. However, take extreme care when deleting records. If you have 1 million entries and 50 records are too incomplete to be useful, deleting them will have almost no effect on your ML outcomes. If you are deleting half of the available data because only one or a couple of features are missing, you risk biasing the data, since the records that remain may no longer be representative of the whole. Every data set is different, so I won't go into detail, but deleting records should be a last resort when cleaning data because of the issues it can cause in analysis. If a necessary feature is missing from a record, use the mean or median of that feature across the other records in the data set as a placeholder. This will usually affect the analysis little or not at all, depending on the data.
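To make the suggestion concrete, here is a minimal sketch in plain Python of checking how much of a feature is missing and filling the gaps with the median instead of dropping rows. The helper names (`missing_fraction`, `impute_median`), the field names (`age`, `height`), and the sample rows are all made up for illustration and are not taken from this project's data.

```python
from statistics import median


def missing_fraction(records, feature):
    """Return the share of records whose `feature` is missing (None)."""
    if not records:
        return 0.0
    missing = sum(1 for r in records if r.get(feature) is None)
    return missing / len(records)


def impute_median(records, feature):
    """Return copies of the records with missing `feature` values
    replaced by the median of the observed values."""
    observed = [r[feature] for r in records if r.get(feature) is not None]
    if not observed:
        # Nothing to compute a median from, so leave the data unchanged.
        return [dict(r) for r in records]
    fill = median(observed)
    return [
        {**r, feature: fill} if r.get(feature) is None else dict(r)
        for r in records
    ]


# Hypothetical sample data: one of four rows lacks a height.
rows = [
    {"age": 25, "height": 170.0},
    {"age": 31, "height": None},
    {"age": 40, "height": 180.0},
    {"age": 22, "height": 165.0},
]

frac = missing_fraction(rows, "height")
cleaned = impute_median(rows, "height")  # same number of rows, gap filled
```

Checking `frac` first is a cheap way to see whether dropping rows would cost 50 records out of a million or half the data set. If you are working in pandas, `df[col].fillna(df[col].median())` does the same kind of fill on a column.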

andrewhercules / date-a-scientist

On cleaning the data #1