andrewhercules / date-a-scientist

Capstone project for Codecademy's Machine Learning Fundamentals course

On cleaning the data #1

Closed alexander-dubinski closed 5 years ago

alexander-dubinski commented 5 years ago

Often the data you receive in projects like this is not perfect, and some cleaning needs to be done. However, extreme caution should be exercised when deleting records. If you have 1 million entries and 50 records are too incomplete to be useful, deleting them will have almost no effect on your ML outcomes. If you are deleting half of the available data because only one or a few features are missing, you risk biasing what remains, since the records with missing values may differ systematically from the complete ones. Every data set is different, so I won't go into detail, but deleting records should be a last resort when cleaning data because of the problems it can cause in analysis. If a necessary numeric feature is missing from a record, use the mean or median of that feature across the other records as a placeholder. Depending on the data, this will usually affect the analysis very little or not at all.
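As a rough illustration of the placeholder approach, here is a minimal sketch in plain Python. The `records` list, the feature names `age` and `income`, and the helper `impute_median` are all made up for the example and are not taken from this project's data or notebook. It fills each missing value with the median of the observed values for that feature instead of dropping the record.

```python
from statistics import median

# Hypothetical records; None marks a missing value.
records = [
    {"age": 25, "income": 40000},
    {"age": 31, "income": None},
    {"age": None, "income": 55000},
    {"age": 40, "income": 70000},
]


def impute_median(rows, feature):
    """Return copies of rows with missing `feature` values replaced by
    the median of the values that are present for that feature."""
    observed = [r[feature] for r in rows if r[feature] is not None]
    fill = median(observed)
    return [
        {**r, feature: fill if r[feature] is None else r[feature]}
        for r in rows
    ]


# Impute each feature in turn; no record is deleted.
cleaned = records
for feature in ("age", "income"):
    cleaned = impute_median(cleaned, feature)
```

If the data is already in a pandas DataFrame, something like `df["age"].fillna(df["age"].median())` should do the same job for a single column.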

andrewhercules commented 5 years ago

Thank you for the tip @addubinski - it's much appreciated! I'm going to build another notebook to answer other questions I had about the data and I will definitely keep this in mind! :-)