AmandaFranklinRyan / SupervisedMachineLearning


Dealing with outliers #4

Open AmandaFranklinRyan opened 10 months ago

AmandaFranklinRyan commented 10 months ago

I wasn't sure how to handle outliers for the different variables, so I wrote a function that shows how much data would be removed depending on how we define outliers. Here is a plot showing the results for selected variables:

[Figure: Histograms for selected variables]

Do you think this is a reasonable approach? We can of course easily change the cutoff values, redraw the plots, and correct the values in the dataframe.
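For anyone following along, here is a minimal sketch of the kind of check described above, not the actual function from the repo. It assumes the IQR rule as the outlier definition and a multiplier `k` as the cutoff being varied; the names `iqr_fences` and `fraction_removed` and the `sample` values are hypothetical. It uses only the standard library so it runs anywhere; on a dataframe column, `Series.quantile(0.25)` and `Series.quantile(0.75)` should give the same quartiles with pandas' default linear interpolation.

```python
import statistics


def iqr_fences(values, k=1.5):
    """Return (lower, upper) fences: Q1 - k*IQR and Q3 + k*IQR."""
    # method="inclusive" interpolates linearly between data points.
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def fraction_removed(values, lower, upper):
    """Share of values that fall outside [lower, upper]."""
    outside = sum(1 for v in values if v < lower or v > upper)
    return outside / len(values)


# Hypothetical column with one moderate and one extreme value.
sample = [10, 11, 12, 12, 13, 13, 14, 15, 20, 95]

# Compare how much data each cutoff would drop.
for k in (1.5, 3.0):
    lower, upper = iqr_fences(sample, k)
    share = fraction_removed(sample, lower, upper)
    print(f"k={k}: fences=({lower}, {upper}), removes {share:.0%}")
```

A wider multiplier keeps the moderate value and drops only the extreme one, which is the trade-off the histograms are meant to make visible.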

AmandaFranklinRyan commented 10 months ago

Actually, I don't think this is important at all. I calculated the metrics for the random forest model with and without the cleaning, and it made no difference :)

linanita22 commented 10 months ago

Thanks for checking whether it makes a difference to the predictions! I can imagine that we also have outliers in the test data, so maybe that's why it doesn't change the outcome... How did you define outliers? As values below Q1 - 1.5*IQR or above Q3 + 1.5*IQR?
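To make the question concrete, here is a rough sketch of that rule with the fences computed on the training data only and then applied to the test data. This is an assumption about how the cleaning could be set up, not a description of what was done in the repo; `fit_fences`, `clip_to_fences`, and the `train` and `test` values are hypothetical, and clipping is just one possible way to "correct" values (dropping rows is another).

```python
import statistics


def fit_fences(train_values, k=1.5):
    """Compute Q1 - k*IQR and Q3 + k*IQR from the training data only."""
    q1, _, q3 = statistics.quantiles(train_values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def clip_to_fences(values, lower, upper):
    """Pull values outside the fences back to the nearest fence."""
    return [min(max(v, lower), upper) for v in values]


# Hypothetical data: the test set contains outliers on both sides.
train = [1, 2, 3, 4, 5, 6, 7, 8, 9]
test = [-50, 5, 100]

lower, upper = fit_fences(train)               # fences come from train
clipped = clip_to_fences(test, lower, upper)   # same fences reused on test
print(lower, upper, clipped)
```

Reusing the training fences on the test set keeps the two treated consistently and avoids using test data to decide what counts as an outlier.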