Open AmandaFranklinRyan opened 10 months ago
I checked the data again and it seems the sample is balanced, so there's no need to worry about this.
This is the boxplot for the original rent data:
Perhaps I'm misinterpreting the boxplot, but I thought it meant the data wasn't imbalanced. What should we do with the outliers, though?
For comparison, this is the histogram:
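On the outlier question: most plotting libraries draw a point as a boxplot outlier when it lies more than 1.5 × IQR beyond the quartiles, so we can list those points directly instead of reading them off the plot. A minimal sketch, assuming that default rule; the `rents` values are made up for illustration and are not taken from our data, and `iqr_outliers` is a hypothetical helper name:

```python
import statistics


def iqr_outliers(values, k=1.5):
    """Return (lower_fence, upper_fence, outliers) using the k * IQR rule."""
    # quantiles(n=4) returns the three quartile cut points Q1, Q2, Q3
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    outliers = [v for v in values if v < lower or v > upper]
    return lower, upper, outliers


# Hypothetical rents for illustration only
rents = [650, 700, 720, 750, 800, 820, 850, 900, 950, 3200]
lower, upper, outliers = iqr_outliers(rents)
print(lower, upper, outliers)
```

Whether to drop, cap, or keep the flagged points is a separate decision. If they are genuine expensive properties rather than data-entry errors, removing them would hide exactly the cases the model needs to learn.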
When do we talk about an unbalanced/balanced dataset?
Judging by the performance of the random forest model, it looks to me like the model isn't working well for the most expensive properties. I was thinking we could maybe use over/undersampling to correct this. I'll look into it, and also add tuning parameters to the random forest model to see if that boosts performance. The R squared is currently around 0.75, I think.
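Over/undersampling is usually described for classification labels. For a continuous target like rent, one rough analogue is to duplicate the rows in the upper tail so the model sees them more often. A minimal sketch of that idea, not a settled approach; the `rent` key, the 90th-percentile cut-off, the duplication factor, and the `oversample_high_target` helper are all assumptions for illustration:

```python
import random
import statistics


def oversample_high_target(rows, target_key, percentile=90, factor=3, seed=0):
    """Return a shuffled copy of rows in which every row whose target lies
    above the given percentile appears `factor` times in total."""
    targets = [row[target_key] for row in rows]
    # quantiles(n=100) returns the 1st..99th percentile cut points
    cut = statistics.quantiles(targets, n=100, method="inclusive")[percentile - 1]
    high = [row for row in rows if row[target_key] > cut]
    resampled = rows + high * (factor - 1)  # new list; `rows` is not modified
    random.Random(seed).shuffle(resampled)
    return resampled


# Hypothetical training rows for illustration only
train_rows = [{"rent": 500 + 10 * i} for i in range(100)]
resampled = oversample_high_target(train_rows, "rent")
print(len(train_rows), len(resampled))
```

This should only be applied to the training split, after the train/test split, so that duplicated rows cannot leak into the test set and inflate the R squared.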