Closed dluts closed 19 hours ago
The multivariate analysis also refers to the conclusion from the univariate analysis, but there is no clear conclusion there. The multivariate analysis conclusion could also be a bit clearer about what the results mean.
Within the univariate outlier detection two records are dropped, but no clear explanation is given:

```python
train_V2.drop(1979, inplace=True)
train_V2.drop(3763, inplace=True)
```
In general this section seems a bit light on commentary on why we keep the other features.
Good point, I will put in some more comments there; it will be in an upcoming PR.
Put in:
Jumping ahead a little: we tried training the gradient boosting machine for profit both with and without these outliers. Leaving out these data points improved the R2 score on the test set noticeably (roughly from 0.80 to 0.84). However, the question remains whether those records really are outliers, and whether judging only by the R2 score is the best strategy; after all, a few outliers that do not end up on the y = x line can significantly impact the R2. Also, the profit values quoted in the dataset do not seem wrong, and judging from the other features for those guests, such as income, profit_las and profit_am, those really are very highly profitable guests. In the end we decided to leave out those 2 entries based on the improvement in R2, but this could definitely be explored further, especially in the context of predicting the damage amount, which did not seem to go very well. Unfortunately, 3 working class people here on the other side...
Added the rationale for dropping the outliers to the consolidated notebook, as commented by Bino (+ referred to this ticket)
Added extra information on multivariate analysis (+ referred to this ticket)
In addition to the univariate analysis above, we also looked at an unsupervised technique for outlier detection, namely isolation forests.
For the isolation forest we used data from both the training set and the scoring set. In this way we aim to prevent drawing conclusions on the training set that do not apply to the scoring set.
The output of an Isolation Forest model typically includes an anomaly score for each data point. The anomaly score measures how different or isolated a data point is compared to the rest of the data.
According to the isolation forest, more than 10% of the data would be outliers. Looking at the data, we believe that this percentage is too high. Therefore we consider this to confirm the conclusion from the univariate analysis, namely that there are no outliers.
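A minimal sketch of how such an isolation-forest check can be run with scikit-learn. The combined frame, its columns, and the random stand-in data are assumptions for illustration, not the project's actual data:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import IsolationForest

# Stand-in for the combined training + scoring data used above.
rng = np.random.default_rng(0)
combined = pd.DataFrame({
    "income": rng.normal(50_000, 10_000, 500),
    "profit": rng.normal(1_000, 300, 500),
})

iso = IsolationForest(random_state=0).fit(combined)
labels = iso.predict(combined)        # -1 = flagged as outlier, 1 = inlier
scores = iso.score_samples(combined)  # lower score = more isolated / anomalous

outlier_fraction = (labels == -1).mean()
print(f"fraction flagged as outliers: {outlier_fraction:.1%}")
```

With the default `contamination="auto"`, the flagged fraction depends on the score distribution, so a flagged share well above 10% is as much a hint to revisit the threshold or feature scaling as it is evidence about the data itself.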
However, training the GBM with and without specific outliers indicated that excluding 2 of them improved performance considerably.
https://github.com/Marijkevandesteene/MachineLearning/issues/41
Added to the last version