Marijkevandesteene / MachineLearning

repo to share progress and to manage versions of exam MachineLearning (M14)

Univariate outlier detection #41

Closed dluts closed 19 hours ago

dluts commented 2 days ago

Within the univariate outlier detection two records are dropped, but no clear explanation is given:

train_V2.drop(1979, inplace=True)
train_V2.drop(3763, inplace=True)

In general this section seems a bit light on commentary on why we keep the other features.

dluts commented 2 days ago

The multivariate analysis also refers to the conclusion from the univariate analysis, but there is no clear conclusion.

dluts commented 2 days ago

Actually, the conclusion of the multivariate analysis could also be a bit clearer about what the results mean.

binomaiheu commented 2 days ago

Within the univariate outlier detection two records are dropped, but no clear explanation is given:

train_V2.drop(1979, inplace=True)
train_V2.drop(3763, inplace=True)

In general this section seems a bit light on commentary on why we keep the other features.

Good point, I'll add some more comments there; they will be in an upcoming PR.

binomaiheu commented 2 days ago

Put in:

Jumping ahead a little, we tried training the gradient boosting machine for the profit both with and without these outliers. When leaving out these data points, the R2 score on the test set did improve noticeably (roughly from 0.80 to 0.84). However, the question remains whether those records really are outliers, and whether judging only by the R2 score is the best strategy; after all, a few outliers that do not end up on the y = x line can significantly impact the R2. Also, the profit values quoted in the dataset don't seem wrong, and judging from the other features for those guests, such as income, profit_las and profit_am, these really are very highly profitable guests. In the end we decided to leave out those 2 entries based on the improvement in R2, but this could definitely be explored further, especially in the context of predicting the damage amount, which did not seem to go very well. Unfortunately, 3 working class people here on the other side...
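
For reference, a minimal sketch of what such a with/without comparison could look like, assuming train_V2 is the prepared training DataFrame with numeric features and a "profit" target column; the feature handling, split and GBM settings below are placeholders, not the notebook's actual setup:

from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score

def gbm_test_r2(df, target="profit"):
    # hold out a test set, fit a GBM on the rest and report the test-set R2
    X = df.drop(columns=[target])
    y = df[target]
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42
    )
    model = GradientBoostingRegressor(random_state=42).fit(X_train, y_train)
    return r2_score(y_test, model.predict(X_test))

# compare the test-set R2 with and without the two suspected outliers
print("with outliers   :", gbm_test_r2(train_V2))
print("without outliers:", gbm_test_r2(train_V2.drop([1979, 3763])))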

Marijkevandesteene commented 1 day ago

Added Bino's comment on dropping the outliers to the consolidated notebook (+ referred to this ticket)

Marijkevandesteene commented 1 day ago

Added extra information on the multivariate analysis (+ referred to this ticket)

Marijkevandesteene commented 20 hours ago

Multivariate analysis

In addition to the univariate analysis above, we also looked at an unsupervised technique for outlier detection, namely Isolation Forests.

For the isolation forest we used data from both the training set and the scoring set. In this way we want to avoid drawing conclusions from the training set that do not apply to the scoring set.

The output of an Isolation Forest model typically includes an anomaly score for each data point. The anomaly score measures how different or isolated a data point is compared to the rest of the data.

According to the Isolation Forest, more than 10% of the data are outliers. Looking at the data, we believe that this percentage is too high. We therefore take this as confirming the conclusion from the univariate analysis, namely that there are no outliers.
However, experimenting with whether or not to include specific outliers when training the GBM showed that excluding 2 of them improved performance considerably.

https://github.com/Marijkevandesteene/MachineLearning/issues/41
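
As an illustration only, a minimal sketch of such an isolation-forest pass, assuming train_V2 and score_V2 are the prepared training and scoring DataFrames with the same numeric feature columns (the name score_V2 and the default contamination setting are assumptions, not taken from the notebook):

import pandas as pd
from sklearn.ensemble import IsolationForest

# fit on the combined training + scoring data, as described above
combined = pd.concat([train_V2, score_V2], ignore_index=True)
iso = IsolationForest(random_state=42).fit(combined)

# decision_function: lower (more negative) scores mean more isolated points
anomaly_score = iso.decision_function(combined)
labels = iso.predict(combined)  # -1 = flagged as outlier, +1 = inlier

print("fraction flagged as outlier:", (labels == -1).mean())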

dluts commented 19 hours ago

Added to the last version