Marijkevandesteene / MachineLearning

repo to share progress and to manage versions of exam MachineLearning (M14)

Processing of scoreX_pos and scoreX_neg #8

Closed binomaiheu closed 5 months ago

binomaiheu commented 5 months ago

Interesting, score5_neg looks very different. I am looking into averaging those score variables across the 5 hotels, so that we have fewer missing values, but this one has to be sorted out first:

image

binomaiheu commented 5 months ago

I propose to rescale the "Gaussian-like" distribution to a uniform distribution like the rest, assuming that the data in there does hold some value.

Code:

import scipy.stats as stats

train_V2["score5_neg_uniform"] = ( train_V2["score5_neg"] - train_V2["score5_neg"].mean() ) / train_V2["score5_neg"].std()
train_V2["score5_neg_uniform"] = stats.norm.cdf(train_V2["score5_neg_uniform"])

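So: first subtract the mean and divide by the standard deviation to get a standardised z-score, then apply the standard normal CDF. If score5_neg is approximately Gaussian, that should yield an approximately uniform distribution on [0, 1]. Some plots: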

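import matplotlib.pyplot as plt

# Top row: histograms before/after; bottom row: scatter against score5_pos
fig, axs = plt.subplots(2, 2, figsize=(12, 12))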
train_V2["score5_neg"].hist(ax=axs[0][0])
train_V2["score5_neg_uniform"].hist(ax=axs[0][1])

axs[1][0].plot(train_V2["score5_neg"], train_V2["score5_pos"], '.')
axs[1][1].plot(train_V2["score5_neg_uniform"], train_V2["score5_pos"], '.')

axs[0][0].set_title("score5_neg histogram")
axs[0][1].set_title("score5_neg_uniform histogram")
axs[1][0].set_title("score5_neg vs score5_pos")
axs[1][1].set_title("score5_neg_uniform vs score5_pos")

image
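As a quick sanity check of the z-score + normal CDF idea, here is a minimal sketch on synthetic data. The series `raw` is a made-up stand-in for `train_V2["score5_neg"]` (the real data is not reproduced here), and the Kolmogorov-Smirnov check against U(0, 1) is an extra step that is not part of the proposal above.

```python
import numpy as np
import pandas as pd
from scipy import stats

rng = np.random.default_rng(0)

# Hypothetical stand-in for train_V2["score5_neg"]: roughly Gaussian, with some NaNs.
raw = pd.Series(rng.normal(loc=3.0, scale=1.5, size=5000))
raw.iloc[::50] = np.nan

# Standardise (pandas skips NaN in mean/std), then map through the standard normal CDF.
z = (raw - raw.mean()) / raw.std()
uniform = pd.Series(stats.norm.cdf(z), index=raw.index)  # NaN stays NaN

# Distance between the transformed values and U(0, 1); small means "close to uniform".
ks = stats.kstest(uniform.dropna(), "uniform")
print(f"KS statistic vs U(0,1): {ks.statistic:.4f}")
```

The transform only flattens the histogram to the extent that the Gaussian assumption holds; on clearly skewed data the result will not be uniform.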

@Marijkevandesteene, @dluts, what do you think?

binomaiheu commented 5 months ago

We can then calculate the mean scores and include those in the model?

train_V2["score_pos"] = train_V2[["score1_pos", "score2_pos", "score3_pos", "score4_pos", "score5_pos"]].mean(axis=1)
train_V2["score_neg"] = train_V2[["score1_neg", "score2_neg", "score3_neg", "score4_neg", "score5_neg_uniform"]].mean(axis=1)

That yields :

image

And leads to

which is already a lot better than before. The question is how to deal with the remaining NaNs. I propose simply imputing a score of 0.5 (the mean of a uniform score on [0, 1]) for each?

On the other hand, there doesn't seem to be much correlation with the outcome variables either, so not sure if this is all worth the effort.
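The row-wise averaging and the proposed 0.5 imputation could look roughly like this. The toy frame and its values are made up for illustration, and the `score_neg_missing` indicator column is an assumption of this sketch, not something proposed in the thread.

```python
import numpy as np
import pandas as pd

# Toy frame with made-up values; only three of the score columns are shown.
df = pd.DataFrame({
    "score1_neg": [0.2, np.nan, np.nan],
    "score2_neg": [0.4, 0.9, np.nan],
    "score5_neg_uniform": [np.nan, 0.7, np.nan],
})
neg_cols = ["score1_neg", "score2_neg", "score5_neg_uniform"]

# mean(axis=1) skips NaN by default, so a row is NaN only if ALL scores are missing.
df["score_neg"] = df[neg_cols].mean(axis=1)

# Hypothetical extra: remember which rows were imputed, so a model can use that.
df["score_neg_missing"] = df["score_neg"].isna()

# Impute the remaining NaNs with 0.5, the mean of a uniform score on [0, 1].
df["score_neg"] = df["score_neg"].fillna(0.5)
print(df)
```

Because the mean skips missing entries, averaging over the five hotels already removes most NaNs; the constant fill only touches rows where every score is missing.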

binomaiheu commented 5 months ago

Now that I think about it, didn't the dictionary say "given as quantile"? So that does justify this. Possibly a bit better still with an ECDF instead of assuming a Gaussian distribution, but so be it.
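For reference, an ECDF-based variant could be sketched as follows. It uses `Series.rank(pct=True)` as the empirical CDF, which is one possible choice and not something tried in this thread; the skewed synthetic `raw` series is again a made-up stand-in for the real column.

```python
import numpy as np
import pandas as pd
from scipy import stats

rng = np.random.default_rng(1)

# Hypothetical, deliberately non-Gaussian stand-in for a score column, with some NaNs.
raw = pd.Series(rng.lognormal(mean=0.0, sigma=1.0, size=2000))
raw.iloc[::40] = np.nan

# Empirical CDF: rank of each value divided by the number of non-NaN values.
# NaNs are kept as NaN (na_option="keep" is the default).
ecdf_uniform = raw.rank(pct=True)

# Gaussian-assumption version from earlier in the thread, for comparison.
gauss_uniform = pd.Series(
    stats.norm.cdf((raw - raw.mean()) / raw.std()), index=raw.index
)

ks_ecdf = stats.kstest(ecdf_uniform.dropna(), "uniform")
ks_gauss = stats.kstest(gauss_uniform.dropna(), "uniform")
print(f"ECDF: {ks_ecdf.statistic:.4f}  Gaussian CDF: {ks_gauss.statistic:.4f}")
```

The rank-based version is uniform by construction whatever the shape of the input, whereas the Gaussian CDF version depends on the normality assumption. Note that ranks computed this way are relative to the sample they are computed on, so a test set would need to be mapped through the training ECDF rather than ranked separately.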

binomaiheu commented 5 months ago

Idea from @dluts:

from the discussion: