Processing of scoreX_pos and scoreX_neg

binomaiheu commented 5 months ago

Interesting, score5_neg looks very different. I am looking into averaging the score variables over the 5 hotels, which would leave us with fewer missing values, but this one has to be dealt with first:

binomaiheu commented 5 months ago

I propose to rescale the "Gaussian-like" distribution to a uniform distribution like the rest, assuming that the data in there does hold some value.

Code:

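import scipy.stats as stats

# Standardise score5_neg to a z-score (NaN values stay NaN)
train_V2["score5_neg_uniform"] = ( train_V2["score5_neg"] - train_V2["score5_neg"].mean() ) / train_V2["score5_neg"].std()
# Map the z-score through the standard normal CDF to land in [0, 1]
train_V2["score5_neg_uniform"] = stats.norm.cdf(train_V2["score5_neg_uniform"])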

So: first subtract the mean and divide by the standard deviation to get a standardised z-score, then apply the standard normal CDF. If score5_neg is roughly Gaussian, that should yield a roughly uniform distribution on [0, 1]. A plot:

import matplotlib.pyplot as plt

fig, axs = plt.subplots(2,2, figsize=(12,12))
train_V2["score5_neg"].hist(ax=axs[0][0])
train_V2["score5_neg_uniform"].hist(ax=axs[0][1])

axs[1][0].plot(train_V2["score5_neg"], train_V2["score5_pos"], '.')
axs[1][1].plot(train_V2["score5_neg_uniform"], train_V2["score5_pos"], '.')

axs[0][0].set_title("score5_neg histogram")
axs[0][1].set_title("score5_neg_uniform histogram")
axs[1][0].set_title("score5_neg vs score5_pos")
axs[1][1].set_title("score5_neg_uniform vs score5_pos")
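
For anyone who wants to check the idea without the real data, here is a minimal sketch on synthetic values. The `demo` frame, the Gaussian parameters and the 10% missing rate are all made up for illustration; only the transform itself mirrors the code above. A Kolmogorov-Smirnov test against U(0, 1) gives a rough indication of how uniform the result is.

```python
import numpy as np
import pandas as pd
import scipy.stats as stats

rng = np.random.default_rng(0)

# Hypothetical stand-in for score5_neg: Gaussian-like values, 10% missing
demo = pd.DataFrame({"score5_neg": rng.normal(loc=0.4, scale=0.1, size=5000)})
demo.loc[demo.sample(frac=0.1, random_state=0).index, "score5_neg"] = np.nan

# Same transform as above: z-score, then standard normal CDF
z = (demo["score5_neg"] - demo["score5_neg"].mean()) / demo["score5_neg"].std()
demo["score5_neg_uniform"] = stats.norm.cdf(z)  # NaN in, NaN out

# Compare the non-missing transformed values with a uniform distribution on [0, 1]
ks = stats.kstest(demo["score5_neg_uniform"].dropna(), "uniform")
print("KS statistic:", ks.statistic)
```

A small KS statistic here only says the transform works when the input really is Gaussian; it says nothing about whether score5_neg itself is.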

@Marijkevandesteene , @dluts, what do you think ?

binomaiheu commented 5 months ago

We can then calculate the mean scores and include those in the model?

train_V2["score_pos"] = train_V2[["score1_pos", "score2_pos", "score3_pos", "score4_pos", "score5_pos"]].mean(axis=1)
train_V2["score_neg"] = train_V2[["score1_neg", "score2_neg", "score3_neg", "score4_neg", "score5_neg_uniform"]].mean(axis=1)

That yields :

And leads to:

1544 remaining NaNs in score_pos
1113 remaining NaNs in score_neg

which is already a lot better than before. The question is how to deal with the remaining NaNs. I propose to simply impute a score of 0.5 for each (the expected mean of a uniformly distributed score)?

On the other hand, there doesn't seem to be much correlation with the outcome variables either, so not sure if this is all worth the effort.
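
To make the averaging and the proposed 0.5 fill concrete, a minimal sketch on a toy frame. The three-column `demo` frame and the `score_pos_filled` name are hypothetical; the real data has five hotels. The point is that `mean(axis=1)` skips NaN by default, so a row only stays missing when every hotel score is missing.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: three hotels instead of five
demo = pd.DataFrame({
    "score1_pos": [0.2, np.nan, np.nan],
    "score2_pos": [0.4, 0.8, np.nan],
    "score3_pos": [np.nan, 0.6, np.nan],
})
pos_cols = ["score1_pos", "score2_pos", "score3_pos"]

# Row-wise mean over the available scores (skipna=True is the default)
demo["score_pos"] = demo[pos_cols].mean(axis=1)

# Rows where all hotel scores are missing remain NaN
n_missing = int(demo["score_pos"].isna().sum())

# Proposed fallback: fill the remaining NaNs with 0.5
demo["score_pos_filled"] = demo["score_pos"].fillna(0.5)
print(demo)
```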

binomaiheu commented 5 months ago

Now that I think about it, didn't that dictionary say "given as quantile"? That would justify this. Possibly a bit better still with the ECDF instead of assuming a Gaussian distribution, but so be it.
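
A minimal sketch of the ECDF variant, for comparison. It assumes the ECDF is approximated by the fractional rank (`Series.rank(pct=True)`), and uses a deliberately skewed synthetic series so the difference with the Gaussian assumption is visible; none of this is run on the real score5_neg.

```python
import numpy as np
import pandas as pd
import scipy.stats as stats

rng = np.random.default_rng(1)

# Hypothetical skewed data, where the Gaussian assumption is clearly wrong
skewed = pd.Series(rng.exponential(scale=1.0, size=2000))

# Variant 1: z-score followed by the standard normal CDF
z = (skewed - skewed.mean()) / skewed.std()
u_gauss = pd.Series(stats.norm.cdf(z), index=skewed.index)

# Variant 2: empirical CDF via fractional ranks (NaN values would stay NaN)
u_ecdf = skewed.rank(pct=True)

# Distance of each result from a uniform distribution on [0, 1]
ks_gauss = stats.kstest(u_gauss, "uniform").statistic
ks_ecdf = stats.kstest(u_ecdf, "uniform").statistic
print("Gaussian CDF:", ks_gauss, "ECDF:", ks_ecdf)
```

The rank-based version is uniform by construction on the data it is computed from; applying it to new data would need the training values to be stored, which the Gaussian version avoids by keeping only a mean and a standard deviation.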

binomaiheu commented 5 months ago

Idea from @dluts:

Introduce a categorical variable that indicates which hotel the mean comes from, or whether it is missing (despite being imputed via an imputer); that way we keep the information!
Or/and use the add_indicator parameter of the SimpleImputer to reconstruct the missings --> but better to introduce new features.

From the discussion:

take the mean of the positive & negative scores
fill in the gaps with a SimpleImputer
create 6 categorical variables (0/1) for hotel number or missing
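
One possible reading of these three steps, as a minimal sketch. The toy `demo` frame, the flag names `neg_from_hotel1` ... `neg_from_hotel5` and `neg_missing`, and the choice of a constant fill of 0.5 are assumptions for illustration; the thread does not pin down exactly how the 6 variables are defined.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy data for the five negative scores
demo = pd.DataFrame({
    "score1_neg": [0.2, np.nan, np.nan],
    "score2_neg": [0.4, 0.8, np.nan],
    "score3_neg": [np.nan, 0.6, np.nan],
    "score4_neg": [np.nan, np.nan, np.nan],
    "score5_neg_uniform": [0.6, np.nan, np.nan],
})
neg_cols = list(demo.columns)

# Step 1: mean over the available hotel scores
demo["score_neg"] = demo[neg_cols].mean(axis=1)

# Step 3 (done before imputing, so the missingness is still visible):
# five 0/1 flags for which hotels contributed, plus one 0/1 flag for "all missing"
for i, col in enumerate(neg_cols, start=1):
    demo[f"neg_from_hotel{i}"] = demo[col].notna().astype(int)
demo["neg_missing"] = demo["score_neg"].isna().astype(int)

# Step 2: fill the remaining gaps with a SimpleImputer (constant 0.5 assumed here)
imputer = SimpleImputer(strategy="constant", fill_value=0.5)
demo["score_neg"] = imputer.fit_transform(demo[["score_neg"]]).ravel()
print(demo.filter(regex="score_neg$|neg_from|neg_missing"))
```

Passing `add_indicator=True` to the SimpleImputer would append a missing-indicator column to its output instead of the hand-made `neg_missing` flag, but it would not tell which hotels the mean came from.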

Marijkevandesteene / MachineLearning

Processing of scoreX_pos and scoreX_neg #8