Closed binomaiheu closed 5 months ago
I propose to rescale the "Gaussian-like" distribution to a uniform distribution like the rest, assuming that the data in there does hold some value.
Code:
import scipy.stats as stats
train_V2["score5_neg_uniform"] = ( train_V2["score5_neg"] - train_V2["score5_neg"].mean() ) / train_V2["score5_neg"].std()
train_V2["score5_neg_uniform"] = stats.norm.cdf(train_V2["score5_neg_uniform"])
So: first subtract the mean and divide by the std to get a standardised z-score, then take the normal CDF. If score5_neg is roughly Gaussian, that should yield an approximately uniform distribution on [0, 1]. Some plots:
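import matplotlib.pyplot as plt
# Top row: histograms before/after rescaling; bottom row: scatter against score5_pos
fig, axs = plt.subplots(2, 2, figsize=(12, 12))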
train_V2["score5_neg"].hist(ax=axs[0][0])
train_V2["score5_neg_uniform"].hist(ax=axs[0][1])
axs[1][0].plot(train_V2["score5_neg"], train_V2["score5_pos"], '.')
axs[1][1].plot(train_V2["score5_neg_uniform"], train_V2["score5_pos"], '.')
axs[0][0].set_title("score5_neg histogram")
axs[0][1].set_title("score5_neg_uniform histogram")
axs[1][0].set_title("score5_neg vs score5_pos")
axs[1][1].set_title("score5_neg_uniform vs score5_pos")
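As a quick sanity check of the standardise-then-CDF step above, here is a minimal sketch on synthetic data. The series `raw` is a hypothetical stand-in for score5_neg (not the real column), and the Kolmogorov-Smirnov test is just one possible way to check uniformity:

```python
import numpy as np
import pandas as pd
import scipy.stats as stats

# Synthetic, Gaussian stand-in for score5_neg (hypothetical data).
rng = np.random.default_rng(0)
raw = pd.Series(rng.normal(loc=3.0, scale=2.0, size=5000))
raw.iloc[::50] = np.nan  # sprinkle in some missing values

# Same two steps as above: z-score, then standard normal CDF.
z = (raw - raw.mean()) / raw.std()
uniform = pd.Series(stats.norm.cdf(z), index=raw.index)

# NaNs pass through unchanged; everything else lands in [0, 1].
valid = uniform.dropna()
ks = stats.kstest(valid.to_numpy(), "uniform")
print(f"min={valid.min():.3f} max={valid.max():.3f} mean={valid.mean():.3f}")
print(f"KS statistic vs U(0, 1): {ks.statistic:.3f}")
```

A small KS statistic here only says the transform works when the input really is Gaussian; on the real column the histogram remains the more honest check.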
@Marijkevandesteene, @dluts, what do you think?
We could then calculate the mean scores and include those in the model?
train_V2["score_pos"] = train_V2[["score1_pos", "score2_pos", "score3_pos", "score4_pos", "score5_pos"]].mean(axis=1)
train_V2["score_neg"] = train_V2[["score1_neg", "score2_neg", "score3_neg", "score4_neg", "score5_neg_uniform"]].mean(axis=1)
That yields:
And leads to
which is already a lot better than before. The question is how to deal with the remaining NaNs. I propose simply imputing a score of 0.5 (the mean of a uniform distribution on [0, 1]) for each?
On the other hand, there doesn't seem to be much correlation with the outcome variables either, so I'm not sure this is all worth the effort.
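To illustrate why averaging reduces the number of missing values, and what the proposed 0.5 imputation would look like, here is a small sketch on a toy frame. The data and the `score_pos_missing` flag column are hypothetical, purely for illustration:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; values are quantile-like scores in [0, 1].
toy = pd.DataFrame({
    "score1_pos": [0.2, np.nan, np.nan],
    "score2_pos": [0.4, 0.9, np.nan],
    "score3_pos": [np.nan, 0.7, np.nan],
})

# mean(axis=1) skips NaNs by default, so a row is NaN only if all inputs are NaN.
toy["score_pos"] = toy[["score1_pos", "score2_pos", "score3_pos"]].mean(axis=1)

# Remember which rows had no score at all before imputing (assumed column name).
toy["score_pos_missing"] = toy["score_pos"].isna()

# 0.5 is the mean of a uniform distribution on [0, 1].
toy["score_pos"] = toy["score_pos"].fillna(0.5)
print(toy)
```

Only the last row, where every score is missing, ends up with the imputed 0.5.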
Now that I think about it, didn't that dictionary say "given as quantile"? That would justify this. It might be slightly better still with an ECDF instead of assuming a Gaussian distribution, but so be it.
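For reference, a minimal sketch of the ECDF variant mentioned above, using pandas `rank(pct=True)` as the empirical CDF. The skewed series `raw` is hypothetical and only shows that no Gaussian assumption is needed:

```python
import numpy as np
import pandas as pd

# Hypothetical skewed data standing in for score5_neg (clearly not Gaussian).
rng = np.random.default_rng(1)
raw = pd.Series(rng.lognormal(mean=0.0, sigma=1.0, size=2000))
raw.iloc[::40] = np.nan

# rank(pct=True) divides each value's rank by the number of non-missing values,
# which is the empirical CDF evaluated at that value. NaNs stay NaN by default.
ecdf_score = raw.rank(pct=True)

valid = ecdf_score.dropna()
print(f"min={valid.min():.4f} max={valid.max():.4f} mean={valid.mean():.4f}")
```

One caveat: ranking like this is fitted on the data at hand, so applying the same mapping to a separate test set would need the training ECDF to be stored and reused.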
Idea from @dluts: use add_indicator of the SimpleImputer to reconstruct the missing values --> but it is better to introduce new features. From the discussion:
Interesting, score5_neg looks very different. I am looking at averaging those score variables as a mean across the 5 hotels, so that we have fewer missing values, but this one has to be taken out first:
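For the add_indicator idea from @dluts mentioned earlier, a minimal sketch of how scikit-learn's SimpleImputer behaves with that option. The array `X` and the constant fill value of 0.5 are assumptions for illustration, not the project's actual pipeline:

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical two-column score matrix with some missing entries.
X = np.array([
    [0.3, 0.6],
    [np.nan, 0.8],
    [0.7, np.nan],
    [0.1, 0.2],
])

# Fill NaNs with 0.5 and append one 0/1 indicator column per feature
# that had missing values during fit.
imputer = SimpleImputer(strategy="constant", fill_value=0.5, add_indicator=True)
X_out = imputer.fit_transform(X)
print(X_out)
```

The first two output columns are the imputed scores and the last two flag where a value was missing, so the model can still see which scores were imputed.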