Getting different results from the ones in article

ibuda commented 4 years ago

Hi, first of all, thank you for a great tool!

I tried reproducing the experiments from the Jupyter notebooks but got different results, both for the quadratic function and for the Titanic dataset.

For the quadratic function, I get 0.67 instead of the 0.88 mentioned in the article. That discrepancy might be caused by randomness in the data, but the discrepancy for the Titanic dataset is bigger: the article reports a PPS of 0.67 between TicketId and TicketPrice, whereas re-running your notebooks gives me 0.27.

You can see the steps to reproduce the discrepancy in this notebook; please skip to line 26.

I am using Python 3.6.8, sklearn 0.22, and ppscore 0.0.2; the versions are also shown in the notebook mentioned above.

8080labs commented 4 years ago

Hi Ivan,

Thank you for pointing this out. We will look into this further and adjust the article.

The main reason for the difference should be randomness. In the Titanic example, your calculation and mine most likely used different splits for the cross-validation sets.
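To illustrate the point about splits, here is a minimal sketch in plain Python of how the seed used for shuffling determines which rows land in which cross-validation fold. The helper `kfold_indices` is hypothetical and written only for this illustration; it is not how ppscore or sklearn build their folds.

```python
import random

def kfold_indices(n_rows, n_folds, seed):
    """Shuffle row indices with a seeded RNG and deal them into folds."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    # Every n_folds-th shuffled index goes into the same fold.
    return [sorted(indices[i::n_folds]) for i in range(n_folds)]

folds_run_a = kfold_indices(n_rows=12, n_folds=4, seed=1)
folds_run_b = kfold_indices(n_rows=12, n_folds=4, seed=2)
folds_run_a_again = kfold_indices(n_rows=12, n_folds=4, seed=1)

print("run A:", folds_run_a)
print("run B:", folds_run_b)
print("same seed reproduces run A:", folds_run_a == folds_run_a_again)
```

Two runs that do not share a seed will generally evaluate the model on different rows, so a score that is sensitive to which rows are held out can differ noticeably between runs.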

In addition, the score seems to be unstable because of the specific relationships:

First, TicketId has many distinct values: about 60% of the rows have a TicketId that appears only once. In total, only 49 rows (5%) belong to TicketIds which occur 5 times or more. The maximum value count is 7.
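Statistics of this kind can be computed with a simple frequency count. The sketch below uses a made-up list of ticket ids (not the Titanic data), so the numbers it prints are only illustrative.

```python
from collections import Counter

# Hypothetical ticket ids, invented for this illustration.
ticket_ids = ["A1", "A2", "A3", "B7", "B7", "C9", "C9", "C9", "C9", "C9"]

counts = Counter(ticket_ids)
n_rows = len(ticket_ids)

# Rows whose ticket id appears exactly once in the column.
singleton_rows = sum(1 for t in ticket_ids if counts[t] == 1)
# Rows whose ticket id appears 5 times or more.
frequent_rows = sum(1 for t in ticket_ids if counts[t] >= 5)
# Largest number of rows sharing one ticket id.
max_count = max(counts.values())

print(f"rows with a one-off id: {singleton_rows / n_rows:.0%}")
print(f"rows with an id seen >= 5 times: {frequent_rows / n_rows:.0%}")
print("maximum value count:", max_count)
```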

The score with the mismatch is the score from TicketPrice to TicketId.

For a given TicketId the TicketPrice is always constant. So, the model can perfectly predict the TicketPrice if it has already seen the TicketId (e.g., for the 40% of rows which belong to TicketIds that occur more than once). This is also the reason why I was not suspicious of the high score.

In the other direction, from TicketPrice to TicketId, the relationship is sometimes ambiguous. Here it would be interesting to see how the choice of the cross-validation splits affects the splits of the DecisionTreeClassifier.
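The "already saw the TicketId" argument can be sketched with a plain lookup table standing in for the model. Everything here is an assumption made for illustration: the rows are invented, `holdout_hit_rate` is a hypothetical helper, and a single hold-out split replaces the cross-validation that ppscore actually runs.

```python
import random

def holdout_hit_rate(rows, seed, test_share=0.25):
    """Share of held-out rows whose price a ticket-id lookup gets right."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_share))
    test, train = shuffled[:n_test], shuffled[n_test:]
    # The "model": remember the price of every ticket id seen in training.
    lookup = {ticket: price for ticket, price in train}
    hits = sum(1 for ticket, price in test if lookup.get(ticket) == price)
    return hits / len(test)

# Invented (ticket_id, price) rows; the price is constant per ticket id.
rows = (
    [("T1", 7.25), ("T2", 8.05), ("T3", 13.0), ("T4", 9.5)]
    + [("T5", 26.0)] * 2
    + [("T6", 31.3)] * 3
    + [("T7", 52.0)] * 3
)

for seed in range(5):
    print(seed, holdout_hit_rate(rows, seed))
```

The hit rate depends on whether the held-out rows happen to share a ticket id with a training row, which is one way a different split can move the score.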

In order to better understand the variability we might need to dig deeper, but those are already some first observations. As next steps, it would be interesting to plot the actual predictions of the models during cross-validation. It would also be interesting to look at how the final F1 is built.

Best, Florian

lucazav commented 4 years ago

In order to guarantee the reproducibility of results, I'm trying to use the random_seed parameter in the matrix function. But for any value passed to the parameter, I get a matrix of all zeros except for the diagonal (all ones).

Am I doing anything wrong?

FlorianWetschoreck commented 4 years ago

Can you please share the full code of your analysis? Then I can have a look at it. Also, which version of ppscore are you using?

lucazav commented 4 years ago

@FlorianWetschoreck my fault! Now it's working like a charm.

FlorianWetschoreck commented 4 years ago

@lucazav happy to hear that :)

ibuda commented 4 years ago

Gentlemen, your conversation prompted me to check whether the results had changed since I last posted here. I upgraded all the modules used in the article and re-ran all the examples. I noticed an interesting new thing, different from the previous execution. I assume its cause is the same as the one @FlorianWetschoreck describes, but it is still worth exploring (IMHO). The coefficient between the Class and Survived features is now 0. You can check the notebook here, in line 34. Note: This is in no way a critique; on the contrary, it is an attempt to discuss interesting behavior. As before, I am very grateful for your contribution to the data science community.

FlorianWetschoreck commented 4 years ago

Hi Ivan, in ppscore 1.0 we changed the way ppscore chooses the case (classification, regression) based on the data columns. Currently, Survived (as it is encoded in the Titanic dataset) is seen as numeric, and thus ppscore tries a regression. That should be the source of the difference, but we will have another look.
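A rough sketch of why a regression on a 0/1 target can end up at exactly 0 follows. It rests on several assumptions that are not confirmed in this thread: that the regression score is 1 minus the model's MAE divided by the MAE of a naive median prediction, clipped at 0; that the model effectively predicts the mean survival rate per class; and that the class counts below, chosen to resemble the Titanic training data, are close enough for illustration. It is evaluated in-sample and does not call ppscore.

```python
from statistics import median

# Illustrative (class, passengers, survivors) counts; an assumption,
# not figures taken from this thread.
groups = [(1, 216, 136), (2, 184, 87), (3, 491, 119)]

# Expand into (class, survived) rows with survived encoded as 0/1.
rows = []
for pclass, total, survived in groups:
    rows += [(pclass, 1)] * survived + [(pclass, 0)] * (total - survived)

targets = [y for _, y in rows]

# Naive regression baseline: always predict the median of the target.
naive_prediction = median(targets)
mae_naive = sum(abs(y - naive_prediction) for y in targets) / len(targets)

# Regression-style model: predict the mean survival rate of each class.
class_mean = {pclass: survived / total for pclass, total, survived in groups}
mae_model = sum(abs(y - class_mean[c]) for c, y in rows) / len(rows)

# Normalised score, clipped at 0 when the model does not beat the baseline.
score = max(0.0, 1 - mae_model / mae_naive)
print(round(mae_naive, 3), round(mae_model, 3), score)
```

Under these assumptions the per-class means have a larger absolute error than the constant median, so the clipped score is 0 even though Class clearly carries information about Survived. Giving Survived a non-numeric dtype would presumably make ppscore treat the pair as a classification again.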

ibuda commented 4 years ago

@FlorianWetschoreck Thank you for explaining the difference. I see no reason to keep this issue open, since the discrepancy comes from the difference in approach between the previous and current versions. I think it would be appropriate to update the article with the latest and greatest information.

8080labs / ppscore, issue #4