sbjelogr opened this issue 4 years ago
Many thanks for your review, Sandro. Please, see some answers below:
Analysis Questions:
Q: Why is the shape of the accepted and rejected dataset so different? The rejected dataset has only 9 columns. Are those the features you have in the accepted population? A: Yes, the observation is correct. The accepted population also contains characteristics related to behavioral information that can only be obtained after a customer is accepted. That's why I used only variables that are common between the accepted and rejected datasets.
Q: Why are you undersampling the data? Any specific reason? A: Because the data is imbalanced (there are far fewer defaulted customers than non-defaulted ones).
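For context, a minimal sketch of such a random-undersampling step, assuming a pandas dataframe; the column and frame names below are hypothetical, not the ones from the notebook:

```python
import pandas as pd

def undersample(df: pd.DataFrame, label_col: str, seed: int = 42) -> pd.DataFrame:
    """Randomly downsample every class to the size of the smallest class."""
    n_min = df[label_col].value_counts().min()
    parts = [grp.sample(n=n_min, random_state=seed) for _, grp in df.groupby(label_col)]
    # concatenate the balanced parts and shuffle the rows
    return pd.concat(parts).sample(frac=1, random_state=seed).reset_index(drop=True)

# hypothetical data: 8 non-defaulted (0) vs 2 defaulted (1) customers
loans = pd.DataFrame({"amount": range(10), "default": [0] * 8 + [1] * 2})
balanced = undersample(loans, "default")
# both classes are now represented by 2 rows each
```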
Q: In cell [35], you are probably more interested in the distribution rather than the actual counts. A: Agree, will fix
Q: In general, do not rely on the `.score` method in the sklearn classifiers. A: Agree, will fix
Q: Any specific reason you chose the LGBMRanker? A: Yes! Prior research reported good results with LightGBM. However, I understand that LightGBM can be used as a classifier, a regressor, or a ranker, and the earlier studies used it as a classifier. I didn't find related literature that applied any learning-to-rank methods, including LightGBM, to RI. I thought it could be a nice experiment. It would be nice to discuss this idea together.
The Python-related comments are very helpful - I'll aim to implement them over the next few weeks.
General comments
Currently the notebook is a bit difficult to read.
Notebooks allow for markdown cells, where you can clearly explain what you are doing and why. This will help you, as well as an external reader like me, to follow the logic of the work.
Analysis questions/comments
Why is the shape of the accepted and rejected dataset so different? The rejected dataset has only 9 columns. Are those the features you have in the accepted population?
Why are you undersampling the data? Any specific reason?
In cell [35], you are probably more interested in the distribution rather than the actual counts.
You can get the distribution statistics by calling `test_pred2.describe()`.
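For instance, with a hypothetical series of predicted probabilities standing in for `test_pred2`:

```python
import pandas as pd

# hypothetical predicted probabilities (stand-in for test_pred2)
test_pred = pd.Series([0.1, 0.2, 0.2, 0.8, 0.9])

# .describe() summarises the distribution (count, mean, std, quartiles)
# rather than the raw counts you would get from .value_counts()
stats = test_pred.describe()
print(round(stats["mean"], 2))  # 0.44
```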
In general, do not rely on the `.score` method in the sklearn classifiers. This by default will print the accuracy, which needs to be treated carefully in an imbalanced setting; `roc_auc_score` or `f1_score` are better options.
Any specific reason you chose the `LGBMRanker`?

Python related comments
use of `elif`

I often see deeply nested `if`/`else` blocks in the notebook. Note that in Python you can use the `elif` statement (which stands for "else if"); it lets you rewrite such code with much less nesting, which is easier to read. See more details here.
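A sketch of the kind of rewrite meant above; the scorecard-style function is illustrative, not taken from the notebook:

```python
def risk_band(score: int) -> str:
    # nested if/else: each extra condition adds another indentation level
    if score < 400:
        return "high"
    else:
        if score < 600:
            return "medium"
        else:
            return "low"

def risk_band_flat(score: int) -> str:
    # same logic with elif: one level of indentation, easier to read
    if score < 400:
        return "high"
    elif score < 600:
        return "medium"
    else:
        return "low"

# both versions agree on every input
assert all(risk_band(s) == risk_band_flat(s) for s in (300, 550, 700))
```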
piping data in pandas
I notice that you are doing a lot of transformations on the dataframes, and as a consequence you have a lot of names for different dataframes, like `small_accept`, `df1`, `df2`, etc. Pandas has a handy method called `.pipe` that could help you with it. Have a look at this talk from PyData Eindhoven. The link starts at minute 13 of the video, where Vincent introduces the need for the `.pipe` method and explains it very well.

statsmodels and sklearn LogisticRegression
I would recommend that you keep the intercept in the fit method. This can only improve your model performance :) In sklearn it's an easy switch (`fit_intercept=True`). For statsmodels, you need to add a new column to the dataset, i.e. you need to use `sm.add_constant`.

variable naming
Please note that you re-use the names for different Python objects. For example, in the first part of the notebook, `binning` is a data frame, while around cell 20 you introduce a function named `binning`. This is definitely not a good practice, for obvious reasons :)

usage of global variables within functions
I see that some functions use variables defined in other cells; see the example in cell 59. The variable `dfr_dev3` is not defined within the function, which might lead to some unwanted behaviours.
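A sketch of the issue and one way to fix it; apart from `dfr_dev3`, the names are made up for illustration:

```python
import pandas as pd

# defined in one notebook cell...
dfr_dev3 = pd.DataFrame({"amount": [100, 200, 300]})

# ...and silently picked up as a global inside a function in another cell:
# re-running cells in a different order changes what this returns
def total_amount_implicit():
    return dfr_dev3["amount"].sum()

# safer: pass the dataframe explicitly, so the function
# depends only on its arguments
def total_amount(df: pd.DataFrame):
    return df["amount"].sum()

assert total_amount(dfr_dev3) == total_amount_implicit() == 600
```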