sbjelogr opened this issue 4 years ago
Many thanks for your review, Sandro. Please, see some answers below:
Analysis Questions:
Q: Why is the shape of the accepted and rejected dataset so different? The rejected dataset has only 9 columns. Are those the features you have in the accepted population? A: Yes, the observation is correct. The accepted population also contains characteristics related to behavioral information that can only be obtained after a customer is accepted. That's why I used only variables that are common between the accepted and rejected datasets.
Q: Why are you undersampling the data? Any specific reason? A: Because the data is imbalanced (there are far fewer defaulted customers than non-defaulted ones).
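For context, a minimal sketch of such a random-undersampling step, assuming a pandas dataframe; the column and frame names below are hypothetical, not the ones from the notebook:

```python
import pandas as pd

def undersample(df: pd.DataFrame, label_col: str, seed: int = 42) -> pd.DataFrame:
    """Randomly downsample every class to the size of the smallest class."""
    n_min = df[label_col].value_counts().min()
    parts = [grp.sample(n=n_min, random_state=seed) for _, grp in df.groupby(label_col)]
    # concatenate the balanced parts and shuffle the rows
    return pd.concat(parts).sample(frac=1, random_state=seed).reset_index(drop=True)

# hypothetical data: 8 non-defaulted (0) vs 2 defaulted (1) customers
loans = pd.DataFrame({"amount": range(10), "default": [0] * 8 + [1] * 2})
balanced = undersample(loans, "default")
# both classes are now represented by 2 rows each
```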
Q: In cell [35], you are probably more interested in the distribution rather than the actual counts. A: Agree, will fix
Q: In general, do not rely on the `.score` method in the sklearn classifiers. A: Agree, will fix
Q: Any specific reason you chose the LGBMRanker? A: Yes! Prior research reported good results with LightGBM. However, I understand that LightGBM can be used as a classifier, a regressor, or a ranker, and the earlier studies used it as a classifier. I didn't find related literature that applied any learning-to-rank methods, including LightGBM, to RI. I thought it could be a nice experiment. It would be nice to discuss this idea together.
The Python-related comments are very helpful - I'll aim to implement them over the next few weeks.
General comments
Currently the notebook is a bit difficult to read.
Notebooks allow for markdown cells, where you can clearly explain what you are doing and why. This will help you, as well as an external reader like me, to follow the logic of the work.
Analysis questions/comments
Why is the shape of the accepted and rejected dataset so different? The rejected dataset has only 9 columns. Are those the features you have in the accepted population?
Why are you undersampling the data? Any specific reason?
In cell [35], you are probably more interested in the distribution rather than the actual counts.
You can get the distribution statistics by calling `test_pred2.describe()`.
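For instance, with a hypothetical series of predicted probabilities standing in for `test_pred2`:

```python
import pandas as pd

# hypothetical predicted probabilities (stand-in for test_pred2)
test_pred = pd.Series([0.1, 0.2, 0.2, 0.8, 0.9])

# .describe() summarises the distribution (count, mean, std, quartiles)
# rather than the raw counts you would get from .value_counts()
stats = test_pred.describe()
print(round(stats["mean"], 2))  # 0.44
```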
In general, do not rely on the `.score` method in the sklearn classifiers. This by default will print the accuracy, which needs to be treated carefully in an imbalanced setting; `roc_auc_score` or `f1_score` are better options.
Any specific reason you chose the `LGBMRanker`?

Python related comments
use of `elif`

I often see deeply nested `if`/`else` blocks in the notebook. Note that in Python you can use the `elif` statement (which stands for "else if"); it lets you rewrite such code with much less nesting, which is easier to read. See more details here.
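A sketch of the kind of rewrite meant above; the scorecard-style function is illustrative, not taken from the notebook:

```python
def risk_band(score: int) -> str:
    # nested if/else: each extra condition adds another indentation level
    if score < 400:
        return "high"
    else:
        if score < 600:
            return "medium"
        else:
            return "low"

def risk_band_flat(score: int) -> str:
    # same logic with elif: one level of indentation, easier to read
    if score < 400:
        return "high"
    elif score < 600:
        return "medium"
    else:
        return "low"

# both versions agree on every input
assert all(risk_band(s) == risk_band_flat(s) for s in (300, 550, 700))
```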
piping data in pandas
I notice that you are doing a lot of transformations on the dataframes, and as a consequence you have a lot of names for different dataframes, like `small_accept`, `df1`, `df2`, etc. Pandas has a handy method called `.pipe` that could help you with it. Have a look at this talk from PyData Eindhoven. The link starts at minute 13 of the video, where Vincent introduces the need for the `.pipe` method and explains it very well.

statsmodels and sklearn LogisticRegression
I would recommend that you keep the intercept in the fit method. This can only improve your model performance :) In sklearn it's an easy switch (`fit_intercept=True`). For statsmodels, you need to add a new column to the dataset, i.e. you need to use `sm.add_constant`.

variable naming
Please note that you re-use the names for different Python objects. For example, in the first part of the notebook, `binning` is a data frame, while around cell 20 you introduce a function named `binning`. This is definitely not a good practice, for obvious reasons :)

usage of global variables within functions
I see that some functions use variables defined in other cells; see the example in cell 59. The variable `dfr_dev3` is not defined within the function, which might lead to some unwanted behaviours.
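A sketch of the issue and one way to fix it; apart from `dfr_dev3`, the names are made up for illustration:

```python
import pandas as pd

# defined in one notebook cell...
dfr_dev3 = pd.DataFrame({"amount": [100, 200, 300]})

# ...and silently picked up as a global inside a function in another cell:
# re-running cells in a different order changes what this returns
def total_amount_implicit():
    return dfr_dev3["amount"].sum()

# safer: pass the dataframe explicitly, so the function
# depends only on its arguments
def total_amount(df: pd.DataFrame):
    return df["amount"].sum()

assert total_amount(dfr_dev3) == total_amount_implicit() == 600
```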