dselivanov / FTRL

R/Rcpp implementation of the 'Follow-the-Regularized-Leader' algorithm

Add weights parameter as mentioned in section 4.6 "Subsampling Training Data" #3

Open DavidArenburg opened 7 years ago

DavidArenburg commented 7 years ago

First of all, thanks for the great effort, it looks great. The combination of sparseMatrix with Rcpp (instead of R's memory-expensive model.matrix) looks very promising!

Though, as mentioned many times in the paper, in the real world we face very sparse data with a very small number of successes, so the data is very unbalanced. A plain logistic regression implementation can't handle this (although it produces very high accuracy, no true positives are found), so it is crucial to re-balance the data using some type of weights.

In section 4.6 of the paper, they introduce a pretty straightforward implementation of subsampling correction.
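
For illustration, here is a minimal sketch of that kind of subsampling correction in base R. It assumes a per-example simplification: every positive is kept, each negative is kept with probability `r`, and each retained negative gets an importance weight of `1 / r`. The toy labels, the rate `r = 0.1`, and all variable names are hypothetical and not taken from the package.

```r
set.seed(42)

# Toy labels: 10 positives, 990 negatives (hypothetical data).
y <- c(rep(1, 10), rep(0, 990))

# Hypothetical sampling rate for negatives.
r <- 0.1

# Keep every positive; keep each negative with probability r.
keep <- (y == 1) | (runif(length(y)) < r)
y_sub <- y[keep]

# Importance weights: 1 for positives, 1 / r for retained negatives,
# so the expected total weight of negatives matches the original count.
w_sub <- ifelse(y_sub == 1, 1, 1 / r)

table(y_sub)
```

The same row filter `keep` would also be applied to the feature matrix before training.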

dselivanov commented 7 years ago

Hi. I did this a couple of days ago, see #2. The partial_fit method now has an additional argument for weights. I've tried it myself and it seems to work pretty well.

DavidArenburg commented 7 years ago

Great! Can you also update the docs and add an example of how to generate and use the weights? Thanks

dselivanov commented 7 years ago

The idea is to weight the classes inversely proportional to their frequency, so the major class is down-weighted relative to the minor class. For example, say you have a dataset with 1000 examples, 10 of which are positive and 990 negative. In this case it is generally a good idea to set the weight to 1 for positive examples and ~0.01 (10/990) for negative examples.
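
A minimal sketch of that calculation in base R, using the numbers from the example above. The variable names are illustrative, and the commented-out call at the end is an assumption about how the weights would be passed: the thread only says that partial_fit takes an additional weights argument, so `model`, `X`, and the argument name are hypothetical.

```r
# Toy labels matching the example: 10 positives, 990 negatives.
y <- c(rep(1, 10), rep(0, 990))

n_pos <- sum(y == 1)
n_neg <- sum(y == 0)

# Positives keep weight 1; negatives get n_pos / n_neg (about 0.01).
w <- ifelse(y == 1, 1, n_pos / n_neg)

# After weighting, both classes carry the same total weight
# (10 each, up to floating-point rounding).
sum(w[y == 1])
sum(w[y == 0])

# Assumption: `model` is an FTRL model object and `X` is the sparse
# feature matrix; check the package docs for the exact argument name.
# model$partial_fit(X, y, weights = w)
```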

DavidArenburg commented 7 years ago

Yeah, I get that. I just wanted to see an actual code example in the docs.

dselivanov commented 7 years ago

Let's keep it open as a reminder to update the docs.