evanwang1990 / FMwR

FMwR: A Library of Factorization Machines in R Based on libfm
GNU General Public License v3.0

[question] solver for large problem #3

Closed dselivanov closed 7 years ago

dselivanov commented 7 years ago

This is not an issue, but more a question. I'm trying to solve a large but very sparse problem: ~3-8 million features and >50 million samples. Honestly speaking, I'm working on this: https://kaggle.com/c/outbrain-click-prediction. I've tried several solvers, and it seems the only suitable one is vanilla SGD; all the others are much slower. Is that true? I have a feeling that FTRL should be at least the same order of speed, but it seems to be much slower (~10x). Is that true? Which solver can you suggest? And which solvers support updates? (At first glance it seems they all support "mini-batch" updates.)

evanwang1990 commented 7 years ago

I think the FTRL and TDAP solvers converge faster than SGD, meaning they need fewer iterations to reach the same performance, but each iteration costs much more time to update the parameters than an SGD iteration does. I used fwmr_segfault.rds to run a test: SGD needs about 3e6 iterations to reach an AUC of 0.76, while FTRL needs just 2e5 iterations.

# SGD: many cheap iterations
control_lst = list(track.control(step_size = 5000000),
                   solver.control(solver = SGD.solver(random_step = 10, learn_rate = 1e-5),
                                  max_iter = 500000000),
                   model.control(factor.number = 2, v.init_stdev = 0.01))

# FTRL: far fewer, but more expensive, iterations
control_lst = list(track.control(step_size = 2000),
                   solver.control(solver = FTRL.solver(random_step = 10),
                                  max_iter = 200000),
                   model.control(factor.number = 2, v.init_stdev = 0.01))
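For intuition on why each FTRL iteration costs more than an SGD step, here is a minimal sketch of the per-coordinate FTRL-Proximal update (after McMahan et al.). The parameter names (alpha, beta, l1, l2) are illustrative and are not taken from FMwR's internals:

```r
# Per-coordinate FTRL-Proximal update for a linear weight, given its current
# accumulators (z, n) and the gradient g of this round. Maintaining (z, n)
# and re-deriving w each step is the extra bookkeeping that makes one FTRL
# iteration slower than one plain SGD iteration.
ftrl_update <- function(z, n, g, alpha = 0.1, beta = 1.0, l1 = 1.0, l2 = 1.0) {
  # Lazy weight: zero if the accumulated signal is inside the L1 ball
  w <- ifelse(abs(z) <= l1, 0,
              -(z - sign(z) * l1) / ((beta + sqrt(n)) / alpha + l2))
  # Per-coordinate learning-rate adjustment
  sigma <- (sqrt(n + g^2) - sqrt(n)) / alpha
  list(z = z + g - sigma * w,  # updated accumulators
       n = n + g^2,
       w = w)                  # weight used this round
}
```

An SGD step, by contrast, is just `w <- w - learn_rate * g` per touched coordinate, which is why SGD can afford many more iterations in the same wall-clock time.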
dselivanov commented 7 years ago

Thank you very much for the advice. I have a feeling that in my case the algorithm will benefit more from seeing more data, but I could be wrong. Can you also give me an idea of what the random_step parameter is? As for the other parameters, I guess that for FTRL, alpha_w = 0.1, alpha_v = 0.1, beta_w = 1.0, beta_v = 1.0 reflect the learning rates and the "beta" parameter of the original FTRL regression, for the features and the feature interactions respectively. And for the TDAP solver, alpha_w = 0.1 and alpha_v = 0.1 are the learning rates and gamma is the "time" decay?

evanwang1990 commented 7 years ago

The row number of the observation used in the (i+1)-th iteration is R_{i+1} = R_i + t, where t is randomly selected from [1, random_step].
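The row-skipping rule above can be sketched in a few lines of R. This is an illustration of the described behavior, not FMwR's internal code; the wrap-around at the end of the data set is my assumption:

```r
# Advance from the current row by a step t drawn uniformly from
# {1, ..., random_step}, matching R_{i+1} = R_i + t above.
next_row <- function(current_row, random_step, n_rows) {
  t <- sample.int(random_step, 1)        # t ~ Uniform{1, ..., random_step}
  ((current_row + t - 1) %% n_rows) + 1  # assumed: wrap around the data set
}
```

So with random_step = 1 the solver sweeps the rows in order, while larger values subsample the data, trading coverage per epoch for cheaper epochs.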

dselivanov commented 7 years ago

Closing this, thanks for the help. I will provide some feedback soon. It's really great to have such a package in R.