Closed futurecam closed 5 years ago
We encountered the same problem.
First, thanks for your question @futurecam. We had tested the convergence of FTRL on several data sets and got good, reasonable results before releasing it. About training async mini-batch FTRL, I have the following suggestions:
Hi, Angel team,
We are using Angel on Spark for large-scale wide-model training to serve CTR prediction, so the features are very sparse and the objective function is logloss. When we switched the optimizer to FTRL (it was LBFGS before), we ran into convergence problems:

1) Training the same dataset twice with the same parameters gives different logloss and AUC results, and one of the runs diverges in the first epoch. There is no random shuffle of the data during training, and we did not set angel.staleness, so I assume the default value is 0 (BSP).

2) Training the same dataset twice, but with one run using only a subset of the features, the subset run cannot converge, and tuning the parameters does not help.
Since all of the cases above converge well with LBFGS, I suspect that the convergence of FTRL is sensitive to the data in distributed learning. I don't know whether the convergence guarantee of FTRL on a parameter server is sufficient when Z/N are calculated at the workers and updated at the server. Or do we need some other mechanism, like a trust-region method, to avoid divergence? Can anyone help?
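For context on the Z/N question: in standard per-coordinate FTRL-Proximal (McMahan et al., 2013), the weight is recomputed from the accumulated Z and N on every update. A minimal single-machine sketch of that update rule, with conventional hyperparameter names (alpha, beta, lambda1, lambda2) — this is an illustration of the textbook algorithm, not Angel's actual implementation or API:

```scala
// Per-coordinate FTRL-Proximal update: given the current accumulators
// (z, n), the current weight w, and a fresh gradient g for one feature,
// return the updated (z, n, w).
object FtrlSketch {
  def updateCoordinate(z: Double, n: Double, w: Double, g: Double,
                       alpha: Double, beta: Double,
                       lambda1: Double, lambda2: Double): (Double, Double, Double) = {
    // Learning-rate adjustment term for this coordinate
    val sigma = (math.sqrt(n + g * g) - math.sqrt(n)) / alpha
    // Accumulate the gradient (minus the rate-adjustment correction) and
    // the squared gradient
    val zNew = z + g - sigma * w
    val nNew = n + g * g
    // Closed-form weight: L1 soft-thresholding plus L2 shrinkage
    val wNew =
      if (math.abs(zNew) <= lambda1) 0.0
      else -(zNew - math.signum(zNew) * lambda1) /
        ((beta + math.sqrt(nNew)) / alpha + lambda2)
    (zNew, nNew, wNew)
  }
}
```

Because w is recomputed from (z, n) after every update, stale or interleaved updates to Z/N from different workers directly change the effective iterate, which is one way asynchrony can hurt convergence here.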
The FTRL optimizer we use is async mini-batch FTRL:

```scala
for (epochId <- 0 until epoch) {
  val tempRDD = trainSet.mapPartitions { iter =>
    iter.toArray.sliding(batchSize, batchSize)
      .map { batch =>
        // ftrl.optimize updates the model parameters Z and N
        ftrl.optimize(batch, calcGradientLoss)
      }
  }
  tempRDD.count()
}
```