TU-Berlin-DIMA / continuous-pipeline-deployment

Repository for continuous deployment of ML pipelines (latex files + code)

SIGMOD 2018 - Reviewer 4 #33

Closed · dbehrouz closed this issue 5 years ago

dbehrouz commented 6 years ago

Summary

This paper explores the challenge of supporting online learning in a serving system. Its key contribution is a good outline of many (but not all) of the key issues around online learning. The authors then describe a reasonable set of solutions to these issues and an instantiation of the solution using Apache Spark. They evaluate the system on the first few days of the Criteo ad prediction benchmark and show that it achieves a 1.6% improvement in accuracy.

Strong Points

S1: The paper is well written and does an excellent job of framing many of the key issues around online training for logistic regression models.

S2: They propose a reasonable set of solutions using relatively established techniques to address issues in online learning.

S3: They provide a relatively detailed system evaluation using the Criteo ad prediction benchmark.

Weak Points

W1: They don't provide much isolation between training and serving and even suggest that serving would be unavailable during training.

W2: They don't address the issue that it might be better to adapt the input rather than the model. Online learning sounds appealing, but if the world is changing quickly enough to benefit from continuous training, then it is often better to model the change directly and retain a static model. (See the detailed evaluation for more details.)

W3: They only evaluate on the first few days of the Criteo benchmark, which suggests that the improvements they are seeing are largely due to improved model fitting and not temporal variation. They also have an odd testing procedure that essentially assumes stationarity. If the goal is to learn quickly in a stationary setting, then the value of the system should quickly diminish.

Detailed Evaluation

D1: In the advertising scenario, why would we need to retrain the model? If it is to adapt to changes in click behavior, those changes could instead be captured by dynamic features: for example, a feature recording how often the ad was clicked in the past hour, or which category of ads was clicked in the last hour (a sketch of such a feature follows below). These "dynamic" features allow the model to "learn temporal dynamics" instead of simply responding to them. More broadly, there is a critical tension in the context of online learning: if the world is changing very quickly, then learning without modeling temporal dynamics (learning the trends as well as the current state) is suboptimal; and if the world is changing slowly, then continuous training has marginal value given enough data.
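
To make the point concrete, here is a sketch of such a dynamic feature (the class name and the one-hour window are hypothetical, not from the paper):

```python
from collections import deque

class RollingClickRate:
    """Hypothetical dynamic feature: fraction of an ad's impressions
    that were clicked in the last hour. The feature tracks the temporal
    dynamics, so the model itself can stay static."""

    def __init__(self, window_seconds=3600):
        self.window = window_seconds
        self.events = deque()            # (timestamp, clicked) pairs

    def update(self, timestamp, clicked):
        self.events.append((timestamp, clicked))
        # Drop events that have aged out of the window.
        while self.events and timestamp - self.events[0][0] > self.window:
            self.events.popleft()

    def value(self):
        if not self.events:
            return 0.0
        return sum(c for _, c in self.events) / len(self.events)
```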

D2: What is meant by "Individual iterations of SGD are independent and typically lightweight."? Each iteration is dependent on the last ...
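
To see the dependence, consider a minimal SGD loop (a sketch; `grad` and the batch stream are placeholders):

```python
def sgd(w, batches, grad, lr=0.01):
    """Plain SGD sketch: step t needs the weights produced by step t-1,
    so iterations are sequentially dependent, not independent."""
    for batch in batches:
        w = w - lr * grad(w, batch)  # reads the previous step's weights
    return w
```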

D3: (minor issue) The equation for w^* should be arg max (since this is the MLE). It is also common to rewrite \log\prod to \sum\log.
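
Concretely, assuming the usual conditional-likelihood notation, the corrected equation would read:

w^* = \arg\max_w \log \prod_{i=1}^{n} P(y_i \mid x_i; w) = \arg\max_w \sum_{i=1}^{n} \log P(y_i \mid x_i; w)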

D4: Logistic regression is typically applied with some form of regularization and the corresponding regularization parameter would also need to be determined. How is this set in a dynamic setting?
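
For concreteness, the L2-regularized form of the objective would be something like

w^* = \arg\min_w \sum_{i=1}^{n} -\log P(y_i \mid x_i; w) + \lambda \|w\|_2^2

and it is the schedule for \lambda in a dynamic setting that the paper leaves open.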

D5: The selection of hyper-parameters during offline training seems a bit incompatible with the use of feature transformations like one-hot encoding, which will introduce new dimensions as new words or categories emerge. What learning rates would be used for these new dimensions? A common strategy for dealing with categorical features is to apply the hashing trick and map each feature to one of k dummy features (see the sketch below). This might simplify the design of the system by eliminating the need to grow the model as new categories appear.
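
A minimal sketch of the hashing trick, where the hash function and the dimension K = 2^18 are illustrative choices:

```python
import hashlib

K = 2 ** 18  # fixed model dimension; never grows

def hashed_index(feature: str, value: str) -> int:
    """Map a categorical (feature, value) pair to one of K dummy features.
    A category never seen during offline training still lands in [0, K),
    so no new dimensions (or new per-dimension learning rates) are needed."""
    digest = hashlib.md5(f"{feature}={value}".encode()).hexdigest()
    return int(digest, 16) % K
```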

D6: Model stability: How do you construct the evaluation set? How do we know the evaluation set is reflective of the current state of the world?

D7: The calculation for when to run model updates seems a bit odd. Shouldn't model updates be applied when there is a sufficiently large gradient in the loss? The use of timing for training frequency as an isolation mechanism also seems a bit odd. If the goal is isolation how does this ensure that training doesn't interfere with a burst of queries? Furthermore, latency in the prediction component is usually a serious issue (especially in ad systems). Wouldn't it make sense to be able to preempt training?
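
One possible gradient-triggered policy, as a sketch (the threshold, learning rate, and `grad` are hypothetical):

```python
import numpy as np

def maybe_update(w, recent_batch, grad, lr=0.01, threshold=1.0):
    """Hypothetical trigger: take a training step only when the loss
    gradient on recent data is large, instead of on a fixed timer."""
    g = grad(w, recent_batch)
    if np.linalg.norm(g) > threshold:
        return w - lr * g, True    # model updated
    return w, False                # model untouched; serving unaffected
```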

D8: This sounds like a bad idea: " Moreover, the scheduler assumes that the entire resources of the computing cluster are being used by the proactive trainer and therefore the prediction answering component is completely blocked while the proactive training is being executed." This means you are not making money on ads while training to get a 1.6% improvement in ad accuracy. Moreover, I like the system decomposition but I really think there needs to be better isolation between the training and inference components. Shutting down inference to do additional marginal re-training is slightly suboptimal ...

D9: The temporally biased sampling is interesting, but it's not clear how you would adjust the temporal bias. What happens if something changes and you need to quickly forget the past? How does the system decide this on the fly?
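
For reference, temporally biased sampling is often implemented with an exponential decay, as in the sketch below (the half-life parametrization is an assumption, not the paper's notation); the question above is how half_life would be adapted on the fly:

```python
import numpy as np

def biased_sample(timestamps, now, half_life, n, rng=None):
    """Sample n historical examples with probability decaying with age.
    A small half_life forgets the past quickly; a large one retains it.
    Nothing here tells the system which to pick when the world shifts."""
    rng = rng or np.random.default_rng()
    age = now - np.asarray(timestamps, dtype=float)
    weights = 0.5 ** (age / half_life)      # exponential temporal bias
    return rng.choice(len(age), size=n, p=weights / weights.sum())
```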

D10: The experimental setup is a bit odd. While the Criteo data does have "day" structure, there is no guarantee that data was sampled uniformly throughout the day, and unfortunately no timestamps are provided (only a guarantee of chronological ordering). Furthermore, the authors only explore 3 days. Why not show the improvements in accuracy over the full 24-day period (or at least the first 21 training days)?

D11: Why train on day 1 and day 2 and measure accuracy on day 6? Given this is an online learning problem, accuracy really should be measured as a running average of the error on the next prediction (see the sketch below).
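
Concretely, the standard prequential ("test-then-train") protocol looks like the following sketch, where `model` is assumed to expose scikit-learn-style `predict` and `partial_fit` methods:

```python
def prequential_error(stream, model):
    """Prequential evaluation: predict each example before training
    on it, and report the running average error."""
    wrong, seen = 0, 0
    for x, y in stream:
        wrong += int(model.predict(x) != y)  # test on the next point ...
        model.partial_fit(x, y)              # ... then train on it
        seen += 1
        yield wrong / seen                   # running average so far
```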

D12: Why not standardize the categorical variables? You could do this by modifying the model, without actually subtracting the mean value from the one-hot encoding (see the identity spelled out below).
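
To spell the trick out: with per-feature means \mu_j and scales \sigma_j, the linear term satisfies

\sum_j w_j (x_j - \mu_j) / \sigma_j = \sum_j (w_j / \sigma_j) x_j - \sum_j w_j \mu_j / \sigma_j

so the mean shift folds into the intercept and the one-hot vectors stay sparse; only the weights and bias of the model change.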

D13: For the offline training, all the algorithms should be tuned to minimize the loss. This is a convex problem and, assuming appropriate regularization is used, there is an optimal answer. Before trying to improve the model on new data, wouldn't it be best to start with the best offline model? Figure 5 makes me doubt the implementations of the other algorithms ...

D14: The authors caution against embracing Adam as the best online optimization algorithm. Why? In my experience, Adam is the most robust method for these convex problems. Why does the user need to evaluate all the other techniques? Is there evidence to suggest that this is necessary?
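
For reference, the standard Adam update (per coordinate, with bias correction) is

m_t = \beta_1 m_{t-1} + (1 - \beta_1) g_t
v_t = \beta_2 v_{t-1} + (1 - \beta_2) g_t^2
w_t = w_{t-1} - \eta \hat{m}_t / (\sqrt{\hat{v}_t} + \epsilon), where \hat{m}_t = m_t / (1 - \beta_1^t) and \hat{v}_t = v_t / (1 - \beta_2^t)

and its per-coordinate step sizes are precisely what make it robust on sparse convex problems like this one.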

D15: The following statement is a bit confusing: "For Criteo pipeline, the dataset has a stable distribution, which stays the same throughout the course of the experiment. As a result, limiting the training to the more recent data exposes the model to newer and unseen data which results in bigger changes (toward convergence) in the weights of the model (e.g., no sampling)." What is meant by a "stable distribution"? Is this a stable joint distribution or just stable marginals? Is the implication that there is no concept drift or covariate shift in the Criteo data? If so, we would expect training over the entire history to perform best. Alternatively, if there really is substantial short-term concept drift (as is concluded from Figure 6), then training on days 1 and 2 and testing on day 6 would be a bad idea; instead, you would want to evaluate the techniques using the online learning loss.

To Do