This paper presents a system for performing online model maintenance using stochastic gradient descent (SGD). To improve performance, the system caches materialized features to avoid re-computation.
Strong Points
S1 The problem of model serving is increasingly important.
S2 The paper describes and implements a real system.
S3 The paper evaluates using real data.
Weak Points
W1 The paper is not particularly novel; it essentially performs online learning via SGD.
W2 Several aspects of the experimental setup require improvement.
W3 The paper evaluates on one model and dataset.
Detailed Evaluations
D1 While this paper makes the valuable observation that online SGD can improve model quality, I found it short on useful, interesting, or surprising insights. In essence, the main proposal is to perform SGD at runtime to improve model quality. This strategy is standard in the ML literature, where this problem is often called "online learning" and is the subject of hundreds of papers, many of which are practical (e.g., "Identifying Suspicious URLs: An Application of Large-Scale Online Learning" ICML 2009). In fact, the predecessor to many adaptive gradient methods, including Adadelta, was developed in this context (cf. "Adaptive subgradient methods for online learning and stochastic optimization" by Duchi et al. JMLR 2011). This is to say: the use of SGD to update models far pre-dates the parameter server, and the problem of online updates to models, especially under non-stationary distributions, has a long history. Not only does this paper not cite or directly acknowledge this body of work, it does not substantially build upon it.
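For context on how standard this machinery is: the core of such an online-learning loop is tiny. Below is a minimal sketch of a per-example SGD update for logistic regression (illustrative only -- the sparse-dict representation, learning rate, and `sgd_update` name are mine, not the paper's):

```python
import math

def sgd_update(w, x, y, lr=0.1):
    """One online SGD step for logistic regression on a single example.

    w: dict mapping feature index -> weight (sparse model)
    x: dict mapping feature index -> value (sparse features)
    y: label in {0, 1}
    """
    # Prediction: sigmoid of the sparse dot product.
    z = sum(w.get(i, 0.0) * v for i, v in x.items())
    p = 1.0 / (1.0 + math.exp(-z))
    # Gradient of the log loss w.r.t. each active weight is (p - y) * x_i.
    for i, v in x.items():
        w[i] = w.get(i, 0.0) - lr * (p - y) * v
    return w

# Each arriving click event immediately updates the model:
w = {}
for x, y in [({0: 1.0, 3: 1.0}, 1), ({1: 1.0, 3: 1.0}, 0)]:
    w = sgd_update(w, x, y)
```

Adaptive variants (AdaGrad, Adam) only change how `lr` is scaled per coordinate; the online-update structure is identical, which is why "run Adam at serving time" alone is not a contribution.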
D2 More specifically, I don't expect a SIGMOD paper to innovate in learning theory; this isn't the conference for doing so. However, in building an end-to-end system that utilizes online learning, I'd expect more lessons, design principles, or algorithmic innovations beyond just "run Adam". The feature materialization in retraining is a start, but the main contributions of the paper deserve further development before this paper is ready for prime time. As-is, the paper felt thin on depth -- somewhat valuable as an end-to-end system implementing these methods -- but not sufficiently above the bar or interesting to merit acceptance to the research track.
D3 Beyond the above conceptual issues, I have a number of concerns about the experimental evaluation (W2 and W3):
-- The decision to use two days to train and the sixth day to test is problematic, on at least two counts:
i) Figure 6 suggests major concept drift between train and test, which is corroborated by the extreme growth in the number of features. I suspect the distribution is non-stationary, and that day six's data simply contains more features than day one and two, and so the eval is simply measuring: "what percentage of the full feature set are you observing?" A more reasonable comparison here would be to compare, periodically, the CTR prediction accuracy for the next N clicks in the sequence (for some small N). Beyond fairness, this suggested incremental evaluation metric is more representative of a production environment for proactive training (i.e., in which, at every point, the model is evaluated on the next set of clicks, rather than clicks from several days in the future).
ii) The models are initialized after 500 iterations of SGD; why? At a minimum, the paper should vary the interval of time that has elapsed (e.g., four days, five days). I imagine that the impact of proactive training (especially in light of the suspected drift above) would be substantially affected by a longer warm-up time. Moreover, I suspect the apparent relative stationarity hinted at in production systems such as TFX reflects the fact that these industrial systems are training on months to years of data, not 500 mini-batches; undoubtedly, concepts drift, but likely much more slowly.
-- The paper should use a stronger baseline for its model quality (Figure 10). Presumably, there is a parameter setting (Periodical3000?) for which batch training outperforms incremental SGD. If not, then is the paper simply making the case that, for non-stationary distributions, training over a sliding window is better than training over all of the historical data? Undoubtedly, the baseline I propose will be slow to run: the point is that the reader should not have to guess about the accuracy cost of online learning -- it should be quantitatively evaluated.
-- The paper evaluates one model (logistic regression) on one dataset (Criteo). To demonstrate generality and explore the trade-off space in greater depth, the paper should evaluate more model types (e.g., SVM with kernels, collaborative filtering, deep nets), and additional datasets (e.g., Netflix, something synthetic to vary concept drift).
-- The paper should more clearly explain what stages of the proposed Criteo pipeline most benefit from caching; everything but parsing seems like it should be pretty fast.
-- The paper uses Spark Streaming as an execution engine, which incurs 23-53+ second delays in training. It would be valuable to quantify the accuracy cost of this slow retraining (e.g., compared to running SGD on a parameter server-style architecture, or an online system like Clipper or Velox).
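The incremental "next N clicks" evaluation suggested in (i) above amounts to standard prequential (test-then-train) evaluation. A sketch of the loop I have in mind, where `model` is any object with hypothetical `predict` and `update` methods (these names are mine, not the paper's API):

```python
def prequential_eval(model, stream, n=1000):
    """Prequential ("test-then-train") evaluation over a click stream.

    At each step: score the model on the next n examples, record the
    accuracy, then train on those same examples before moving on.
    This measures quality as a production serving system would see it,
    rather than testing on clicks from several days in the future.
    """
    accuracies = []
    batch = []
    for x, y in stream:
        batch.append((x, y))
        if len(batch) == n:
            correct = sum(1 for bx, by in batch if model.predict(bx) == by)
            accuracies.append(correct / n)
            for bx, by in batch:
                model.update(bx, by)  # e.g., one online SGD step per example
            batch = []
    return accuracies
```

Plotting `accuracies` over the stream would also make any concept drift directly visible, instead of leaving it to be inferred from feature-count growth.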
Small notes:
-- Velox caches feature computation for inference and should be cited as such. However, unlike this paper, Velox does not cache feature computation for training.
-- The paper should report the total number of data points for the Criteo dataset per day.
To Do
[ ] Cite Velox for feature caching (although they are caching the features for inference and not training)
[ ] Report number of data points
[ ] Use different models and different datasets (SVM with Kernels, collaborative filtering, ...)
[ ] Make clearer why caching is useful in the Criteo pipeline
[ ] Quantify the accuracy implications of the training delays incurred by Spark Streaming
[ ] Use stronger baselines
[ ] Justify training on days 1 and 2 and testing on day 6; it is more logical to test on the next N clicks
[ ] In building an end-to-end system that utilizes online learning, I'd expect more lessons, design principles, or algorithmic innovations beyond just "run Adam"
[ ] SGD is a very well-established method, and the paper neither cites nor builds upon the existing work. I have to make clear why this is not simple SGD and how it differs from many existing works, specifically in online learning.
[ ] Read "Adaptive subgradient methods for online learning and stochastic optimization" by Duchi et al. JMLR 2011
[ ] "Identifying Suspicious URLs: An Application of Large-Scale Online Learning" ICML 2009