Open cristiprg opened 7 years ago
@cristiprg
@jeyhunkarimov
Currently, big systems train and serve continuously, so training and serving are not separated from each other. This is especially useful in streaming scenarios.
I'm not sure I understand what you mean by training and serving not being separated from each other. Is it that all the design decisions take into account both fast training and fast serving? So the system has to be able to rapidly process both training points and query points (indeed, this can imply sacrificing some consistency). I can see why this is useful in streaming applications.
@cristiprg Hope this helps: https://www.usenix.org/system/files/conference/osdi14/osdi14-paper-li_mu.pdf
My second point is out of the scope of this project. I noted it to give motivation that there is a need for such systems.
@jeyhunkarimov From what I understand, we have to maintain the state of the ML algorithms. By state I mean something similar to a model that we maintain and update based on the training data:
Training data ----> Training procedure ----> Model / State.
Then we use this state for prediction:
Query point ----> Apply model -----> prediction.
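To make the two pipelines above concrete, here is a minimal sketch in plain Python. It assumes a one-dimensional least-squares fit as the training procedure; the names `LinearModel`, `train`, and `predict` are made up for illustration and are not Spark APIs. The point is only that the state that survives training can be a couple of numbers.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LinearModel:
    """The entire 'state' kept after training: two floats (hypothetical class)."""
    slope: float
    intercept: float


def train(points):
    """Training data ----> training procedure ----> model (1-D least squares)."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    var_x = sum((x - mean_x) ** 2 for x, _ in points)
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = cov_xy / var_x
    return LinearModel(slope=slope, intercept=mean_y - slope * mean_x)


def predict(model, x):
    """Query point ----> apply model ----> prediction."""
    return model.slope * x + model.intercept


# Toy training data lying exactly on y = 2x + 1 (made-up values).
model = train([(0.0, 1.0), (1.0, 3.0), (2.0, 5.0)])
prediction = predict(model, 3.0)
```

However large the training set is, `predict` only ever needs the two fields of `model`.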
Now, this model is usually very small (compared to the amount of training/testing data). For example, the model class MultilayerPerceptronClassificationModel [1], which MultilayerPerceptronClassifier [2] creates after training, contains some "layers" and some "weights", i.e. an array of ints and a vector of doubles, so there is absolutely no need for a complicated/distributed database.
This is why I don't see the need for a system such as Redis. I'm fairly sure there are some errors in my reasoning; can you point out what I got wrong?
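As a rough back-of-the-envelope check of the size argument, the sketch below counts the parameters of a small fully connected network. It assumes one bias per output unit and 8-byte doubles, and it ignores serialization overhead; the layer sizes and the helper `mlp_weight_count` are hypothetical, chosen only for illustration.

```python
def mlp_weight_count(layers):
    """Number of weights in a fully connected network.

    Each pair of consecutive layers contributes n_in * n_out weights
    plus n_out biases, i.e. (n_in + 1) * n_out parameters.
    """
    return sum((n_in + 1) * n_out for n_in, n_out in zip(layers, layers[1:]))


# Hypothetical topology: 4 inputs, hidden layers of 5 and 4 units, 3 classes.
layers = [4, 5, 4, 3]
n_weights = mlp_weight_count(layers)

# Assumption: each weight is stored as an 8-byte double.
size_bytes = n_weights * 8
```

Under these assumptions the whole model is 64 doubles, about half a kilobyte. Real models can of course be much larger, so this supports the argument only for small topologies like this one.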
Edit: Another example: StreamingLinearAlgorithm has a trainOn method [3], where the model is updated. Is it worth fetching the model from Redis here?
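To illustrate the question, here is a sketch of a per-batch model update with the state held in process memory. This is plain Python, not Spark's code: the class `StreamingLinearModel`, its `train_on` method, and the toy stream are assumptions made for illustration, and the update is a single gradient step per batch on half the mean squared error.

```python
class StreamingLinearModel:
    """Hypothetical in-memory model whose weights are updated batch by batch."""

    def __init__(self, n_features, step_size=0.1):
        self.weights = [0.0] * n_features  # the whole state
        self.step_size = step_size

    def predict(self, features):
        return sum(w * x for w, x in zip(self.weights, features))

    def train_on(self, batch):
        """One gradient step, starting from the current weights."""
        if not batch:
            return
        grad = [0.0] * len(self.weights)
        for features, label in batch:
            error = self.predict(features) - label
            for i, x in enumerate(features):
                grad[i] += error * x
        n = len(batch)
        # If the state lived in an external store, a read would be needed
        # before this update and a write after it, once per batch.
        self.weights = [
            w - self.step_size * g / n for w, g in zip(self.weights, grad)
        ]


# Toy stream: 200 identical batches drawn from y = 2x (made-up data).
model = StreamingLinearModel(n_features=1)
stream = [[([1.0], 2.0), ([2.0], 4.0)]] * 200
for batch in stream:
    model.train_on(batch)
```

In this sketch the update touches only a short list of floats, so an external store would add a round trip per batch without shrinking the state. Whether that cost is justified depends on requirements the sketch does not model, such as sharing the model across processes or surviving restarts.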
[1] https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/classification/MultilayerPerceptronClassifier.scala#L291
[2] https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/classification/MultilayerPerceptronClassifier.scala#L234
[3] https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/regression/StreamingLinearAlgorithm.scala#L88