cristiprg / BDAPRO.GlobalStateML

This repository contains my solution to the project "Machine learning algorithms with global state" from the BDAPRO class at TU Berlin. (The repo is based on BDAPRO.WS1617)
Apache License 2.0

Q2: Why do we need a distributed database in the first place? #7

Open cristiprg opened 7 years ago

cristiprg commented 7 years ago

@jeyhunkarimov From what I understand, we have to maintain a state of the ML algorithms. By state I understand something similar to a model that we maintain and update based on the training data:

Training data ----> Training procedure ----> Model / State.

Then we use this state for prediction:

Query point ----> Apply model ----> Prediction.

Now, this model is usually very small compared to the amount of training/testing data. For example, the model class MultilayerPerceptronClassificationModel [1] that the MultilayerPerceptronClassifier [2] creates after training contains just some "layers" and some "weights", i.e. an array of ints (the layer sizes) and a vector of doubles (the weights); there is absolutely no need for a complicated/distributed database.
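To put a number on "very small", here is a back-of-the-envelope sketch (the layer sizes are made up for illustration, not taken from the project):

```scala
// The entire "state" of a trained MLP is the layer-size array plus one flat
// weight vector, exactly as in MultilayerPerceptronClassificationModel [1].
object ModelSizeEstimate extends App {
  val layers = Array(784, 100, 10) // illustrative sizes

  // weights + biases between each pair of consecutive layers
  val numParams = layers.sliding(2).map { case Array(in, out) => in * out + out }.sum

  val bytes = numParams * 8L // one double per parameter
  println(f"$numParams parameters, ${bytes / 1024.0}%.0f KB") // 79510 parameters, 621 KB
}
```

Even a much larger network would still fit comfortably in a single machine's memory.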

This is why I don't see the need for a system such as Redis. I'm fairly sure there are some errors in my reasoning; can you point out what I got wrong?

Edit: Another example: the StreamingLinearAlgorithm has a trainOn method [3] where the model is updated on every batch. Is it worth fetching the model from Redis here? (A sketch of what that would look like follows the references.)

[1] https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/classification/MultilayerPerceptronClassifier.scala#L291

[2] https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/classification/MultilayerPerceptronClassifier.scala#L234

[3] https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/regression/StreamingLinearAlgorithm.scala#L88
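To make the Redis question concrete, here is a hypothetical sketch of the round-trip a trainOn-style loop would need per batch. The key name, the CSV encoding of the weights, and the Jedis client are my own assumptions, nothing prescribed by the project:

```scala
import redis.clients.jedis.Jedis

object RedisModelStore {
  private val jedis = new Jedis("localhost", 6379)

  // Fetch the current weight vector; empty if no model has been stored yet.
  def loadWeights(key: String = "model:weights"): Array[Double] =
    Option(jedis.get(key)) // Jedis returns null for a missing key
      .map(_.split(',').map(_.toDouble))
      .getOrElse(Array.empty[Double])

  // Write the updated weights back after each mini-batch.
  def saveWeights(weights: Array[Double], key: String = "model:weights"): Unit =
    jedis.set(key, weights.mkString(","))
}
```

For a weight vector of a few thousand doubles this is just two small network round-trips per batch, which is exactly why it feels like overkill when the state fits on the driver.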

jeyhunkarimov commented 7 years ago

@cristiprg

cristiprg commented 7 years ago

@jeyhunkarimov

> Currently, big systems combine continuous training and serving, so training and serving are not separated from each other. This is especially useful in streaming scenarios.

I'm not sure I understand what you mean by training and serving not being separated from each other. Do you mean that all the design decisions take both fast training and fast serving into account? So the system has to process training points and query points rapidly (indeed, this can imply sacrificing some consistency). I can see why this is useful in streaming applications.
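For instance, Spark's DStream-based API already interleaves the two: trainOn updates the model on every mini-batch while predictOn serves queries against the same continuously updated weights. A minimal sketch, assuming placeholder socket sources and input formats:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.{LabeledPoint, StreamingLinearRegressionWithSGD}
import org.apache.spark.streaming.{Seconds, StreamingContext}

object TrainAndServe extends App {
  val conf = new SparkConf().setAppName("train-and-serve").setMaster("local[2]")
  val ssc = new StreamingContext(conf, Seconds(1))

  // Placeholder formats: "label,f1 f2 f3" for training, "f1 f2 f3" for queries.
  val training = ssc.socketTextStream("localhost", 9999).map { line =>
    val Array(label, feats) = line.split(",")
    LabeledPoint(label.toDouble, Vectors.dense(feats.split(" ").map(_.toDouble)))
  }
  val queries = ssc.socketTextStream("localhost", 9998)
    .map(line => Vectors.dense(line.split(" ").map(_.toDouble)))

  val model = new StreamingLinearRegressionWithSGD()
    .setInitialWeights(Vectors.zeros(3))

  model.trainOn(training)          // model weights updated per batch
  model.predictOn(queries).print() // predictions served with the latest weights

  ssc.start()
  ssc.awaitTermination()
}
```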

jeyhunkarimov commented 7 years ago

@cristiprg Hope this helps: "Scaling Distributed Machine Learning with the Parameter Server" (OSDI '14): https://www.usenix.org/system/files/conference/osdi14/osdi14-paper-li_mu.pdf

My second point is out of the scope of this project; I mentioned it to motivate the need for such systems.

jeyhunkarimov commented 7 years ago

@cristiprg https://databricks.com/blog/2015/01/28/introducing-streaming-k-means-in-spark-1-2.html
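The API from that post is small; a minimal sketch (the stream source, dimensionality, and parameters below are placeholders):

```scala
import org.apache.spark.SparkConf
import org.apache.spark.mllib.clustering.StreamingKMeans
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.streaming.{Seconds, StreamingContext}

object StreamingKMeansSketch extends App {
  val ssc = new StreamingContext(
    new SparkConf().setAppName("streaming-kmeans").setMaster("local[2]"), Seconds(5))

  // Placeholder source: whitespace-separated features, one point per line.
  val trainData = ssc.textFileStream("/tmp/train")
    .map(line => Vectors.dense(line.split(" ").map(_.toDouble)))

  val model = new StreamingKMeans()
    .setK(3)
    .setDecayFactor(1.0)      // controls how quickly old batches are forgotten
    .setRandomCenters(2, 0.0) // dim = 2, initial center weight = 0

  model.trainOn(trainData)    // cluster centers are updated on every batch
  // model.latestModel().predict(point) always reflects the newest centers

  ssc.start()
  ssc.awaitTermination()
}
```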