cristiprg / BDAPRO.GlobalStateML

This repository contains my solution to the project "Machine learning algorithms with global state" from the BDAPRO class at TU Berlin. (The repo is based on BDAPRO.WS1617)
Apache License 2.0

Q2: Why do we need a distributed database in the first place? #7

Open cristiprg opened 7 years ago

cristiprg commented 7 years ago

@jeyhunkarimov From what I understand, we have to maintain a state of the ML algorithms. By state I understand something similar to a model that we maintain and update based on the training data:

Training data ----> Training procedure ----> Model / State.

Then we use this state for prediction:

Query point ----> Apply model ----> Prediction.

Now, this model is usually very small compared to the amount of training/testing data. For example, the model class MultilayerPerceptronClassificationModel [1] that the MultilayerPerceptronClassifier [2] creates after training contains just some "layers" and some "weights", i.e. an array of ints (the layer sizes) and a vector of doubles (the weights); there is absolutely no need for a complicated/distributed database.
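To put a number on "very small", here is a back-of-the-envelope sketch (the layer sizes are made up for illustration, not taken from the project):

```scala
// The entire "state" of a trained MLP is the layer-size array plus one flat
// weight vector, exactly as in MultilayerPerceptronClassificationModel [1].
object ModelSizeEstimate extends App {
  val layers = Array(784, 100, 10) // illustrative sizes

  // weights + biases between each pair of consecutive layers
  val numParams = layers.sliding(2).map { case Array(in, out) => in * out + out }.sum

  val bytes = numParams * 8L // one double per parameter
  println(f"$numParams parameters, ${bytes / 1024.0}%.0f KB") // 79510 parameters, 621 KB
}
```

Even a much larger network would still fit comfortably in a single machine's memory.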

This is why I don't see the need for a system such as Redis. I'm fairly sure there are some errors in my reasoning; can you point out what I got wrong?

Edit: Another example: the StreamingLinearAlgorithm has a trainOn method [3] where the model is updated on every batch. Is it worth fetching the model from Redis here? (A sketch of what that would look like follows the references.)

[1] https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/classification/MultilayerPerceptronClassifier.scala#L291

[2] https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/classification/MultilayerPerceptronClassifier.scala#L234

[3] https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/regression/StreamingLinearAlgorithm.scala#L88
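To make the Redis question concrete, here is a hypothetical sketch of the round-trip a trainOn-style loop would need per batch. The key name, the CSV encoding of the weights, and the Jedis client are my own assumptions, nothing prescribed by the project:

```scala
import redis.clients.jedis.Jedis

object RedisModelStore {
  private val jedis = new Jedis("localhost", 6379)

  // Fetch the current weight vector; empty if no model has been stored yet.
  def loadWeights(key: String = "model:weights"): Array[Double] =
    Option(jedis.get(key)) // Jedis returns null for a missing key
      .map(_.split(',').map(_.toDouble))
      .getOrElse(Array.empty[Double])

  // Write the updated weights back after each mini-batch.
  def saveWeights(weights: Array[Double], key: String = "model:weights"): Unit =
    jedis.set(key, weights.mkString(","))
}
```

For a weight vector of a few thousand doubles this is just two small network round-trips per batch, which is exactly why it feels like overkill when the state fits on the driver.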

jeyhunkarimov commented 7 years ago

@cristiprg

cristiprg commented 7 years ago

@jeyhunkarimov

> Currently, big systems combine continuous training and serving, so training and serving are not separated from each other. This is especially useful in streaming scenarios.

I'm not sure I understand what you mean by training and serving not being separated from each other. Do you mean that all the design decisions take both fast training and fast serving into account? So the system has to process training points and query points rapidly (indeed, this can imply sacrificing some consistency). I can see why this is useful in streaming applications.
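For instance, Spark's DStream-based API already interleaves the two: trainOn updates the model on every mini-batch while predictOn serves queries against the same continuously updated weights. A minimal sketch, assuming placeholder socket sources and input formats:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.{LabeledPoint, StreamingLinearRegressionWithSGD}
import org.apache.spark.streaming.{Seconds, StreamingContext}

object TrainAndServe extends App {
  val conf = new SparkConf().setAppName("train-and-serve").setMaster("local[2]")
  val ssc = new StreamingContext(conf, Seconds(1))

  // Placeholder formats: "label,f1 f2 f3" for training, "f1 f2 f3" for queries.
  val training = ssc.socketTextStream("localhost", 9999).map { line =>
    val Array(label, feats) = line.split(",")
    LabeledPoint(label.toDouble, Vectors.dense(feats.split(" ").map(_.toDouble)))
  }
  val queries = ssc.socketTextStream("localhost", 9998)
    .map(line => Vectors.dense(line.split(" ").map(_.toDouble)))

  val model = new StreamingLinearRegressionWithSGD()
    .setInitialWeights(Vectors.zeros(3))

  model.trainOn(training)          // model weights updated per batch
  model.predictOn(queries).print() // predictions served with the latest weights

  ssc.start()
  ssc.awaitTermination()
}
```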

jeyhunkarimov commented 7 years ago

@cristiprg Hope this helps: "Scaling Distributed Machine Learning with the Parameter Server" (OSDI '14): https://www.usenix.org/system/files/conference/osdi14/osdi14-paper-li_mu.pdf

My second point is out of the scope of this project; I mentioned it to motivate the need for such systems.

jeyhunkarimov commented 7 years ago

@cristiprg https://databricks.com/blog/2015/01/28/introducing-streaming-k-means-in-spark-1-2.html
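The API from that post is small; a minimal sketch (the stream source, dimensionality, and parameters below are placeholders):

```scala
import org.apache.spark.SparkConf
import org.apache.spark.mllib.clustering.StreamingKMeans
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.streaming.{Seconds, StreamingContext}

object StreamingKMeansSketch extends App {
  val ssc = new StreamingContext(
    new SparkConf().setAppName("streaming-kmeans").setMaster("local[2]"), Seconds(5))

  // Placeholder source: whitespace-separated features, one point per line.
  val trainData = ssc.textFileStream("/tmp/train")
    .map(line => Vectors.dense(line.split(" ").map(_.toDouble)))

  val model = new StreamingKMeans()
    .setK(3)
    .setDecayFactor(1.0)      // controls how quickly old batches are forgotten
    .setRandomCenters(2, 0.0) // dim = 2, initial center weight = 0

  model.trainOn(trainData)    // cluster centers are updated on every batch
  // model.latestModel().predict(point) always reflects the newest centers

  ssc.start()
  ssc.awaitTermination()
}
```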