No, it's not distributed per se. The reason is that I'm using this algorithm inside another model that is already distributed. If I used RDDs inside this implementation, I would run into nested RDDs, which Spark does not support, as soon as I called it from that second distributed model. In my case this design works well because I need to compute thousands of vector autoregressions, each on small data: a single vector autoregression is cheap, and the expensive part is fitting many of them.
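To illustrate the design, here is a minimal sketch in plain Python rather than the actual Scala code. It assumes a simplified stand-in for the model (a univariate AR(1) fit by least squares instead of a full vector autoregression), and the names `fit_ar1` and `fit_many` are hypothetical. The point is only that each fit is purely local, so the parallelism lives in the outer map; the thread pool here stands in for the distributed map of the enclosing model.

```python
from concurrent.futures import ThreadPoolExecutor


def fit_ar1(series):
    """Fit x[t] = c + phi * x[t-1] by ordinary least squares, entirely in local memory."""
    x = series[:-1]  # lagged values
    y = series[1:]   # current values
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    phi = sxy / sxx
    c = mean_y - phi * mean_x
    return c, phi


def fit_many(series_by_key, workers=4):
    """Apply the local fit to many small series; only this outer map is parallel."""
    keys = list(series_by_key)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        fits = list(pool.map(fit_ar1, (series_by_key[k] for k in keys)))
    return dict(zip(keys, fits))
```

Because `fit_ar1` touches nothing but its own argument, it can be called from inside any outer map without creating a distributed collection within another one.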
Note also that the official implementation of univariate autoregression, https://github.com/sryza/spark-timeseries/blob/master/src/main/scala/com/cloudera/sparkts/models/Autoregression.scala, also uses local vectors and math3.stat.regression.
There are a few options for making this distributed, however. For example, you could broadcast the dense matrix on which I run the multiple linear regressions inside the methods lagSelection and fit, and you could partition the indexes of the loops and replace the loops with a few maps.
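A rough sketch of the second option, again in plain Python and under stated assumptions: the loop over candidate lags is split into chunks of indexes, and a scoring function is mapped over the chunks while the data is shared read-only (in Spark this shared data would presumably be the broadcast matrix). The names `chunk_indices`, `lag_score`, `score_chunk` and `select_lag` are hypothetical, and the score is a placeholder (the residual sum of squares from regressing x[t] on x[t-lag]), not the criterion that lagSelection actually uses.

```python
def chunk_indices(n, n_chunks):
    """Split the lag indexes 1..n into contiguous chunks (at most n_chunks of them)."""
    lags = list(range(1, n + 1))
    size = -(-n // n_chunks)  # ceiling division
    return [lags[i:i + size] for i in range(0, n, size)]


def lag_score(data, lag):
    """Placeholder score: residual sum of squares from regressing x[t] on x[t-lag]."""
    x = data[:-lag]
    y = data[lag:]
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return sum((b - intercept - slope * a) ** 2 for a, b in zip(x, y))


def score_chunk(data, lags):
    """Score every lag in one chunk; each chunk is independent of the others."""
    return [(lag, lag_score(data, lag)) for lag in lags]


def select_lag(data, max_lag, n_chunks=2):
    """Map the scoring over the chunks, then reduce to the lag with the lowest score."""
    chunks = chunk_indices(max_lag, n_chunks)
    parts = map(lambda chunk: score_chunk(data, chunk), chunks)
    scored = [pair for part in parts for pair in part]
    return min(scored, key=lambda pair: pair[1])[0]
```

The built-in `map` runs sequentially here; the structure is what matters, since each chunk reads only the shared data and its own indexes and could therefore be handed to a separate worker.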
I can try to make those changes to distribute it; just point me to some data on which you see performance issues with this implementation.