WillianFuks / tfcausalimpact

Python Causal Impact Implementation Based on Google's R Package. Built using TensorFlow Probability.
Apache License 2.0

Performance comparison to R package #14

Open mc51 opened 3 years ago

mc51 commented 3 years ago

First of all: Thanks a lot for porting this package to Python. It's greatly appreciated. I'm using the R version quite a lot, often on larger data sets. Mostly, I match a lot of observations in parallel. Hence, speed of computation is an important issue. It would be great to be able to switch to this implementation. However, I'm wondering whether you have any data or experience comparing computation time between the two versions?

WillianFuks commented 3 years ago

Hi @mc51 ,

The README.md file has a section that discusses this. Overall, results should be equivalent as far as values go, but unfortunately the Python package is slower than the original R implementation (see the README for more details).

I think the R package uses a Gibbs sampling technique, whereas this package uses TensorFlow Probability, which offers two options: Hamiltonian Monte Carlo (slower and more precise) and Variational Inference (faster and less precise). Still, both techniques are slower than R.

As for the parallel processing you mentioned, I'm not quite sure what you mean. Is it something related to processing the time series in a batch-like manner?

Maybe that would be possible to implement in this package as well; I probably just need to better understand how it's done in the R package.

Best,

Will

mc51 commented 3 years ago

Hi @WillianFuks,

Thanks for the quick and thorough reply! Do you have any idea why R is faster? I suspect the R and TF implementations are both written in a lower-level language (C or Fortran, maybe?), so this should not be an R vs. Python issue. Could it just be that a faster algorithm is used in R? Also, do you know if your package would benefit from being run on GPUs? Many TF methods are optimized for that.

When talking about parallel processing, I'm referring to cases where I have many treated / test objects. Consequently, I need to do the impact calculation many times over. This can be done by calling CausalImpact in parallel for a great speedup (foreach + doParallel libs in R). However, for a single calculation I don't know whether the R package does any parallel computing, since that wouldn't be as trivial as my example.

WillianFuks commented 3 years ago

Hi @mc51 ,

I suspect the different algorithms (Gibbs vs. HMC) are indeed what explains the difference in performance.

Note that GPUs do help and are recommended for this package as well.

As for the parallel processing, you could use Python's built-in multiprocessing package to accomplish the same task as in R.
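
To illustrate the multiprocessing suggestion, here is a minimal sketch of running one analysis per treated unit across worker processes. The names `estimate_effect`, `run_all`, and the sample units are hypothetical and not part of this package. The worker uses a simple difference of means as a stand-in so the example runs without extra dependencies; in practice its body would call CausalImpact on that unit's data instead.

```python
from multiprocessing import Pool


def estimate_effect(unit):
    """Hypothetical stand-in for a single CausalImpact fit on one treated unit.

    Assumption: in a real workflow this body would build the data for the
    unit and run CausalImpact on it. A difference of means is used here only
    to keep the sketch self-contained.
    """
    name, series, pre_end = unit
    pre, post = series[:pre_end], series[pre_end:]
    baseline = sum(pre) / len(pre)            # mean of the pre-period
    effect = sum(post) / len(post) - baseline  # post-period mean minus baseline
    return name, effect


def run_all(units, processes=2):
    """Evaluate every unit in a pool of worker processes."""
    with Pool(processes=processes) as pool:
        # pool.map sends one unit to each worker call and preserves order.
        return dict(pool.map(estimate_effect, units))


if __name__ == "__main__":
    # Each tuple: (unit name, observed series, index where the post-period starts).
    units = [
        ("store_a", [10.0, 10.0, 10.0, 12.0, 12.0], 3),
        ("store_b", [5.0, 5.0, 5.0, 5.0, 5.0], 3),
    ]
    print(run_all(units))
```

The `if __name__ == "__main__":` guard matters on platforms where worker processes are started by re-importing the main module. Each worker would also load its own copy of TensorFlow, so memory use is likely to grow with the number of processes.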