blue-yonder / pydse

Python package for dynamic system estimation of time series
http://pydse.readthedocs.org/

About theoretical background #4

Closed franchesoni closed 5 years ago

franchesoni commented 5 years ago

I don't quite understand how the code calculates the coefficients. In particular, how is the log likelihood related to the eigenvalues that form the cost function?

I'd be glad if you could provide a reference or an explanation.

Thanks

FlorianWilhelm commented 5 years ago

Hi, you could take a look at the references inside https://cran.r-project.org/web/packages/dse/vignettes/Guide.pdf

franchesoni commented 5 years ago

@FlorianWilhelm I've read the guide you sent, and also the references contained there. There's some part of the S code that is similar to your code, but I could not find (maybe I'm not a good searcher) a theoretical explanation of the log likelihood calculation.

Probably there is some optimization/statistics theory I don't know, but Gilbert's publications don't seem to have an answer. A cost function built upon the eigenvalues of a matrix of correlations (between variables with no lag) puzzles me.

Thanks for the previous response, I hope you can help me.

FlorianWilhelm commented 5 years ago

@franchesoni, note that the product of all eigenvalues of a matrix equals the determinant of that matrix. In the pdf of a multivariate normal distribution you need to calculate the determinant of the covariance matrix. Got it?
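
A quick numerical illustration (just a numpy sketch of the general fact, not the actual pydse code):

```python
import numpy as np

rng = np.random.default_rng(0)

# Any symmetric positive-definite matrix, e.g. a sample covariance.
x = rng.standard_normal((100, 3))
cov = x.T @ x / len(x)

eigvals = np.linalg.eigvalsh(cov)

# The product of the eigenvalues equals the determinant ...
assert np.allclose(np.prod(eigvals), np.linalg.det(cov))

# ... so the log-determinant term of the multivariate normal pdf
# can be computed as a sum of log-eigenvalues.
assert np.allclose(np.sum(np.log(eigvals)), np.log(np.linalg.det(cov)))
```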

franchesoni commented 5 years ago

I got that. This is what I understood:

1. I should assume that the residuals are independent and identically distributed according to a multivariate normal distribution with zero mean.
2. Then I can write down the negative log likelihood and identify the terms 'like1' and 'const' of your code.

It's the other term that I don't understand yet. Somehow (lacking LaTeX here), res_i^T cov^{-1} res_i is the same as len(s).

Could you clarify that last step? thanks!
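
For reference, this is my reading of the decomposition in code (a numpy sketch of my understanding, not your actual implementation; the splitting into these terms is my assumption):

```python
import numpy as np

def negloglike_terms(res):
    """Negative log likelihood of iid zero-mean multivariate normal
    residuals, split into the terms as I understand them.
    res has shape (sampleT, p)."""
    sampleT, p = res.shape
    cov = res.T @ res / sampleT      # ML estimate of the covariance
    s = np.linalg.eigvalsh(cov)      # eigenvalues, len(s) == p here

    const = 0.5 * sampleT * p * np.log(2 * np.pi)
    like1 = 0.5 * sampleT * np.sum(np.log(s))   # log det via eigenvalues

    # The term I don't get yet: sum_i res_i^T cov^{-1} res_i,
    # which supposedly collapses to something involving len(s).
    cov_inv = np.linalg.inv(cov)
    like2 = 0.5 * sum(r @ cov_inv @ r for r in res)
    return const + like1 + like2
```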

FlorianWilhelm commented 5 years ago

It's actually only a bit of linear algebra. Just keep in mind that cov = res^T res / sampleT, do some transformations, and you arrive at the number of non-zero eigenvalues, which is len(s). Got it?

franchesoni commented 5 years ago

Thanks for the response!

Unfortunately, it does not seem immediate at all to me (and I've tried!). I would appreciate it if you could write out that bit of linear algebra I'm lacking. Sorry for the nitpicking, it's the final step!

FlorianWilhelm commented 5 years ago

Maybe I underestimated this... it was quite some time ago that I programmed it, and I also peeked a bit into the implementation of DSE. I would say that in general res_i^T cov^{-1} res_i does not equal len(s), as can easily be seen by constructing a counterexample. So maybe it's just an approximation if you sum over all of them.

Also, let's assume we only have a univariate time series; then cov^{-1} is just a scalar, which only equals 1 = len(s) if that is the std of res. In this case the sum over all res_i^T cov^{-1} res_i could be approximated by sampleT * len(s) due to the definition of cov. In the general case, where cov != 1 for p = 1 or where we have eigenvalues not equal to 1 for p > 1, this is not the case. At least those are my findings after putting an hour of thought into it.

Could you tell me a bit more about your motivation for knowing this? Is this some student work, or why are you so eager to know this technical detail? Have you also tried to contact the original author of DSE? In general, PyDSE is more of a prototype and is not actively developed anymore...

franchesoni commented 5 years ago

@FlorianWilhelm , thanks for taking the time. I appreciate it.

My main motivation, aside from learning ambitions, comes from my job. I work at the Solar Energy Lab on forecasting through time series analysis, and I'm exploring the performance of ARMA models (right now I'm comparing static digital filtering with statsmodels' Kalman filtering approach).

As I don't have anyone double-checking my work, I can't copy or implement things without making sure that they are correct (not that I doubt you, it's just that I can't leave giant bugs in, or even think about publishing unfounded research). Nevertheless, your code is great, just what I need.

As I found analytically, when p = 1 and the residuals are non-zero, the equality holds (the squared L2 norms cancel, leaving only sampleT). As you said, the general case is not that obvious (at least to me). I will continue asking elsewhere, as you suggest, confident in the correctness of the method (I've run a few tests).
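
The p = 1 case indeed checks out numerically (a small sketch, assuming plain numpy):

```python
import numpy as np

rng = np.random.default_rng(1)
res = rng.standard_normal(500)    # univariate residuals, p = 1
sampleT = len(res)

cov = res @ res / sampleT         # scalar ML covariance estimate
# sum_i res_i * cov^{-1} * res_i: the squared L2 norms cancel,
# leaving exactly sampleT.
assert np.allclose(np.sum(res**2) / cov, sampleT)
```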

Thanks for answering all of it! Cheers

FlorianWilhelm commented 5 years ago

How about this, @franchesoni? Best to copy it into some LaTeX-aware editor.

We have for $r\in\mathbb{R}^{n\times p}$ (with $n = \mathrm{sampleT}$) that $\frac{r^T r}{n} = \mathrm{cov}\in\mathbb{R}^{p\times p}$, and we want to calculate $\sum_{i=1}^{n} r_i\,\mathrm{cov}^{-1} r_i^T$, where $r_i\in\mathbb{R}^{1\times p}$ denotes the $i$-th row of $r$.

Putting everything together, we have $\sum_{i=1}^{n} r_i\,\mathrm{cov}^{-1} r_i^T = \sum_{i,j=1}^{n,p}\left(r\,\mathrm{cov}^{-1}\circ r\right)_{i,j}$, where $\circ$ is the Hadamard product. Since $\sum_{i,j}\left(A\circ B\right)_{i,j}=\mathrm{trace}\left(B^T A\right)$, we get $\sum_{i,j=1}^{n,p}\left(r\,\mathrm{cov}^{-1}\circ r\right)_{i,j} =\mathrm{trace}\left(r^T r\,\mathrm{cov}^{-1}\right)=\mathrm{trace}\left(n\,\mathrm{cov}\cdot \mathrm{cov}^{-1}\right)=n\cdot p$. For the degenerate case, $p$ is replaced by the number of non-zero eigenvalues.
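
And a quick numerical check of the identity (a sketch with numpy):

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 200, 4
r = rng.standard_normal((n, p))
cov = r.T @ r / n
cov_inv = np.linalg.inv(cov)

# sum_i r_i cov^{-1} r_i^T = trace(r^T r cov^{-1})
#                          = trace(n cov cov^{-1}) = n * p
quad = sum(ri @ cov_inv @ ri for ri in r)
assert np.allclose(quad, n * p)
assert np.allclose(np.trace(r.T @ r @ cov_inv), n * p)
```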

franchesoni commented 5 years ago

Great, @FlorianWilhelm! It was just some linear algebra indeed.

About the degenerate case: I asked on Cross Validated, and there is an intuitive explanation that relates to PCA. Basically, when some eigenvalue of the covariance is negligible, we can use the principal component representation of the data and arrive at the case you just nicely covered.

I've not done the math yet, but the intuition is there. Thanks for the very helpful responses!
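
A quick numerical check of the degenerate case (a sketch, using numpy's pseudo-inverse in place of the inverse):

```python
import numpy as np

rng = np.random.default_rng(3)
n, p, k = 200, 4, 2                      # covariance of rank k < p
basis = rng.standard_normal((k, p))
r = rng.standard_normal((n, k)) @ basis  # residuals in a k-dim subspace
cov = r.T @ r / n
cov_pinv = np.linalg.pinv(cov)

# With the pseudo-inverse, the quadratic form sums to n times the
# number of non-zero eigenvalues (the rank k), not n * p.
quad = sum(ri @ cov_pinv @ ri for ri in r)
assert np.allclose(quad, n * k)
```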