Unstable number of VB-iterations

ajafarihub commented 3 years ago

Not 100% sure, but this is what I have observed in one case-study I am currently performing, which is the VB inference of a gradient-damage material model using FEMU-F.

If I do either of:

scale some part of my forward-model output (and respectively adjust the corresponding part in Jacobian and prior noises, as well)
use different actualizations for the input parameters; e.g.: "par_actual=some_reference_constant exp(par_dimensionless)" OR "par_actual = some_reference_constant par_dimensionless"

, then I get different outputs from the VB, being as a result of different number of iterations the VB performs.

Again emphasize that I am not 100% sure if I have not made some mistake somewhere. Just maybe to keep in mind if anyone else will see such a behavior in his/her own examples.

TTitscher commented 3 years ago

I don't know if you like to hear some recommendations, anyways. Regarding the scaling:

The parameter update always includes the noise mean, so a scaling should have no effect, e.g.

L = sum([s[i] * c[i] * J[i].T @ J[i] for i in noise0]) + L0

The noise update is not that easy to reason about and the free energy update with its log(s) and digamma(c) is even worse. As this issue affects you, I recommend that you use a simple test case (model_error = deterministic_scaling * (prm[0] * x - data) or something) to investigate it further. Maybe the tolerance for the free energy needs to adapted as well...?

Regarding your second point: That boils down to a normal vs a lognormal prior, right? I would expect a difference there...

ajafarihub commented 3 years ago

Sure, I made this issue - as you wanted me - to hopefully get some clues to figure out possible reasons. But just to emphasize again, I am not yet so sure of not having made any mistake elsewhere. I also think that I need to monitor the free energy to get an idea if the tolerance must be adjusted, as well (as you also suggested). Also, I scale not the full output vector, but only a part of it (related to displacements).

To answer your last question, yes, it is about normal or lognormal distributions, and I get different results due to different number of VB-iterations. May I ask you why you expect this? I only saw a sentence in the Chappel paper: "In general parameterizing the model to make it as close to linear as possible will result in fewer convergence problems." . So, would it be the underlying reason?

TTitscher commented 3 years ago

Scaling:

You indicated that the issue may be related to a bug on your side. That is why I suggested you implement the most minimal test case that shows the behavior you described. Based on that, we can help to check the correctness of the test and, once that is done, think further.

normal vs lognormal:

In your first example, par_actual is a normally distributed, in the second example it is lognormal(?). So I would like to reverse the questions: Why would you expect two different distributions/models to behave equally?

ajafarihub commented 3 years ago

Let me explain two scenarios regarding parameterization:

Exponential

par_actual = ref * exp(par_dimensionless)

In this case, the distribution of "par_actual" is lognormal and "par_dimensionless" is normally distributed.

Linear

par_actual = ref * par_dimensionless

In this case, the distributions of both "par_actual" and "par_dimensionless" are normal.

In both cases, VB calibrates "par_dimensionless" (normally distributed). What I am saying is that, I am simply expecting quite a robust inference scheme, i.e. I would not expect to get too different results just due to different parameterizations, in particular observing that in each case the VB converges with totally different number of iterations (54 versus 14 !).

joergfunger commented 3 years ago

But in both cases, you linearize your model, and in the first case, you apply an exponential function, which is highly nonlinear.

joergfunger commented 3 years ago

Did you include this in your computation of J?

ajafarihub commented 3 years ago

well ..., @joergfunger this idea of using lognormal distribution for physically non-negative parameters is what you recommended me to do, and I think it is very logical, unless it somehow adds more nonlinearity to the model. And the VB-paper has directly said that, nonlinearity can influence the convergence. But how much?, not clear !

And yes, I did consider it in Jacobian, as well.

I'll today push a minimal example for these two aspects: parameterization (of latent parameters) & weighting factor (in model-error output).

ajafarihub commented 3 years ago

Here is the minimal example: https://github.com/BAMresearch/BayesianInference/blob/robustness/tests/test_robustness_vb_minimal.py

For the simple linear forward models already uncommented (lines 30 and 33), all 4 scenarios are very much the same. But if we use the other commented forward models (lines 31 and 34) which are more nonlinear, we see some differences. I am very happy if you could also play around them and find out some clues.

TTitscher commented 3 years ago

Thanks for providing an example. My interpretation: The scaling ...

does not influence the parameter posterior (expected)
does influence the noise posterior. Its mean by a factor of 10000 for scaling 100 and by a factor of 4000 for scaling 200. Is that expected?
The free energy is totally different... (expected by the formulas, but IMO not quantifyable)

I'm still wrapping my head around this actualizer thing...

edit What are your conclusions then? Is it a bug in VB or in your code?

ajafarihub commented 3 years ago

I did not get the second point you listed above (with a question at the end). Actualizer is some pre-processing on the VB-latent parameters to transfer them into physical values. Important to have in mind: it is about input parameters. I have no 100% conclusion, but I think:

the convergence parameters (like: maximum # of iterations, tolerance, etc.) might make a difference in some situations.
the parameterization (using that actualizer thing) could change the nonlinearity of the model, so that we get slightly or even remarkably different results.
I think it is quite useful to post-process a plot for the free energy in the VB (just in general). This plot is also depicted in the Chappel paper, so, it somehow signals the relevance of having it.
To your question: "Is it a bug in VB or in your code?", ... well, maybe it is not a matter of bug at all, but rather it has to do with kind of restrictions/observations that a user should always keep in mind when utilizing the VB. I again think of one sentence in the paper that says: "Parameterizing the model to make it as close to linear as possible will result in fewer convergence problems.".

TTitscher commented 3 years ago

does influence the noise posterior. Its mean by a factor of 10000 for scaling 100 and by a factor of 4000 for scaling 200. Is that expected?

Sorry, I again forgot that this is the precision. So the SD of the noise goes up by a factor of 100, so the precision goes down to 1/100². Makes sense IMO.

Regarding the parametrization. Again, you effectively compare a model like a x - data with exp(a) x - data' which is a different model and different data. So I would be very surprised, if the results are equal.

And regarding your comments: What could be a strategy to become 100% sure on the thing?

ajafarihub commented 3 years ago

About the strategy, I think:

focus more on the convergence criteria and see if there is any room for improvement
go to the details of deriving F (free energy), to get a better understanding of it
monitor F and its components (written in eq. 23 of Chappel paper) for some minimal example with different scaling factors, to get better feeling about the situation
maybe not so elegant: just contact the author of the paper and somehow ask him about this issue: "why does F change after scaling the model output? How can this be justified?". Personally, I would not bother myself to contact him directly as a PhD candidate (let say with my university Email). So, he might hopefully give us good clues. What would you think of this?

TTitscher commented 3 years ago

My findings so far, but I have to stop now, sorry. Instead of properly understanding, what is going on, I tried some easy fixes. Spoiler: No success so far.

The parameter update equations contain terms like

s[i] * c[i] * J[i].T @ J[i]
s[i] * c[i] * k[i].T @ k[i]

so the scaling factor (k.T @ k == scaling²) is multiplied by the noise precision, which scales with 1/scaling². That is why we get exactly the same parameter posterior. BUT the free energy term does not scale k with s*c. That could be a typo and without further thought about it, I added it to the formula.

The other terms in F that now differ are the ones with log(s) digamma(c). Technically you could adapt the shape and the scale of the noise prior such that the resulting free energies for two different scaling factors are equal. (And I did that manually.)

So question for the experts.

Could it be that the third term in F misses a factor (s*c)? An exact derivation of F, as you suggested Abbas, would clarify this.
How should the noise prior be chosen? I find (s1 c1 / s2 c2 == 1/scaling²) reasonable. But what about the individual shape/scales? They heavily influence F.

joergfunger commented 3 years ago

I agree, I think the derivation of F (eq. 3) its relation to the model evidence and the derivation of the update equations would be essential to get further inside. As for the last suggest, I would first derive the equations and try to find out, where in the derivation the problem is, and then contact the author. For example, I don't fully understand, why in eq 3 the term in the log (that approaches 1 in the optimal case of q being the exact posterior) make sense. Because we then marginalize over q(w), with a weighting that is zero (log (P(y/w)*P(w)/q(w)), but sometimes larger than 0, and sometimes less (depending on the ratio of the distributions). Thus the values might even cancel out.

TTitscher commented 3 years ago

I found this test case that I somehow forgot to commit.

Infers the parameters A and B of the linear model error
    data = A' * x + B' + N(0, sd)
    k([A, B]) = scaling * (A * x + B - data)
for different values of "scaling" and compares the resulting free energy F
from bayes.vb.variational_bayes.

I test for equal F in both cases, which fails. Is that expected? If both Fs should be equal, this could be used to verify the derived formula for F in #31

joergfunger commented 3 years ago

I'm not sure, but I would guess the free energy itself should not be scale independent, just the location of the maximum.

ajafarihub commented 3 years ago

I tried to use the last equation of attached picture for updating posterior mean of parameters based on fix-point method in the VB iterations. I however got very poor behavior, saying that something should be wrong in going this way. I am questioning myself why? To my eye, this is also an alternative equation like m=f(m,...) as a basis for the fix-point method. But it does not work in any means. Any idea or comments ?! update_m_way2

joergfunger commented 3 years ago

How do you derive the equations from the top two?

ajafarihub commented 3 years ago

... well, I simply plugged first one in the second one and then cancelled out same terms in both sides (s.c.Jt.J.m). Isn't it correct ?

ajafarihub commented 3 years ago

This is the piece of code that I changed (current code is uncommented and the new iteration is replaced). Maybe you could also try it and see how it works for your examples.

# Lm = sum([s[i] * c[i] * J[i].T @ (k[i] + J[i] @ m) for i in noise0])
# Lm += L0 @ m0
# m = Lm @ L_inv

rhs = sum([s[i] * c[i] * J[i].T @ k[i] for i in noise0])
L0_inv = np.linalg.inv(L0)
m = m0 + L0_inv @ rhs

ic-lima commented 3 years ago

I would say that you cannot cancel the terms with m, because, if you look at Chappell paper (eq 20), you have m_new on the LHS and m_old on the RHS...

ajafarihub commented 3 years ago

I see your point @ic-lima . However, I started to play around the algebraic equations that Annika derived (prior to any iteration), so, I wanted to use another equation like m=f(m,...) in the fixed-point iteration. But that worked very very bad ! So, the question in my mind is how to justify this behavior.

ajafarihub commented 3 years ago

The only justification that I could find is that, basically, fixed-point method is very much depending on how one define his/her fixed-functions (X=f(X)). So, the method might work well for one choice but fails for another one.

joergfunger commented 3 years ago

I would say you should get the same solution either by using fixed point iteration or directly substitution.

aradermacher commented 3 years ago

I would agree with you Jörg. But I checked it and by canceling the sc J^T J m part, I saw the same weird behavior as Abbas.

BAMresearch / bayem

Unstable number of VB-iterations #18