ericproffitt / TopicModelsVB.jl

A Julia package for variational Bayesian topic modeling.

Decreasing ELBO after a large number of iterations #14

pwl opened this issue 6 years ago

pwl commented 6 years ago

I've been training the gpuCTM model, and I noticed that after the initial phase of ELBO increase, the training started to yield negative delta ELBO values (this happened after ~1000-10000 iterations depending on the data). I was curious what would happen if I kept training the model (I thought it was just a temporary increase in gradient on the way to a better local optimum), and after even more iterations the model simply crashed with a NaN error message.

I've got two questions:

  1. What is the cause of the decreasing ELBO, and is this a bug? I didn't notice anything off about the results after the ELBO started decreasing, so maybe it's just a numerical instability in computing the ELBO, not in the optimization algorithm?

  2. If a decreasing ELBO is expected for this implementation, what would be a good stopping condition? Should I simply halt training as soon as the ELBO starts decreasing? This implementation has some stopping conditions (all the *tol values), but none of them was triggered before the ELBO started to decrease.

ericproffitt commented 6 years ago

Hi, thanks for bringing this to my attention!

Ideally the ELBO should never decrease. Variational Bayesian optimization iterates deterministically towards a local maximum and does not perform metaheuristics in search of better/global optima.

However, because of the need to invert ill-conditioned matrices in the CTM models, ∆elbo can at times go negative early on, and eventually does have a tendency to go negative indefinitely. In the algorithm this numerical instability is lessened by regularizing the covariance matrix once it becomes near-singular; however, just like perturbations caused by numerical error, regularization also invariably perturbs the direction of ascent. What's more concerning to me is the NaN error crash, as significant effort was put into making the optimization algorithms numerically safe.
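For concreteness, a minimal sketch of this kind of regularization (illustrative only; the condition-number threshold and ridge size here are placeholders, not the package's actual values):

    using LinearAlgebra

    # Tikhonov-style regularization of a near-singular covariance matrix:
    # once the condition number blows up, nudge the diagonal away from
    # singularity. Threshold and ridge size are hypothetical.
    function regularize_sigma!(sigma::Matrix{Float64}; maxcond=1e6, ridge=1e-4)
        if cond(sigma) > maxcond
            sigma[diagind(sigma)] .+= ridge
        end
        return sigma
    end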

If I might ask, was this done using one of the prepackaged datasets, or using some other dataset? I could try to troubleshoot this bug, but that would require recreating the conditions which led to it. Would it be possible to share with me the dataset you ran this on, as well as the exact code you used to run it?

Also, did this NaN error crash only occur on the gpuCTM model, or did it also occur on the CTM or fCTM models? Because if it only occurred on the GPU, then it could also be a bug in the OpenCL package, or a GPU hardware/firmware specific idiosyncrasy, which would be beyond the scope of this package to control for.

Finally, in regard to stopping conditions, the regular tol value is what you want to use to control stopping time, since a ∆elbo below a small tolerance indicates that you're nearing a local optimum. However, if you notice the CTM models going negative indefinitely, then there's no reason to continue training, as this indicates that the algorithm can no longer determine the direction of ascent. You might consider setting a lower number of iterations and periodically inspecting the topics visually for coherence before deciding whether or not to continue training the model.
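For example, one pragmatic recipe is to train in short bursts and halt at the first negative ∆elbo. This is only a sketch: it assumes train! accepts iter/tol keyword arguments (as in the README) and that the fitted model exposes its current ELBO as an elbo field; adjust to the actual API.

    using TopicModelsVB

    corp = readcorp(:nsf)                          # prepackaged NSF corpus
    model = CTM(corp, 9)

    prev_elbo = -Inf
    for burst in 1:40
        train!(model, iter=25, tol=0.0)            # 25 iterations per burst
        delbo = model.elbo - prev_elbo             # assumes an elbo field
        println("burst $burst: ∆elbo = $delbo")
        delbo < 0 && break                         # halt at first decrease
        prev_elbo = model.elbo
    end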

pwl commented 6 years ago

The corpus that I used is my own; you can find it here (in JLD2 format) if you need it. As you say, the negative delta ELBO values do eventually show up (actually, they show up quite quickly). However, they do not seem to be decreasing in magnitude, which would hint at some sort of convergence; instead they display a kind of oscillatory behavior. I plotted several quantities of the fitted model, and these oscillations show up in all of them to a varying degree. I haven't hit it in this case, but if I keep iterating I'll presumably eventually get NaNs (I'll try running the model longer, but it'll take some time to fail).

The reason I asked about the stopping condition is that in my example it is pretty hard to judge, just by looking at the ELBO, at which point the inference should be stopped. At first the ELBO increases, then it reaches a maximum and starts decreasing smoothly until it enters the oscillatory regime, where it starts taking sharp periodic steps. It's unclear to me in which of these regimes I should stop the iterations. Also, it is worth noting that the number of iterations I used greatly exceeds the number you used in README.md: I ran the model for a total of 20,000 iterations, while you used 150 in the examples. But then again, how did you know to choose 150 steps?

So far I've only tested it on gpuCTM, because the CPU version takes much longer to train.

pwl commented 6 years ago

I did some more experiments with different numbers of topics, and there always seems to be an eigenvalue of sigma that tends to zero, along with one of the vsq values for all data points. This is speculation on my part, but here is one possible explanation. It looks to me like the whole model effectively lives on a hyperplane perpendicular to ones(K), because the probabilities p_i = exp(eta_i)/sum(exp(eta_j)) are invariant under eta -> eta + x*ones(K) for any x, so all the Gaussians can simply be projected onto such a plane without any loss of information. The algorithm somehow exploits this and reduces the variance of all the Gaussians in one direction to zero (it can do so without changing the projections of the Gaussians onto the said plane); this reduces the rank of sigma, which seems to be causing numerical instabilities in inv(sigma). I'm not sure what the exact mechanism is, nor why the regularization term isn't stopping this from happening (maybe it is, to a degree?).
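The invariance is easy to verify numerically; a standalone check, independent of the package:

    # Softmax probabilities are unchanged by a shift along ones(K).
    softmax(eta) = exp.(eta) ./ sum(exp.(eta))

    eta = randn(5)
    @assert softmax(eta) ≈ softmax(eta .+ 3.7)   # eta -> eta + x*ones(K)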

Now, if the above is true, maybe it would make more sense to restrict the space where all the Gaussians live to the hyperplane perpendicular to ones(K)? This would likely remove any such numerical instabilities, and perhaps add a marginal performance boost (one less dimension for most of the variables). But then it wouldn't be the algorithm from the paper anymore.

ericproffitt commented 6 years ago

This is an incisive observation, and after taking a look back at Blei's CTM paper, I think it's likely correct. Because fitting the hyperparameters of a Bayesian model means propagating information back up through the hierarchy, information about sigma must pass through a probability vector before reaching them, and so with only K-1 dimensions of information available, the best one could ever hope to do is estimate a rank-(K-1) covariance matrix.

This appears to be a fundamental limitation of the model, since one would prefer to make no assumptions about the correlations between topics, e.g. the topics in a corpus could be independent, which would be outside the scope of the model to capture.

With regard to your final paragraph, that's an interesting idea: project out of the (K-1)-simplex into R^(K-1), and then run the remaining portion of the topic model in (K-1)-dimensional space. This may be worth exploring as an option.
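A rough sketch of what that projection might look like (hypothetical code, not part of the package): build an orthonormal basis B for the hyperplane orthogonal to ones(K) and work with the K-1 coordinates B'*eta.

    using LinearAlgebra

    # Columns of B form an orthonormal basis of the hyperplane orthogonal
    # to w = ones(K)/sqrt(K); the first column of Q spans w itself.
    K = 5
    w = ones(K) / sqrt(K)
    B = Matrix(qr(hcat(w, randn(K, K - 1))).Q)[:, 2:K]

    softmax(eta) = exp.(eta) ./ sum(exp.(eta))
    eta = randn(K)
    eta_reduced = B' * eta                    # K-1 free coordinates

    # Nothing is lost: eta and its reconstruction differ only by a
    # multiple of ones(K), so the topic probabilities agree.
    @assert softmax(eta) ≈ softmax(B * eta_reduced)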

Another option, reworking the theory of the model, would be to find a way to allow for convergence to full-rank covariance matrices. You could potentially replace the multinomial distribution with something like the multivariate Polya distribution (the Dirichlet compound multinomial distribution), which would free you from having to normalize your input parameter.
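Distributions.jl implements this distribution as DirichletMultinomial; a toy illustration, unrelated to the package's internals:

    using Distributions

    # Multivariate Polya: the parameter vector need not be normalized,
    # unlike a multinomial probability vector.
    alpha = [2.0, 0.5, 1.3, 4.0, 0.2]
    d = DirichletMultinomial(100, alpha)   # 100 word draws per document
    word_counts = rand(d)                  # one simulated document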

As to why Tikhonov regularization is not doing its job, that's still a mystery to me. If you really want to get your hands dirty, the CTM.jl file and a small dataset are probably what you'll want to start experimenting with. The file is essentially a block-coordinate ascent algorithm, with the CTM struct defined at the top and the train! function, defined to dispatch on it, at the bottom.

It's possible, as you say, that regularization is working but only to a degree. I could put an even lower upper bound on the condition number of sigma. Adding a constant to the diagonal of the covariance matrix is relatively benign, but I'm hesitant to apply too much bias to it, since it's not clear to me what the unintended effects on the coordinate ascent algorithm are when a biased sigma is fed back into it.

invsigma is used in the update equations for both lambda and vsq, both of which use Newton's method (in the optimization context). The tendency for the ELBO to go negative only after approximately reaching a local optimum suggests that I may be able to fix the problem by reworking these optimization algorithms.

In response to your previous comment on Feb 18: I chose 150 iterations because I saw from the small changes in the ELBO that after 150 iterations the model was approaching a local optimum (or, I suppose, possibly a saddle point), and at 150 I inspected topic coherence using the showtopics function; the topics looked good, so I went with it. Since your corpus vocabulary consists of gene names, visually inspecting topic coherence isn't likely to be that useful. In that case I'm inclined to simply stop the algorithm once it appears from the small changes in the ELBO that it's approaching a local optimum, but before ∆elbo starts going negative.

pwl commented 6 years ago

Here is a wild idea: normalize sigma right after updating it, so that the variance is 1 in the direction of w = ones(K)/sqrt(K) and unchanged in the directions perpendicular to it. This should be enough to prevent the rank reduction, and can be realized by setting

    sigma_norm = S'*sigma*S
    S = I + alpha*Pw
    Pw = ones(K,K)/K

where Pw is the projection operator onto w. The choice of alpha comes from the constraint 1 = w'*sigma_norm*w = w'*sigma*w*(1+alpha)^2, from which it follows that alpha = 1/sqrt(w'*sigma*w) - 1.

This way the variance in the direction of w is w'*sigma_norm*w = 1 (by definition), and v'*sigma_norm*v = v'*sigma*v if v is perpendicular to w (because Pw*v = 0). The convenience of this solution is that it is just three lines in updateSigma!:

    # I is the identity from LinearAlgebra (available at module scope).
    alpha = 1/sqrt(sum(model.sigma)/model.K) - 1
    Pw = ones(model.K, model.K)/model.K
    model.sigma = (I + alpha*Pw)*model.sigma*(I + alpha*Pw)

(Here I used the special form w = ones(K)/sqrt(K), which reduces w'*sigma*w to the simple sum sum(sigma)/K.)
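A standalone check of the algebra, with a random positive-definite sigma standing in for the model's:

    using LinearAlgebra

    K = 5
    A = randn(K, K)
    sigma = A * A' + I                               # random positive-definite sigma
    w = ones(K) / sqrt(K)

    alpha = 1 / sqrt(sum(sigma) / K) - 1
    Pw = ones(K, K) / K
    sigma_norm = (I + alpha * Pw) * sigma * (I + alpha * Pw)

    @assert w' * sigma_norm * w ≈ 1.0                # unit variance along w
    v = nullspace(Matrix(w'))[:, 1]                  # any v perpendicular to w
    @assert v' * sigma_norm * v ≈ v' * sigma * v     # variance unchanged off w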

From what I observed, this completely removes the oscillatory behavior, and the model stabilizes reasonably quickly (see the notebook). The topics seem similar to before, so it looks OK to me at first glance.

If this works, maybe the same thing should be done with vsq, but I didn't want to mess around with the OpenCL code.

ericproffitt commented 6 years ago

So I took a look into this, and simply removing my Tikhonov regularization code fixed the problem of the long-term negative oscillatory behavior of the ELBO. Furthermore, because the covariance matrix portion of sigma is constructed using the variational means and not probability vectors:

    # Average posterior variances plus the (uncorrected) covariance of the
    # variational means:
    model.sigma = diagm(sum(model.vsq)) / model.M + Base.covm(hcat(model.lambda...), model.mu, 2, false)

sigma does not actually degenerate along the direction orthogonal to the simplex. Instead, it appears to degenerate along one of the coordinate axes in topic space. Here is a plot of the L1 norms of the eigenvectors of sigma for a 5-topic model.

Figure_1-1.pdf
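The diagnostic behind that plot is easy to reproduce; a sketch, assuming the fitted covariance is available as model.sigma:

    using LinearAlgebra

    # A near-zero eigenvalue whose unit eigenvector has L1 norm close to 1
    # (rather than sqrt(K)) indicates degeneracy along a coordinate axis.
    F = eigen(Symmetric(model.sigma))
    println("smallest eigenvalue: ", F.values[1])
    v = F.vectors[:, 1]
    println("L1 norm: ", norm(v, 1), "  dominant axis: ", argmax(abs.(v)))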

A quick experiment suggests that the degenerate topic tends to be the vacuous topic, e.g. from the NSF corpus:

topic 1          topic 2       topic 3      topic 4         topic 5
study            data          theory       research        research
research         project       research     study           university
species          study         problems     chemistry       support
plant            research      systems      high            program
cell             earthquake    study        studies         science
understanding    studies       project      materials       students
important        water         work         chemical        dr
studies          ocean         models       properties      award
project          field         analysis     structure       project
plants           time          system       surface         scientists
protein          models        methods      reactions       scientific
cells            provide       design       metal           provide
genetic          processes     equations    experimental    national
data             important     data         electron        engineering
gene             results       problem      systems         projects

However, it's difficult to say in general what mechanism governs which topic undergoes this degeneracy.

If regularization should be included, then enforcing unit variance in a direction non-orthogonal to the direction of degeneracy appears to be a suitable replacement for Tikhonov regularization, as your experiments indicate that it maintains convergence and topic coherence.

In fact, your choice of the simplex's normal vector is still a good choice even if the direction of degeneracy is always along one of the coordinate axes, since one does not know a priori which axis will degenerate, and this vector is angled symmetrically with respect to all of them.

I modified your projection code and experimented with enforcing unit variance along the axis of degeneracy, and was able to achieve a significantly larger ELBO, though still not as large as with no regularization at all.

Either way, the current Tikhonov code should probably be removed. I may go ahead and push a version of your variance normalization code to the package in the next few days.