bayespy / bayespy

Bayesian Python: Bayesian inference tools for Python
bayespy.org
MIT License

Gaussian emission HMM? #28

Closed ckemere closed 8 years ago

ckemere commented 8 years ago

I wanted to modify the example HMM to additionally estimate the mean and variance of the observations. Oddly, I find that there's rapid convergence to an uninformative model. Any thoughts?

Here's my code (which follows the data generation from the example):

import numpy as np

K2 = 10  # number of hidden states to fit (deliberately K2 >> K, the true number)
D = 2    # observation dimension

# N (chain length) and y (observations) come from the data generation in the example.

from bayespy.nodes import Dirichlet
a0 = Dirichlet(1e-3*np.ones(K2))
A = Dirichlet(1e-3*np.ones((K2, K2)))

from bayespy.nodes import CategoricalMarkovChain
Z = CategoricalMarkovChain(a0, A, states=N)

from bayespy.nodes import Gaussian, Wishart
mu_est = Gaussian(np.zeros(D), 1e-5*np.identity(D), plates=(K2,))
Lambda_est = Wishart(D, 1e-5*np.identity(D), plates=())

from bayespy.nodes import Mixture
Y = Mixture(Z, Gaussian, mu_est, Lambda_est)
Y.observe(y)

from bayespy.inference import VB
Q = VB(Y, mu_est, Lambda_est, Z, A, a0)
Q.update(repeat=1000)
jluttine commented 8 years ago

Hi! You need to initialize some of the parameters randomly in order to break the symmetry. Otherwise all mixture components are identical. I think this is the problem you're facing. So initialize the mean randomly:

mu_est.initialize_from_value(np.random.randn(K2, D))

Also, I'd give the variance parameter a better initialization:

Lambda_est.initialize_from_value(np.identity(D))

In order to update Z before mu_est, use:

Q = VB(Y, Z, mu_est, Lambda_est, A, a0)

Also, I noticed that you don't have mixture plates for Lambda_est. Maybe that's what you want, but if you want a different variance for each mixture cluster, use

Lambda_est = Wishart(D, 1e-5*np.identity(D), plates=(K2,))

Here is a complete working example:

import numpy as np

K2 = 10  # number of hidden states to fit
D = 2    # observation dimension

# Just some dummy data for now
N = 100
y = np.concatenate([np.random.randn(50, 2), 2 + np.random.randn(50, 2)], axis=0)

from bayespy.nodes import Dirichlet
a0 = Dirichlet(1e-3*np.ones(K2))
A = Dirichlet(1e-3*np.ones((K2, K2)))

from bayespy.nodes import CategoricalMarkovChain
Z = CategoricalMarkovChain(a0, A, states=N)

from bayespy.nodes import Gaussian, Wishart
mu_est = Gaussian(np.zeros(D), 1e-5*np.identity(D), plates=(K2,))
Lambda_est = Wishart(D, 1e-5*np.identity(D), plates=(K2,))  # <- different variance for each cluster?

from bayespy.nodes import Mixture
Y = Mixture(Z, Gaussian, mu_est, Lambda_est)
Y.observe(y)

# Random initialization to break the symmetry
mu_est.initialize_from_value(np.random.randn(K2, D))

# Reasonable initialization for Lambda
Lambda_est.initialize_from_value(np.identity(D))

from bayespy.inference import VB
Q = VB(Y, Z, mu_est, Lambda_est, A, a0)
Q.update(repeat=1000)

from bayespy import plot as bpplt
bpplt.hinton(Z)

If you want to make the learning less sensitive to the initialization, you can try using deterministic annealing: http://www.bayespy.org/user_guide/advanced.html#deterministic-annealing
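
As a rough sketch of what that looks like for this model (assuming the VB.set_annealing method described in the linked guide is available in your bayespy version):

# Deterministic annealing: gradually raise the annealing parameter towards 1,
# updating the model at each step.
Q = VB(Y, Z, mu_est, Lambda_est, A, a0)
beta = 0.1
while beta < 1.0:
    beta = min(beta * 1.5, 1.0)
    Q.set_annealing(beta)
    Q.update(repeat=100)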

I hope this helps (and works). Please don't hesitate to ask further questions on this.

ckemere commented 8 years ago

Beautiful. Thanks much! I was wondering how the initialization step was being taken care of!

Edit: For a classical Gaussian-emission HMM there should obviously be plates for Lambda. I was thinking the plates would imply covariance between the mixture components, but that was just my brain being confused. Thanks again for clearing that up.

It might be worth adding this to your tutorial. The data generation is here:

import numpy as np

# Simulated data
mu = np.array([[0, 0], [3, 4], [6, 0]])  # emission means
D = 2       # observation dimension
std = 2.0   # emission standard deviation
K = 3       # number of hidden states
N = 200     # number of samples
p0 = np.ones(K) / K   # uniform initial state distribution
q = 0.9               # self-transition probability
r = (1 - q) / (K - 1)
P = q*np.identity(K) + r*(np.ones((K, K)) - np.identity(K))  # transition probability matrix

# Run the simulation
y = np.zeros((N, D))
z = np.zeros(N, dtype=int)
state = np.random.choice(K, p=p0)
for n in range(N):
    z[n] = state
    y[n, :] = std*np.random.randn(D) + mu[state]
    state = np.random.choice(K, p=P[state])
jluttine commented 8 years ago

After I fix this issue: https://github.com/bayespy/bayespy/issues/30, you could initialize Z randomly with Z.initialize_from_random() and then update the mean and variance before Z. That would be better, in my opinion. Also, some other steps can be used to improve the accuracy of the VB approximation and to reduce the sensitivity to the initialization, but they make things a little more complex. First, as I mentioned, you could use deterministic annealing. Second, you could use GaussianWishart or GaussianGammaARD nodes to model the mean and variance in a single node. If you want a diagonal variance, you can use Gamma and GaussianARD. I can give more details another time.
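
A minimal sketch of that alternative ordering (conditional on the fix in issue #30; the update order is simply the order in which the nodes are listed to VB, as above):

# Initialize the hidden states randomly instead of the means, then update
# the emission parameters before Z.
Z.initialize_from_random()
Q = VB(Y, mu_est, Lambda_est, Z, A, a0)
Q.update(repeat=1000)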

ckemere commented 8 years ago

The annealing seems to make things a bit more stable, but interestingly, training with two sequences makes the estimation much more stable. Presumably this is because a0, the initial state distribution, is arbitrary in the one-sequence case (leaving symmetries unbroken), but actually has evidence in the two-sequence case.

RylanSchaeffer commented 4 years ago

@jluttine how does one convert the transition_probabilities matrix of the HMM (A, in your above example) into a valid transition kernel? I'm not actually sure what the matrix contains - Dirichlet concentration parameters?

jluttine commented 4 years ago

@RylanSchaeffer A is a matrix of transition probabilities. Each row corresponds to a probability vector that sums to one. These vectors are given a Dirichlet prior, which is a distribution over probability vectors. The node A corresponds to the unknown matrix of transition probabilities, so it's first given a Dirichlet prior, and after fitting it contains the approximate posterior distribution of the transition probabilities (which is also a Dirichlet distribution).

Does bp.nodes.Categorical(A).get_moments()[0] give you what you want? It gives the posterior probability of state transitions.
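
For example, assuming bayespy is imported as bp and numpy as np, and using the model above:

# Each row of P_post is a normalized posterior estimate of the transition
# probabilities out of that state, so each row sums to one.
P_post = bp.nodes.Categorical(A).get_moments()[0]
assert np.allclose(P_post.sum(axis=-1), 1.0)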

RylanSchaeffer commented 4 years ago

Yes, bp.nodes.Categorical(A).get_moments()[0] is exactly what I was looking for. I thought I should be able to extract the transition probabilities from the posterior A directly.

jluttine commented 4 years ago

@RylanSchaeffer Yeah, you can get it from A too. You can extract the parameters of the Dirichlet distribution and do whatever you want with those. Or you can get the posterior moments of the variable, but those contain only <log(A)>. So, this was just a somewhat "hackish" way of getting a normalized version of the posterior mean.

RylanSchaeffer commented 4 years ago

@jluttine how does one extract the posterior over latent states from CategoricalMarkovChain? The object has an array with shape (length of chain, number of latent variables, number of latent variables). I don't see a clear answer in the documentation (https://www.bayespy.org/user_api/generated/generated/bayespy.nodes.CategoricalMarkovChain.html) or the example (https://www.bayespy.org/examples/hmm.html).

Edit: This seems so simple. I must be missing something obvious.

RylanSchaeffer commented 4 years ago

In general, it feels like most of the tutorials stop after inferring the posterior over the parameters but don't explain how to retrieve it. This is true of the multinomial tutorial (https://www.bayespy.org/examples/multinomial.html) and the GMM tutorial (https://www.bayespy.org/examples/gmm.html), I think.

RylanSchaeffer commented 4 years ago

@jluttine I just realized you're the author of tikz-bayesnet!! Wow!!!

jluttine commented 4 years ago

@RylanSchaeffer The posterior distribution is represented by sufficient statistics (moments) and natural parameters. You can get the moments with the get_moments() method for any node. For a random variable (i.e., not a deterministic node), you can get the natural parameters via the phi attribute, if I remember correctly. So, phi contains the parameters of the posterior distribution. These two representations have the same shape. For CategoricalMarkovChain, both are lists of two elements: the first element corresponds to the initial state and the second element is a large array containing all the transitions. In order to interpret and use these correctly, you need to understand the exact definition of the probability distribution the node is using. Unfortunately, it seems that the documentation doesn't necessarily cover these formulations of the distributions...

But if you're happy with, for instance, the posterior expected values of the hidden states, those are easy to get. The first element in the moments list is <z_0>. The second element is an array containing all expectations of the form <z_{n} z_{n+1}^T>. So, for that array, you just need to marginalize (i.e., sum) over the second-to-last axis and you'll get the expectations of z_n for n=1,.... (I'm not 100% sure; I'm just writing from memory without checking anything at the moment.)
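
As a rough sketch of that recipe (the shapes are taken from the description above, so double-check them against your bayespy version; numpy is assumed imported as np):

# The moments of Z are a list [<z_0>, <z_n z_{n+1}^T>].
u = Z.get_moments()
p_first = u[0]                  # shape (K,): posterior expectation of the initial state
pairwise = u[1]                 # shape (N-1, K, K): pairwise expectations <z_n z_{n+1}^T>
# Summing over the second-to-last axis marginalizes z_n away, leaving <z_{n+1}>.
p_rest = pairwise.sum(axis=-2)  # shape (N-1, K)
p_states = np.concatenate([p_first[None, :], p_rest], axis=0)  # shape (N, K)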

And yes, the documentation could definitely use more information. Pull requests for such improvements are of course most welcome. But what exactly would you want to extract in this case? That is, what kind of representation of the posterior are you expecting or hoping to get? Some non-natural but widely used/standard parameterizations will probably require some manual conversions. Those could of course be implemented node-by-node as needed, but I haven't done that comprehensively.

Yeah, I'm an author of tikz-bayesnet together with Laura Dietz. Glad if the package has been useful for you!