brain-research / l2hmc

TensorFlow implementation for training MCMC samplers from the paper: Generalizing Hamiltonian Monte Carlo with Neural Networks
Apache License 2.0

Mixture of Gaussians #4

Open Razcle opened 6 years ago

Razcle commented 6 years ago

Hi,

I'm trying to replicate the Mixture of Gaussians results from your paper but can't seem to get it to mix between the two modes. I've tried with two Gaussians at (-10, 0) and (10, 0), each with a diagonal covariance with ones down the diagonal. I've also tried the same two Gaussians from the paper.

I'm currently using the same network structure as in your IPython notebook and am training for 5000 iterations with 200 samples per iteration.

I've tried adding a temperature and annealing that as well but that hasn't seemed to help.

It also seems to be running extremely slowly, taking 30 minutes to train on CPU. Is this normal, or am I doing something wrong?

I've attached my code and images of the samples drawn after training. As you can see, the samples only explore one of the two modes.

many thanks!

[Image: mogplot]

[Attachment: mogexperiments.py.zip]

daniellevy commented 6 years ago

Hello,

Would you mind pasting your code in a gist so I can look at it more carefully? If I remember correctly, for my experiments I started the annealing at T = 5 and decreased it by a factor of 0.98 every 100 or 200 iterations. In total, I think the MoG required around 20k iterations. However, the Gaussians were closer and the variance was smaller (the modes were still ~10 standard deviations apart), so that might impact the hyper-parameters.
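Concretely, the schedule was roughly the following (a sketch rather than the exact code I used; clamping at T = 1 is just the natural stopping point, since T = 1 recovers the true target):

def temperature(step, t_init=5.0, decay=0.98, decay_every=100):
    # Start at t_init and multiply by `decay` every `decay_every` training
    # steps, never dropping below T = 1 (the true target distribution).
    return max(1.0, t_init * decay ** (step // decay_every))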

The slowness is really abnormal; on CPU for a 2-d problem your sampler should be running at 5k iterations/min approximately.

Did you visualize the intermediate chains (when the temperature is high) to debug?

Let me know!

Best, Daniel

Razcle commented 6 years ago

Hi Daniel,

Thanks for the response and for sharing the code. If you trained for 20k iterations, you may want to make a correction to Appendix C of your paper, where you say:

"In Section 5.1, the Q, S, T are neural networks with 2 hidden layers with 10 (100 for the 50-d ICG) units and ReLU non-linearities. We train with Adam (Kingma & Ba, 2014) and a learning rate α = 10−3. We train for 5, 000 iterations with a batch size of 200"

Do you know if you used a different batch-size or made any other changes as well?

On the point of slowness: I re-downloaded the repo without any of my changes and ran the example notebook. I found that training the 5,000 iterations for the strongly conditioned Gaussian took 491s, or roughly 8 minutes. This seems to be roughly 10x slower than your experience. Do you mind re-running and checking that the problem really is at my end?

I re-ran the MoG experiment with exactly the same Gaussians as in your paper for 5000 iterations with the temperature schedule suggested above. I found it took 17.5 minutes to run 5000 iterations, so again I'm about an order of magnitude slower than you suggest.

Here is the initial distribution:

[Image: init_mog]

and here is an example chain sampled after training:

[Image: final_mog_samples_with_temp]

Here is a gist for the code:

https://gist.github.com/Razcle/7eaac31a171ad43cb1e6443ea1b7eb66

If I've done something wrong in running it, I'd really appreciate a second pair of eyes.

many thanks,

Raza

daniellevy commented 6 years ago

As I said, when annealing you naturally need more iterations, so try switching from 5k to 20k. Have you looked at intermediate chains? In the limit where you anneal very, very slowly, you should always be able to mix between modes.

I might try adding a notebook in the next few days to show a working L2HMC on GMM.

Razcle commented 6 years ago

ok thanks!

I'll run it for longer and report back. Thanks for planning to add a new notebook as well.

What should I be looking for in the intermediate chains?

Any idea on how to accelerate running times? Any chance you could run the example notebook and just let me know roughly how long it takes for you? I just want to figure out if there is something wrong with my set-up.

Thanks.

saforem2 commented 6 years ago

Hello,

I've used the gist you provided to test the GMM model, and I believe the issue is on line 81, where you have:

samples = np.random.randn(n_samples, x_dim)

which draws n_samples points, each of dimension x_dim, from the standard normal distribution.

Instead, we want our samples to be drawn from the specified distribution (in this case, the GMM), so the line should read:

samples = distribution.get_samples(n_samples)

I've verified that with samples defined this way, it works as expected.

Razcle commented 6 years ago

Ok thanks for taking a look.

In line 81, we're just looking to initialise our MCMC chain.

While it's definitely true that initialising on samples from the true target distribution will make this work better (since we don't need to burn in at all), it's not very realistic. Ordinarily, when using the sampler, I won't be able to do this; if I could, I wouldn't need the sampler in the first place. That works for this toy problem, but it won't work as a solution in general.

So I don't think it's a good idea to change the initialisation on that line. I'd be happy to initialise from zeros, though.
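Just to spell out the options under discussion (a sketch; distribution is the GMM object and n_samples, x_dim are as defined in the gist):

import numpy as np

n_samples, x_dim = 200, 2

# Option 1: standard-normal initialisation (what line 81 of the gist does now)
samples = np.random.randn(n_samples, x_dim)

# Option 2: draw the initial state from the target itself -- fine for a toy
# problem, but unrealistic in general, since we usually can't sample the target
samples = distribution.get_samples(n_samples)

# Option 3: start every chain at the origin instead
samples = np.zeros((n_samples, x_dim))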

While you're here, can I ask how long you've found it takes to run the samplers?

thanks

saforem2 commented 6 years ago

Ahh you're right actually, I apologize.

I'll have to double-check (I didn't save the times from my last run), but I believe 10,000 training steps with a batch size of 200 took roughly 40 minutes, which seems similar to your experience.

Edit:

After re-running my MoG code, using a chain initialized from zeros,

i.e.

n_steps = 10000
n_samples = 200
losses = []
samples = np.zeros((n_samples, x_dim))

I found that it took 523.97 seconds to train for the first 5,000 training steps and 504.33 seconds for the remaining 5,000 steps, i.e. 1028.3 seconds (~17 minutes) for the full 10,000 steps.

Using zero initialization, here are my results for an intermediate chain (after 5,000 training steps):

[Image: mog_mcmc_chain_1000t_5000train]

And after 10,000 training steps:

[Image: mog_mcmc_chain_1000t_10000train]

Compared to standard HMC with eps = 0.15, which is unable to mix between the modes:

[Image: mog_hmc_chain_e015_1000t]

Razcle commented 6 years ago

@saforem2 Did you use temperature annealing to get this to work, and if so, what schedule did you use?

Are the samples you plot from a single chain with the temperature set to 1, or at a different intermediate temperature? If they are samples at an intermediate temperature, then I don't think they represent samples from the correct target distribution.

Also, when you say an intermediate chain, what exactly do you mean? Do you mean that you paused training and ran an MCMC chain, or that those are the 200 samples from the 200 parallel chains we run during training?

Also are you running on GPU or CPU?

@daniellevy Any chance you've had time to put together a gist? I'm releasing an arxiv paper and I'd love to be able to compare against your method :)

saforem2 commented 6 years ago

@Razcle I am running on CPU.

I used temperature annealing, following @daniellevy's suggestion: starting at T = 5 and decreasing by a factor of 0.98 every 100 iterations.

For the intermediate chain, you're correct: I start with T = 5, train for 5,000 steps (using the above annealing schedule), then generate an MCMC chain at the most recent temperature from the schedule and plot the trajectory.

Using Gaussians centered at ±2 with covariance matrices of 0.1 along the diagonal, after 20,000 training steps I found that at dynamics.temp = 1. the trajectory was unable to mix between the two modes, similar to your result.

As a check, I then tried repeating the experiment using Gaussians defined as

means = [np.array([1.0, 0.0]).astype(np.float32), 
         np.array([-1.0, 0.0]).astype(np.float32)]
covs = [np.array([[0.05, 0.0],[0.0, 0.05]]), 
        np.array([[0.05, 0.0],[0.0, 0.05]])]
distribution = GMM(means, covs, [0.5, 0.5])

I trained on this model for 5,000 steps using a step size of eps=0.1, and a scale factor of scale=0.1 for the loss function.

I then generated 200 separate trajectories using the learned model and found that each of them was clearly able to mix between the two modes. I've included the first 100 steps of an example trajectory (again generated at dynamics.temp = 1.), shown below:

[Image: mog_mcmc_chain_100t_5000train_mixing1]
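For clarity, generating one of these trajectories is just repeated application of the learned transition, roughly as in the sketch below (l2hmc_step is a hypothetical stand-in for the sampling op built in the notebook, not the repo's actual API):

import numpy as np

def run_chain(l2hmc_step, x_init, n_steps=100):
    # x_init: initial positions of shape (n_chains, x_dim);
    # l2hmc_step maps the current batch of positions to the next one.
    chain = [np.asarray(x_init)]
    for _ in range(n_steps):
        chain.append(np.asarray(l2hmc_step(chain[-1])))
    return np.stack(chain)  # shape (n_steps + 1, n_chains, x_dim)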

Other relevant hyperparameters:

learning_rate = tf.train.exponential_decay(1e-3, global_step, 1000, 0.96,
                                           staircase=True)
n_samples = 200

Razcle commented 6 years ago

@daniellevy Since both @saforem2 and I have failed to replicate the result from the paper, could you please lend us a hand in getting this to work?

Thanks, Raza

jxy commented 6 years ago

Try starting with temperature T=10, so that the two Gaussians are only 4 σ apart.
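Back of the envelope: at temperature T we effectively sample p(x)^(1/T), which for a single Gaussian scales the variance by T, so the modes get closer in σ units. A quick check for the ±2, variance-0.1 setup above (a sketch, not repo code):

import numpy as np

def separation_in_sigmas(mean_gap, variance, temperature):
    # Tempering scales a single Gaussian's variance by T, its std by sqrt(T).
    return mean_gap / np.sqrt(variance * temperature)

print(separation_in_sigmas(4.0, 0.1, 1.0))   # ~12.6 sigma apart at T = 1
print(separation_in_sigmas(4.0, 0.1, 10.0))  # 4.0 sigma apart at T = 10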

saforem2 commented 6 years ago

@jxy is correct: using a higher starting temperature (T = 10 in this case) solves the issue, and even after just 5,000 training steps the trajectories are able to tunnel between the two modes.

Thanks!

[Image: mog_mcmc_chain_100t_5000train_mu2]

Razcle commented 6 years ago

That's great, thanks! I'll rerun the experiments and report back.

Razcle commented 6 years ago

@saforem2 One quick question: looking at your pictures, it would seem that the variance of your Gaussians is ~1, but your earlier message suggested you were using a variance of around 0.05?

What is the actual variance you used, or am I missing something obvious? Also, when you say you can tunnel between the two modes after 5,000 iterations, is that with the temperature set to 1?

R

Razcle commented 6 years ago

I've now tried annealing from a higher temperature (10) for 20,000 iterations, and I still don't see mixing between the modes of two isotropic Gaussians with means (-4, 0) and (4, 0). The variance is 0.1.

jxy commented 6 years ago

Try raising the temperature for the ±4 case. To keep the two Gaussians 4σ apart, start with temperature 40.
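Same back-of-the-envelope check as before for the ±4, variance-0.1 case (a sketch, not repo code):

import numpy as np

mean_gap, variance = 8.0, 0.1  # modes at (-4, 0) and (4, 0)
for T in (1.0, 10.0, 40.0):
    print(T, mean_gap / np.sqrt(variance * T))
# T = 1  -> ~25.3 sigma between the modes
# T = 10 -> ~8.0 sigma
# T = 40 -> 4.0 sigma, as suggested above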

saforem2 commented 6 years ago

@jxy @Razcle

Can confirm: using an initial temperature of temp = 40 and training for 10,000 steps, I've found that the learned trajectories are able to tunnel between the modes.

[Image: 2d_mog_trajectory_2000t_10000train_4]

Razcle commented 6 years ago

Thanks @saforem2, that looks great! What variance are you using? It still looks to be somewhere between 0.5 and 1?

Any chance you could pop your code in a gist, or even submit it as a pull request?

thanks again!

Raza

saforem2 commented 6 years ago

The variance is still 0.1. I've pushed the notebook for running the experiment to my forked repo; you can get the Jupyter notebook for the MoG model here.

I also added some logging to the original code to help visualize the loss and various other parameters in TensorBoard, so it might be worthwhile to submit a pull request.

Edit:

I just noticed you'll need to comment out the last line in the local imports cell (In [3]:): from utils.logging import variable_summaries

Also, just in case you're not familiar with TensorBoard, you can start it from the command line with `tensorboard --logdir='/path/to/log_dir/'`

where log_dir is defined in cell In [5]: of my notebook.

Razcle commented 6 years ago

Thanks again @saforem2, much appreciated. I realised why the spread seems so large: for a Gaussian with variance 0.1, the std is about 0.3, so you expect almost all the samples to fall within roughly ±0.9 of each mean, which is what we see.
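Quick numeric check (nothing repo-specific):

import numpy as np

variance = 0.1
std = np.sqrt(variance)  # ~0.316
print(std, 3 * std)      # ~0.32 and ~0.95: essentially all the mass lies within ~0.9 of each mean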