ethanluoyc / e2c-pytorch

E2C implementation in PyTorch
Apache License 2.0

Did you try some tests on the simulated inverted pendulum dataset? #1

Open ZhengYi0310 opened 6 years ago

ZhengYi0310 commented 6 years ago

Hey Ethan, my name is Yi Zheng. Currently I have also implemented E2C, in TensorFlow (following an existing TensorFlow e2c implementation). To test the model, I first turned off the dynamics part, so that the model is just a VAE, and tested it on MNIST; the results look alright:

[image: MNIST reconstruction results]

Then I used your code to generate the simulated pendulum images from gym and fed them to the model. The results, using the same Adam optimizer, with the hyperparameters set according to the paper and the same number of training epochs, are like this:

[image: inverted_pendulum_result]

So the VAE pretty much failed completely on the inverted pendulum dataset, and I double-checked my code but couldn't find any problem. And I couldn't figure out why the model works on one set of 0-1 images but not on another. Did you test your implementation on the inverted pendulum dataset? What do the results look like?

ZhengYi0310 commented 6 years ago

Looks like I scaled the images in the wrong place. Before, when loading the dataset in datasets.py, I divided the image array by 255. to convert it to the range 0~1.0. Now, instead of doing that, I scale the decoder output by 255., and that gives me the following, though I don't quite understand why this would happen:

[image: screenshot from 2017-10-30 20 23 36]
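In other words, something like this (a sketch only; raw stands in for one uint8 frame, not the repo's actual loading code):

import numpy as np
import torch

raw = np.random.randint(0, 256, (48, 48), dtype=np.uint8)  # stand-in frame

# Before: divide the loaded image array by 255 in datasets.py.
x = torch.from_numpy(raw).float() / 255.0

# Now: scale the decoder output by 255 instead (decoder_output is a stand-in).
decoder_output = torch.sigmoid(torch.randn(48, 48))
x_reconst = decoder_output * 255.0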

ZhengYi0310 commented 6 years ago

Hi, I still couldn't get this working properly. The model gave me good reconstructions once I changed the reconstruction loss from cross entropy to MSE, but doesn't that actually make the model an AE rather than a VAE? And the AE couldn't learn the latent embedding:

[image: screenshot from 2017-10-31 20 46 35]
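Concretely, the swap I made looks like this (a sketch; x and x_reconst stand in for a batch of targets and decoder outputs in [0, 1]):

import torch
import torch.nn.functional as F

x = torch.rand(100, 4608)          # stand-in targets in [0, 1]
x_reconst = torch.rand(100, 4608)  # stand-in decoder outputs in [0, 1]

bce = F.binary_cross_entropy(x_reconst, x, reduction='sum')  # original loss
mse = F.mse_loss(x_reconst, x, reduction='sum')              # what I switched to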

Sorry to bug you like this, but this is really confusing. Do you have any idea what is going on? Or can you reproduce the results from the VAE paper using your own implementation and datasets?

Thanks very much for your time and patience!

ethanluoyc commented 6 years ago

Hi, sorry for the late reply!

Regarding the bad-reconstruction post: I got that as well initially. Try to avoid exploding gradients. You can add BatchNorm, which helps a lot. Also, scale the images to be between 0 and 1. If you backprop with values multiplied by 255, you may get very large gradients, which would likely cause problems.
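For instance, something along these lines (a sketch only; the layer sizes here are illustrative, not necessarily what the repo uses):

import torch
import torch.nn as nn

# A small encoder with BatchNorm1d after each hidden layer; sizes are
# illustrative, not the repo's actual architecture.
encoder = nn.Sequential(
    nn.Linear(4608, 800),
    nn.BatchNorm1d(800),
    nn.ReLU(),
    nn.Linear(800, 800),
    nn.BatchNorm1d(800),
    nn.ReLU(),
)

# Scale raw pixel values to [0, 1] before feeding them in.
x = torch.randint(0, 256, (100, 4608)).float() / 255.0
h = encoder(x)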

We can reproduce the results to a reasonable extent. In fact, here is a visualization of what I get when I try to reproduce their result:

[image: newplot 3]

Judging from your plot, you have probably missed something somewhere.

ZhengYi0310 commented 6 years ago

Hi Ethan, thanks for the reply! Sorry for the confusion: when I say VAE/AE, I still mean the E2C model they proposed. More specifically, when computing the reconstruction error, would using MSE instead of cross entropy make the encoder and decoder in the E2C model behave more like an autoencoder, which does a deterministic mapping, rather than a variational autoencoder, which does a probabilistic mapping? I ended up using your implementation; here is my training script train_e2c.py. And I followed what you suggested: in my datasets.py, I divide each pixel value by 255.0 to scale the image to [0, 1], and then convert it from a grayscale image to a black-and-white binary image, as the following code shows:
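Roughly like this, as a sketch (assume img is a uint8 grayscale NumPy array; the exact binarization threshold I use may differ):

import numpy as np

def preprocess(img):
    """Scale a uint8 grayscale image to [0, 1], then binarize it."""
    scaled = img.astype(np.float32) / 255.0   # pixel values now in [0, 1]
    # Threshold to a black-and-white binary image; 0.5 is an assumed cutoff.
    return (scaled > 0.5).astype(np.float32)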

With 3000 training examples, batch size 100, and 350 training epochs, I get the following latent embedding:

[image: 350_training_latent_embedding]

This is still not as reasonable as the one you got: the joint state is not as separable as in your visualization, and the structure of the shape isn't quite right either. The reconstruction and one-step prediction during training look alright:

reconstruction: [image: 350_training_reconst]

prediction: [image: 350_training_prediction]

But the reconstruction and one-step prediction in the test phase (1000 test examples) are bad:

reconstruction: [image: 350_test_reconstruction]

prediction: [image: 350_test_prediction]

On the surface, it looks like the BatchNorm layers are not trained properly. So what I tried is commenting out every nn.BatchNorm1d in PendulumEncoder, PendulumDecoder, and PendulumTransition, and increasing the number of training epochs (from 250 to 1000). Here is the latent embedding I get:

[image: latent_embedding_training_nobatch]

The latent embedding looks slightly better than the previous one, but it still doesn't look like what you got; your result has a clearer boundary. I assume the color map in your figure also encodes joint position; what does the latent embedding colored by joint velocity look like in your result? I'm just not sure whether I wrote everything correctly and whether the model learns a meaningful latent embedding.

The reconstruction and prediction performance in the training phase degrade:

reconstruction: [image: reconst_training_nobatchnorm]

prediction: [image: predictin_trainig_nobatchnorm]

But the reconstruction and one-step prediction performance in the test phase are actually better than for the model with a BatchNorm1d layer after each ReLU():

reconstruction: [image: test_reconstrcut_nobatchnorm]

prediction: [image: predict_test_nobatchnorm]

Compared to the results in the paper, though, it's still not good: here there is already a decent amount of prediction uncertainty after only one step, whereas in the paper the model can perform 10-step predictions with very small uncertainty. And here is the latent embedding for the 1000 test images:

[image: test_latentembedding_nobatchnorm]

So, considering the different results with and without the BatchNorm layers, I'm still not sure whether the model learns a good latent embedding. Also, what do your reconstruction and one-step prediction results look like?

Thanks for your time and patience!

ethanluoyc commented 6 years ago

What else I would suggest is to do a sanity check of the loss function. This is an early version of the code I have; the loss function should be implemented correctly, though.

Here is some code I copy-pasted (I reimplemented a few things after making this version public):

import torch

def _KLDGaussian(Q, N, eps=1e-8):
    r"""KL divergence between two Gaussians,
        assuming Q ~ N(mu0, A \sigma_0 A') where A = I + vr^{T}
        and      N ~ N(mu1, \sigma_1).
    """
    sum_ = lambda x: torch.sum(x, dim=1)  # sum over the latent dimension
    k = float(Q.mean.size(1))  # dimension of the distribution
    mu0, v, r, mu1 = Q.mean, Q.v, Q.r, N.mean

    Qvar = Q.logvar.exp()
    Nvar = N.logvar.exp()
    Nlogstd, Qlogstd = N.logvar.mul(.5), Q.logvar.mul(.5)

    s02, s12 = Qvar + eps, Nvar + eps
    a = sum_(s02 * (1. + 2. * v * r) / s12) + sum_(v.pow(2) / s12) * sum_(r.pow(2) * s02)  # trace term
    b = sum_((mu1 - mu0).pow(2) / s12)  # difference-of-means term
    c = 2. * (sum_(Nlogstd - Qlogstd) - torch.log(1. + torch.clamp(sum_(v * r), 0, 1e5)))  # ratio-of-determinants term

    return 0.5 * (a + b - k + c)

def loss_function(model, X, U, X_next, opt):
    from torch.nn import BCELoss
    from pixel2torque.losses import kl_std_gaussian
    lambd = opt['lambd']
    X_reconst, z, Qz = model.encdec(X)
    X_next_reconst, z_next, Qz_next = model.encdec(X_next)

    # assert X.size() == X_next.size() == X_next_reconst.size()

    z_next_pred, Qz_next_pred = model.trans(z, Qz, U)
    X_next_pred_reconst = model.decode(z_next_pred)

    kl_next_pred = _KLDGaussian(Qz_next_pred, Qz_next).sum()

    # size_average=False sums over all elements (older PyTorch API;
    # equivalent to reduction='sum' in newer versions).
    X_reconst_loss = BCELoss(size_average=False)(X_reconst, X)
    X_next_pred_reconst_loss = BCELoss(size_average=False)(X_next_pred_reconst, X_next)

    KL_z = kl_std_gaussian(Qz.mean, Qz.logvar).sum()

    loss = (X_reconst_loss + X_next_pred_reconst_loss
            + KL_z.mul(opt['beta']) + kl_next_pred.mul(lambd))

    return loss, dict(loss=loss,
                      x_reconst=X_reconst_loss,
                      x_next_pred_reconst=X_next_pred_reconst_loss,
                      kl_z=KL_z,
                      kl_next_pred=kl_next_pred)
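
As a concrete version of the sanity check I mean, you can compare _KLDGaussian against the closed-form KL between two diagonal Gaussians in the special case v = r = 0 (a sketch; the Gaussian namedtuple below is just a stand-in for the distribution container used in the repo):

from collections import namedtuple

import torch

# Stand-in container exposing only the attributes _KLDGaussian reads.
Gaussian = namedtuple('Gaussian', ['mean', 'logvar', 'v', 'r'])

mu0, mu1 = torch.randn(4, 3), torch.randn(4, 3)
logvar0, logvar1 = torch.randn(4, 3), torch.randn(4, 3)
zeros = torch.zeros(4, 3)

# With v = r = 0, Q reduces to a plain diagonal Gaussian.
Q = Gaussian(mu0, logvar0, zeros, zeros)
N = Gaussian(mu1, logvar1, None, None)  # v, r are unused for N

kl = _KLDGaussian(Q, N)

# Closed-form KL between two diagonal Gaussians, summed over dimensions.
expected = 0.5 * torch.sum(
    (logvar0.exp() + (mu1 - mu0).pow(2)) / logvar1.exp()
    - 1.0 + logvar1 - logvar0, dim=1)
print(torch.allclose(kl, expected, atol=1e-4))  # should print True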

You may want to compare this with the one in this repo; I did, and it seems fine. Also, what hyper-parameters do you use?

ZhengYi0310 commented 6 years ago

Hi Ethan, thanks for the reply. I'll compare the loss function in your post with the one in the repo, although at first glance they look the same. All hyper-parameters of the network layers are kept at their defaults. For training, I use the Adam optimizer with learning rate 3e-4 and beta1 = 0.1 (these two values come from the E2C paper), and I also set beta2 to 0.1; the batch size is 100. Thanks!
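That is, the optimizer is set up like this (a sketch; model stands in for the actual E2C network):

import torch.nn as nn
import torch.optim as optim

model = nn.Linear(4, 4)  # stand-in for the E2C model
# lr and beta1 follow the E2C paper; beta2 = 0.1 is my own setting.
optimizer = optim.Adam(model.parameters(), lr=3e-4, betas=(0.1, 0.1))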