Training Objective - Githubissues

ghost commented 7 years ago

Hi, thanks for posting this code!

I'm not sure I understand the training objective you're using - is it a variational auto-encoder?

Is loss_seq the Kullback-Leibler divergence (line 141), and loss_lat_batch a reconstruction loss (line 113)?

If you've got a link to a paper or book which describes your code, that would be really appreciated.

Thanks very much for your help,

Aj

RobRomijnders commented 7 years ago

Hi Aj,

So in this project I experimented with the loss function. Indeed, I drew inspiration from the variational auto encoder.

However, in the variational auto encoder, we penalize the mu and sigma of each individual data point, as you can read in the paper. From a Bayesian modeling perspective, this makes sense: we model the generative process as starting with a sample from a unit Gaussian. During training, we maximize the lower bound, which includes the KL divergence between each point's latent distribution and the unit Gaussian.

I wanted to experiment with this, because I couldn't understand why every point should have zero mean. I aimed for an interpretable latent space, and all vectors at zero isn't so interesting. Hence, I designed a loss function that, on average, puts all the latent points in a unit Gaussian around the origin. I also motivated this from an information-theoretic point of view. If we allow the auto encoder to use the entire latent space, it can store arbitrarily much information. By pushing the points towards a small area around the origin, we force the auto encoder to abstract only the most useful information from the sample.
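To make the contrast concrete, here is a minimal NumPy sketch, not the repository's actual code. It assumes the encoder outputs a matrix `z_mu` of latent means (one row per sample) and, for the standard VAE term, a matching `z_logvar`; these names and both function names are hypothetical. The first function is the usual per-point KL penalty; the second only penalizes the batch-wide mean and variance of each latent dimension, so individual points are free to sit away from the origin.

```python
import numpy as np

def kl_per_point(z_mu, z_logvar):
    """Standard VAE penalty: KL(N(mu_i, sigma_i^2) || N(0, I)), averaged over samples."""
    kl = -0.5 * np.sum(1.0 + z_logvar - z_mu**2 - np.exp(z_logvar), axis=1)
    return kl.mean()

def kl_batch_moments(z_mu, eps=1e-8):
    """Batch-level penalty (assumed form): fit one Gaussian per latent dimension
    to the whole batch and take its KL divergence to N(0, 1), summed over dimensions."""
    mean = z_mu.mean(axis=0)
    var = z_mu.var(axis=0) + eps
    return 0.5 * np.sum(mean**2 + var - np.log(var) - 1.0)

rng = np.random.default_rng(0)
z_mu = rng.standard_normal((512, 2))   # latent means spread like a unit Gaussian
z_logvar = np.full_like(z_mu, -4.0)    # small per-point variance (assumed for the demo)

per_point = kl_per_point(z_mu, z_logvar)
batch_level = kl_batch_moments(z_mu)
```

With latent means spread like a unit Gaussian, the batch-level penalty is close to zero, while the per-point penalty is considerably larger because every point is pulled towards the origin individually.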

ghost commented 7 years ago

Hi Rob,

thank you very very much for the kind explanation :+1:

I think I've heard of a similar idea of sampling from the unit Gaussian sphere in GAN models, but not sure I understood it. It does sound very reasonable, allowing a non-zero bias, and seems to produce a clear latent vector separation (judging by the tSNE plot). Is there any paper which describes this in mathematical detail? I think that would help me to be sure I understand.

I'm interested in multivariate behavioural and physiological streaming data, like activity levels, heart rate, temperature and sleep patterns and the like. Unfortunately there aren't many such multivariate data-sets available - but I'll try to extend your code to reconstruct multiple streams - there's a simple data-set in the UCR archive - uWaveGestureLibrary_X,Y and Z, which I'll try asap.

RobRomijnders commented 7 years ago

Have you seen these datasets? Here and here

I don't know of a paper describing this in mathematical detail, but the deep learning book features a good chapter on auto encoders

ghost commented 7 years ago

Hi Rob, thanks for the links!

I've applied for the first one of those datasets, hopefully I'll get access in a couple of days. May I ask if you've got any experience of working with MIMIC-III? This benchmark seems interesting though it's in Theano, and has some code to load the dataset.

I guess for a fresh/raw unlabelled multivariate dataset, a simple sequence-to-sequence (non-variational) auto-encoder would be the simplest place to start?

I'm working in PyTorch now as it's easier to debug than TF; there's a recurrent auto-encoder with attention, available here. I just need to modify it for real values, instead of discrete tokens/words, and experiment with different methods to make its latent representation sparse/structured, like you've done in this repository.

As the book says, perfectly reconstructing the data (i.e. using a VAE loss) is not really that interesting; rather, finding a good compact/sparse representation, with features that are useful for semi-supervised classification, is my main goal.

RobRomijnders commented 7 years ago

No, I haven't worked with those datasets.

And yes, an auto encoder is probably your best start. But what is your aim with this?

ghost commented 7 years ago

As the data is "raw" and multi-modal and collected from individuals, the aim would be to in some sense separate/cluster the individuals in the latent space into those who remained healthy, and those who developed a disease.

For example, say we've collected multivariate behavioural and physiological streaming data, like activity levels, heart rate, temperature and sleep patterns and symptom scores for some disease, over a long period, say a year. As an example, say the disease is major depression. We could then subset these into weekly, or longer, time-series and examine what the characteristics of onset for the disease are. At what point can we find easily detectable clusters in the latent space that correspond to individuals who later developed the disease? In some sense it's like using exploratory data analysis for an early detection problem.
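As a rough illustration of the subsetting step, the sketch below cuts one individual's recording into non-overlapping weekly windows. It assumes one sample per day and four channels; the array shape, the channel count and the `weekly_windows` helper are made up for the example.

```python
import numpy as np

def weekly_windows(series, days_per_window=7):
    """Split a (n_days, n_channels) array into non-overlapping windows.

    Trailing days that do not fill a whole window are dropped.
    Returns an array of shape (n_windows, days_per_window, n_channels).
    """
    n_days, n_channels = series.shape
    n_windows = n_days // days_per_window
    trimmed = series[: n_windows * days_per_window]
    return trimmed.reshape(n_windows, days_per_window, n_channels)

# One year of daily values for 4 hypothetical channels
# (activity, heart rate, temperature, sleep).
rng = np.random.default_rng(0)
year = rng.standard_normal((365, 4))
windows = weekly_windows(year)
```

Each window can then be fed to the auto encoder as one sequence, so a single individual contributes one latent point per week.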

Does that help - I hope it's clear?

RobRomijnders commented 7 years ago

Wow, that sounds interesting. I think it could be influential if you can make it work.

A small piece of advice: I would advise you to also get a few labels. I know labels are hard to get in the medical world, but they might help you search for clusters. I think the characteristics you are looking for are not dominantly present in the data, so an auto encoder trained with L2 loss might not necessarily pick up on these signals, even though a small pattern could be present.

With some labelled data points, you could train a linear classifier on top of the encoder. Training with the gradient of this classifier will push apart the data samples from the different classes. That might make your latent space more interpretable.
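A minimal sketch of such a combined objective, under stated assumptions: `z` holds the encoder outputs, `W` and `b` are the weights of a hypothetical linear classifier, unlabelled samples carry the label -1, and `weight` balances the two terms. In practice this would be written in an autodiff framework so that the classifier's gradient flows back into the encoder; NumPy is used here only to show the objective itself.

```python
import numpy as np

def softmax_cross_entropy(logits, labels):
    """Mean cross-entropy of integer class labels under softmax(logits)."""
    shifted = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(labels)), labels].mean()

def semi_supervised_loss(x, x_recon, z, labels, W, b, weight=1.0):
    """L2 reconstruction loss on all samples plus a linear-classifier loss
    on the labelled ones. Unlabelled samples carry the label -1."""
    recon = np.mean(np.sum((x - x_recon) ** 2, axis=1))
    labelled = labels >= 0
    if not labelled.any():
        return recon
    logits = z[labelled] @ W + b
    return recon + weight * softmax_cross_entropy(logits, labels[labelled])

rng = np.random.default_rng(0)
x = rng.standard_normal((6, 10))
x_recon = x + 0.1                      # pretend reconstruction, off by 0.1 everywhere
z = rng.standard_normal((6, 3))        # pretend latent codes
W, b = np.zeros((3, 2)), np.zeros(2)   # untrained two-class linear classifier
labels = np.array([0, 1, -1, -1, 1, -1])

loss = semi_supervised_loss(x, x_recon, z, labels, W, b)
```

Only the three labelled samples contribute to the classification term, so a handful of labels is enough to add this signal on top of the reconstruction loss.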

For this, you might find the SSL with VAE paper interesting or the Ladder Network

ketyi commented 7 years ago

Hi @RobRomijnders,

Am I right that you are following the VRAE approach (https://arxiv.org/abs/1412.6581) with a modification where the sampling from the latent vector happens?

I still don't clearly understand the motivation. You are optimizing:

1.) For a minimal KL divergence between posterior on encoder and prior on z (which is the standard normal distribution)
2.) You are aiming to minimize the cross-entropy loss of the average of the samples from a standard normal distribution reparametrized by the posterior of the decoder and the input. (if I see it right)

RobRomijnders commented 7 years ago

Yes, that's right. The framework for Variational Auto encoders includes both those terms in the cost function. You can read about this in the original paper, Auto-Encoding Variational Bayes.

If you find that it doesn't train on your dataset, then try to introduce the KL-cost term only later on in training. I think people have published blogs and papers on tips & tricks to train variational auto encoders.
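One common way to introduce the KL term later is a warm-up schedule on its weight. The sketch below is an illustration only: the linear ramp, the step counts and the function names are assumptions, not something this repository implements.

```python
def kl_weight(step, start_step=1000, warmup_steps=5000, max_weight=1.0):
    """Weight on the KL term: 0 before start_step, then a linear ramp
    up to max_weight over warmup_steps."""
    if step < start_step:
        return 0.0
    progress = (step - start_step) / warmup_steps
    return max_weight * min(1.0, progress)

def total_loss(reconstruction_loss, kl_loss, step):
    """Reconstruction loss plus the KL loss scaled by the current warm-up weight."""
    return reconstruction_loss + kl_weight(step) * kl_loss
```

Early in training the model is then a plain auto encoder, and the pull towards the prior is phased in once reconstruction has started to work.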

shubhamagarwal003 commented 6 years ago

Hi @AjayTalati, I am also trying a similar approach of clustering the points in latent space but on a different dataset. My dataset contains a multidimensional time series. The data is collected from an accelerometer and gyroscope when users perform a certain task (playing a mobile game). I want to determine clusters of different player behavior. If you could point me to any resources (code, papers, etc.) that you followed for your problem, it would be really helpful.

Thanks.

tejaslodaya commented 6 years ago

To respond to @AjayTalati's question at the start, I would like to point out:

loss_seq is the reconstruction loss (binary cross entropy) between the reconstructed output and original input.
loss_lat_batch is the KL divergence loss.

Total loss is the sum of these two terms (the formula was attached as a screenshot).

Code in pytorch:

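import torch
import torch.nn.functional as F
# reduction='sum' replaces the deprecated size_average=False
BCE = F.binary_cross_entropy(recon_x, x.view(-1, 784), reduction='sum')
KLD = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())  # closed-form KL to N(0, I)
total_loss = BCE + KLD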

For more information -

See Appendix B from VAE paper: Kingma and Welling. Auto-Encoding Variational Bayes. ICLR, 2014 https://arxiv.org/abs/1312.6114
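For reference, the `KLD` line above corresponds to the closed-form KL divergence between a diagonal Gaussian posterior and the standard normal prior. Written with `logvar` as $\log \sigma_j^2$ and $J$ latent dimensions (the code additionally sums over the batch), it reads:

```latex
D_{\mathrm{KL}}\big(\mathcal{N}(\mu, \operatorname{diag}(\sigma^2)) \,\|\, \mathcal{N}(0, I)\big)
  = -\frac{1}{2} \sum_{j=1}^{J} \left( 1 + \log \sigma_j^2 - \mu_j^2 - \sigma_j^2 \right)
```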

CC: @RobRomijnders @ketyi

RobRomijnders / AE_ts

Training Objective #3