RasmussenLab / vamb

Variational autoencoder for metagenomic binning
MIT License

Roadmap to v3 #30

Closed. jakobnissen closed this issue 2 years ago.

jakobnissen commented 4 years ago

Roadmap to version 3

After gaining more experience with Vamb and a better theoretical understanding of how it works, I have identified a few areas where Vamb could potentially be made more accurate.

New model for encoding

VAEs are actually fairly bad at clustering. The problem is that the prior forces the contigs into one central group near the origin. Furthermore, even though the VAE conceptually should first infer the genome from the contig and then sample a contig from that genome, it disappointingly does not just return the inferred genome identity. This is remedied somewhat in the unsupervised clustering adversarial autoencoder (AAE), and apparently improved even more in Ge et al.'s elaborate dual-AAE setup. Pick a model that is fairly well established, easy and stable to train, and clusters automatically.

If the model requires the number of clusters to be specified beforehand, a generous estimate can be given. Only contigs assigned to a cluster above a high identity threshold should be binned (see Ge et al.). In a second step, clusters representing half genomes can then be merged - this should be much easier than clustering contigs (see the sketch below). The remaining un-binned contigs can then be recruited by ordinary means such as alignment.
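The merging step could look something like the minimal sketch below, assuming each cluster is summarised by a centroid (for example its mean latent representation) and that two clusters holding parts of the same genome end up with near-identical centroids. The cosine metric, the greedy single-linkage merging, and the threshold are illustrative placeholders, not tuned choices.

```python
# Minimal sketch of merging half-genome clusters by centroid similarity.
# `centroids` and `threshold` are hypothetical inputs; the metric and the
# greedy single-linkage merging are illustrative choices, not Vamb's method.
import numpy as np

def merge_close_clusters(centroids, threshold=0.05):
    """Merge clusters whose centroids have cosine distance below `threshold`.

    centroids: (n_clusters, n_features) array, e.g. mean latent vectors.
    Returns a list of sets of original cluster indices.
    """
    normed = centroids / np.linalg.norm(centroids, axis=1, keepdims=True)
    distance = 1.0 - normed @ normed.T          # pairwise cosine distances

    parent = list(range(len(centroids)))        # union-find over cluster indices

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(centroids)):
        for j in range(i + 1, len(centroids)):
            if distance[i, j] < threshold:
                parent[find(i)] = find(j)

    merged = {}
    for i in range(len(centroids)):
        merged.setdefault(find(i), set()).add(i)
    return list(merged.values())
```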

Projection of TNF to lower-dimensional space

The different TNFs are linearly dependent on each other, see Kislyuk et al. 2009. In particular, we expect that:

- the 256 tetranucleotide frequencies sum to one,
- a tetranucleotide and its reverse complement cannot be told apart, so their frequencies are equal, and
- any trinucleotide occurs as the prefix of a tetranucleotide as often as it occurs as the suffix (end effects aside).

These relations are not exactly true for finite-length contigs, but only because the TNF observed from a finite contig is an approximation of the underlying true TNF, which is what we might as well estimate.

This means there are only 103 independent TNFs. Projection to a 103-dimensional space can be done with a PCA using 103 components, but that makes it dataset-dependent. Alternatively, do as in Kislyuk et al.: build a matrix containing the linear constraints above, then compute its kernel (null space) and project the TNF onto it. That will be dataset-independent.

When you do, remember that you should NOT normalize after projection. If you change the variance, the model will spend a lot of energy attempting to reconstruct previously low-variance dimensions, essentially noise.
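A rough, dataset-independent version of this could look like the sketch below: write the constraints listed above as rows of a matrix over the raw 256 tetranucleotides, take the kernel with an SVD, and project. This is a sketch of the idea only; the constraint encoding, thresholds and helper names are illustrative rather than the eventual v3 implementation.

```python
# Sketch of the dataset-independent projection: encode the linear constraints
# on the raw 256 tetranucleotide frequencies as rows of a matrix, compute its
# kernel (null space) via SVD, and project the observed TNF onto the kernel.
# The constraint set and names are illustrative.
import itertools
import numpy as np

BASES = "ACGT"
KMERS = ["".join(p) for p in itertools.product(BASES, repeat=4)]
INDEX = {kmer: i for i, kmer in enumerate(KMERS)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def revcomp(kmer):
    return kmer.translate(COMPLEMENT)[::-1]

rows = []

# Constraint 1: the 256 frequencies sum to one.
rows.append(np.ones(256))

# Constraint 2: a k-mer is indistinguishable from its reverse complement,
# so f(w) - f(revcomp(w)) = 0 for every non-palindromic k-mer pair.
for kmer in KMERS:
    rc = revcomp(kmer)
    if kmer < rc:
        row = np.zeros(256)
        row[INDEX[kmer]], row[INDEX[rc]] = 1.0, -1.0
        rows.append(row)

# Constraint 3: every trinucleotide occurs as a prefix of a 4-mer as often
# as it occurs as a suffix (up to end effects).
for trimer in ("".join(p) for p in itertools.product(BASES, repeat=3)):
    row = np.zeros(256)
    for base in BASES:
        row[INDEX[trimer + base]] += 1.0
        row[INDEX[base + trimer]] -= 1.0
    rows.append(row)

constraints = np.vstack(rows)

# The kernel is spanned by the right singular vectors with (near-)zero
# singular values, plus the vectors beyond the matrix rank.
_, singular_values, vt = np.linalg.svd(constraints)
rank = int(np.sum(singular_values > 1e-8))
kernel = vt[rank:].T                       # shape (256, n_free_dimensions)

def project(tnf):
    """Project raw (n_contigs, 256) frequencies onto the kernel.
    Note: the result is NOT re-normalized, per the reasoning above."""
    return (tnf - 1.0 / 256) @ kernel      # subtract uniform expectation, stays dataset-independent
```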

It is very, very easy for a neural network to learn this linear relationship, so you basically only save some memory. But hey, why not?

Review reconstruction loss functions

In VAEs, AAEs and dual AAEs, the reconstruction loss is a proxy for the expected log-likelihood of the input contig given the inferred genome.

When reviewing the loss functions, you can randomly pick reference genomes and sample contigs from them, normalize, then check whether the distribution of the quantity minimized as reconstruction error matches the observed likelihood distribution.
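As a concrete toy version of that check, one could do something like the sketch below: simulate a reference sequence, sample fragments at a few lengths, and look at how the chosen error measure (here plain MSE against the source's own TNF, as a stand-in for the real reconstruction term) is distributed per length. The uniform-random "genome" and the MSE are placeholders for real reference genomes and the actual loss.

```python
# Toy sketch of the check: sample fragments of various lengths from a
# (simulated) reference, compute their TNFs, and inspect the distribution of
# the error measure against the reference's own TNF. The uniform-random
# "genome" and the MSE stand-in are illustrative only.
import itertools
import random
import numpy as np

BASES = "ACGT"
INDEX = {"".join(p): i for i, p in enumerate(itertools.product(BASES, repeat=4))}

def tnf(sequence):
    counts = np.zeros(256)
    for i in range(len(sequence) - 3):
        counts[INDEX[sequence[i:i + 4]]] += 1.0
    return counts / max(counts.sum(), 1.0)

random.seed(1)
genome = "".join(random.choice(BASES) for _ in range(1_000_000))
genome_tnf = tnf(genome)

for length in (2_000, 10_000, 50_000):
    errors = []
    for _ in range(100):
        start = random.randrange(len(genome) - length)
        fragment = genome[start:start + length]
        errors.append(np.mean((tnf(fragment) - genome_tnf) ** 2))
    print(f"{length:>6} bp: mean TNF MSE {np.mean(errors):.3g}, sd {np.std(errors):.3g}")
```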

In particular, TNF gets more precise as contigs get longer. The NLL of the assumed Gaussian N(mu, sigma) scales with 1/(2 sigma), taking sigma to be the variance. So estimate the sigma of the Gaussians as a function of contig length, and scale the loss with 1/sigma.
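A minimal sketch of what that could look like in the training loop, assuming the per-contig error variance falls roughly as 1/length; the functional form sigma(L) = a/L and the constant `a` are assumptions to be fitted from the empirical check above, not measured values:

```python
# Sketch of length-aware weighting of the TNF reconstruction term. The
# variance model sigma(L) = a / L and the constant `a` are assumptions to be
# fitted from data, not measured values.
import torch

def scaled_tnf_loss(tnf_in, tnf_out, lengths, a=1.0):
    """Squared TNF error per contig, scaled by 1/sigma(length).

    tnf_in, tnf_out: (batch, n_tnf) observed and reconstructed TNF.
    lengths:         (batch,) contig lengths in basepairs.
    """
    sigma = a / lengths.float()                       # assumed variance of the TNF estimate
    squared_error = (tnf_out - tnf_in).pow(2).sum(dim=1)
    return (squared_error / (2.0 * sigma)).mean()     # longer contigs get more weight
```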

jakobnissen commented 4 years ago

Done with the TNF projection, see the v3 branch. The memory saving for TNF is 33/136 dimensions, around 1/4. No change in training speed, and the projection itself costs almost no time or memory. As a bonus, the network appears to learn TNF more effectively (or at least faster). I'll take it.

jakobnissen commented 2 years ago

We have decided to optionally allow a VAE/AAE ensemble model, which I will track in a separate issue. Testing has shown that re-inventing the loss function is too much work for now, since it would require redoing the hyperparameters.