yamada321 opened this issue 1 year ago (Open)
Dear @yamada321
> Say in a binning job, as auxiliary information from alignment, we know that some contigs are from similar strains or samples, with similar abundances [ ... ]
Sorry, but I don't understand what you mean. Are you asking if it would theoretically be possible to enhance Vamb in the future, by adding information from aligning the contigs to a database of known organisms? Yes, that should be possible and would indeed be a major improvement to Vamb. SemiBin does this, and in general creates better bins than Vamb (though, in our testing, slightly worse bins than our complete VAE-AAE Snakemake workflow, which you can find in the `workflow_avamb` directory).
Indeed, we have just begun work to add alignment information to binning, and have a working prototype which shows remarkable improvement. Unfortunately, in my experience making a robust tool is 10x harder than making a working prototype so it might take some time before we release it (as Vamb v5 or whatever).
> For Vamb as well as several other binners operating on phased long-read assemblies, there seems to be a tendency to create complete but highly contaminated bins.
I haven't seen that myself, but I bet it's true. Binners are largely benchmarked and developed with short-read assemblies in mind, which have different properties from long-read assemblies, so their performance might be worse on long reads. This is changing: long-read technology has been on the horizon for 15 years and is perhaps now cost-effective compared to Illumina. Its advantages for metagenomics are obvious. GraphMB - which is built on top of Vamb - has been tuned for Nanopore metagenomics assemblies, and outperforms Vamb on this kind of data.
I myself am convinced that Nanopore or HiFi reads, or something like them, will completely replace Illumina for metagenomics before the end of the 2020s, so we should begin developing long-read tooling now.
> Post-processing bins based on the said auxiliary info is not straightforward, but there might be a chance to resolve this if done in the latent space. I'm not sure if this gut feeling is real or not.
I don't think postprocessing should be done in the latent space. If the information is in the latent space, then the binner itself should do a better job of binning. Ideally, binning should move towards a two-step approach where it constructs rough bins by clustering, then refines them (perhaps iteratively). Fundamentally binning is a clustering problem necessitating a clustering step, but bin refinement is much easier after the first draft set of bins have been created - for example, phylogenetic analyses are much more powerful than the signals we use for binning, and can be used for bin refinement, but not really for clustering.
Overall, I think binning techniques currently are pretty primitive and that we, as a field, could do much better.
> Also, for clustering in the latent space, have you looked at tSNE or UMAP?
If I understand correctly, neither tSNE nor UMAP is capable of clustering? As far as I know, they reduce the number of dimensions, and that's it. Clusters are indeed visible in the latent space when using UMAP or PCA. Unfortunately, the nature of the Vamb latent space means the contigs don't lie on a neat lower-dimensional manifold. Indeed, the Kullback-Leibler divergence loss in the VAE incentivises the network to make use of all available dimensions, so I'm skeptical that dimensionality reduction is a good idea.
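One quick way to sanity-check this claim on any trained latent space (a sketch of my own, not something from Vamb itself): measure how evenly variance is spread across the principal axes of the latent means. If the KL term really pushes information into every dimension, no small subset of axes will dominate, and any 2-D projection (tSNE, UMAP, PCA) must discard most of the structure.

```python
import numpy as np

def latent_dimension_usage(mu):
    """Fraction of total variance carried by each principal axis of the
    latent means (PCA explained-variance ratios, sorted descending).
    A flat profile means no low-dimensional manifold to project onto."""
    centered = mu - mu.mean(axis=0)
    # Singular values of the centered data matrix give the PCA variances
    s = np.linalg.svd(centered, compute_uv=False)
    var = s ** 2
    return var / var.sum()

# Example: an isotropic 32-D latent space. Variance is spread evenly,
# so the top two axes explain only a small fraction of the total.
rng = np.random.default_rng(0)
ratios = latent_dimension_usage(rng.normal(size=(1000, 32)))
```

If `ratios[:2].sum()` is small, a 2-D embedding of that latent space is necessarily lossy, regardless of which reduction method is used.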
Thank you very much for the reply!
> Are you asking if it would theoretically be possible to enhance Vamb in the future, by adding information from aligning the contigs to a database of known organisms?
Yes, but the tricky part is that the info only contains dissimilarities and we can't assume similarities. For example, we may know contigs `a` and `b` are from two related strains, and therefore must not be placed in the same bin. Similarly for contigs `b` and `c`. However, this says nothing about the relationship between `a` and `c`.
I have the same issue with integrating the assembly graph into binning. Binners that make use of assembly graphs seem to usually treat them as a source of similarity. However, in long-read assembly, especially attempts at phasing, a simple bubble structure such as E={(a,b), (a,c), (b,d), (c,d)} and V={a,b,c,d} might actually carry dissimilarity information for the binner: maybe `a` and `d` are regions collapsed between haplotypes, and we shouldn't put `b` and `c` in the same bin. When there are nested bubbles and erroneous edges, I'm afraid examining topology alone won't suffice.
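To make the bubble example concrete, here is a small stdlib-only sketch (the function name and graph encoding are hypothetical, not from any binner) that finds simple source → {x, y} → sink bubbles and emits the two branches as a cannot-link pair:

```python
from collections import defaultdict
from itertools import combinations

def bubble_cannot_links(edges):
    """Find simple bubbles in a directed assembly graph: a source node
    with two branch successors that share a sink. The two branches are
    putative haplotypes, so return them as cannot-link constraints."""
    succ = defaultdict(set)
    for u, v in edges:
        succ[u].add(v)
    cannot = set()
    for source, branches in succ.items():
        for x, y in combinations(sorted(branches), 2):
            # A shared successor closes the bubble: source -> {x, y} -> sink
            if succ[x] & succ[y]:
                cannot.add((x, y))
    return cannot

# The bubble from the example above: a and d are collapsed regions,
# b and c are the haplotype branches.
edges = {("a", "b"), ("a", "c"), ("b", "d"), ("c", "d")}
print(bubble_cannot_links(edges))  # {('b', 'c')}
```

As noted above, this only handles the clean case; nested bubbles and spurious edges would need something more robust than pure topology.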
I'm not sure whether HiFi phased assembly is uniquely hard to bin, or whether other data types have the same issue. The first few lines of Vamb bins (not GitHub head) evaluated with CheckM1 look like:
| Bin  | Lineage     | Completeness | Contamination | Heterogeneity | Genome size | # contigs |
|------|-------------|--------------|---------------|---------------|-------------|-----------|
| 5    | k__Bacteria | 100          | 212.61        | 39.79         | 7732158     | 72        |
| 1602 | root        | 100          | 232.39        | 94.81         | 7216099     | 144       |
| 147  | k__Bacteria | 100          | 369.69        | 31.21         | 14208740    | 171       |
| 1412 | root        | 100          | 467.19        | 95.75         | 13606403    | 426       |
| 1319 | root        | 100          | 274.1         | 98.69         | 4208226     | 61        |
| 218  | k__Bacteria | 98.31        | 1.39          | 27.27         | 2961013     | 1         |
Similar for GraphMB and SemiBin. MetaBAT2 can turn out less contaminated, but only by making many single-contig bins; a MetaBAT2 bin is very likely wrong when it recruits more than three contigs.
> bin refinement
I found it surprisingly non-straightforward to split contaminated bins without knowledge from downstream evaluation (i.e. no single-copy gene info), but I've since lost the examples. It could be that I went about it the wrong way. Thank you for the comments.
> tSNE and UMAP clustering
Sorry, I meant visual cues. I probably had issues training my primitive VAE model. The following is not directly related to vamb, but I would appreciate suggestions:
I'm very new to PyTorch, so one sanity check was to take MNIST and flatten the (28, 28) images to (784,) vectors. The expectation was that, if tSNE's 2D embedding of the raw data does not visually contradict the data's labelling, then tSNE's 2D embedding of a VAE latent space of the same data should behave similarly.
It worked out as expected for MNIST. When I moved on to tetranucleotide profiles, however, the VAE seems to have failed to train, judging by this check. My VAE was very barebones: just fully connected encoder hidden layers, reparameterization, and again fully connected decoder hidden layers. The loss function was simply cross entropy for the reconstruction error plus KL divergence for regularization. Adjusting the minibatch size and/or learning rate didn't seem to fix it.
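For reference, a minimal PyTorch VAE of the kind described above might look like the sketch below (layer sizes are hypothetical). One thing worth noting: TNF profiles are z-normalised real values rather than probabilities, so a squared-error reconstruction term is arguably a more natural fit than cross entropy, and Vamb itself uses a squared-error term for its TNF reconstruction.

```python
import torch
from torch import nn

class TinyVAE(nn.Module):
    """Barebones fully connected VAE: encoder, reparameterization, decoder.
    Default input size 136 = number of canonical tetranucleotides;
    hidden/latent sizes are arbitrary choices for illustration."""
    def __init__(self, n_in=136, n_hidden=64, n_latent=16):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(n_in, n_hidden), nn.ReLU())
        self.to_mu = nn.Linear(n_hidden, n_latent)
        self.to_logvar = nn.Linear(n_hidden, n_latent)
        self.dec = nn.Sequential(nn.Linear(n_latent, n_hidden), nn.ReLU(),
                                 nn.Linear(n_hidden, n_in))

    def forward(self, x):
        h = self.enc(x)
        mu, logvar = self.to_mu(h), self.to_logvar(h)
        # Reparameterization trick: sample z = mu + sigma * eps
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()
        return self.dec(z), mu, logvar

def vae_loss(x, xhat, mu, logvar, beta=1e-2):
    # Squared error for the real-valued TNF reconstruction,
    # analytic KL divergence against a standard normal prior.
    rec = (x - xhat).pow(2).sum(dim=1).mean()
    kl = -0.5 * (1 + logvar - mu.pow(2) - logvar.exp()).sum(dim=1).mean()
    return rec + beta * kl
```

Using cross entropy on data that isn't a probability distribution could by itself explain a loss that stalls early, so swapping the reconstruction term might be worth trying before reaching for a bigger model.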
Maybe I need a more sophisticated model and loss like Vamb's to extract any useful information? I'm not trying to compete with Vamb's binning power; I just want to see whether adding (dis)similarity penalties to the loss function could make any difference...
Hi,
Thank you for the wonderful tool. I wonder if it is possible to modify the VAE's loss function in the following way.
Say in a binning job, as auxiliary information from alignment, we know that some contigs are from similar strains or samples, with similar abundances. We assume that this information is independent of TNF and abundances, and therefore not obvious to the VAE unless explicitly given. It also contains only dissimilarity and says nothing about similarity. Intuitively, this seems doable if we train the VAE on paired contigs (penalize if we know they are dissimilar, otherwise use the regular loss function). After a brief search, this seems to fall within the study of pairwise learning, and before that, metric learning (e.g. Mahalanobis).
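One minimal way such a penalty could be sketched (names, the margin, and the weighting are hypothetical, not anything Vamb implements): a hinge term on the latent means that pushes known-dissimilar pairs at least some margin apart, while pairs we have no information about contribute nothing, matching the one-sided nature of the signal.

```python
import torch
import torch.nn.functional as F

def pairwise_dissimilarity_penalty(mu, cannot_links, margin=5.0):
    """Hinge penalty on latent means: for each cannot-link pair (i, j),
    penalize encodings that lie closer than `margin` in latent space.
    Unconstrained pairs are untouched, so only dissimilarity is used."""
    if not cannot_links:
        return mu.new_zeros(())
    i, j = zip(*cannot_links)
    dist = (mu[list(i)] - mu[list(j)]).norm(dim=1)
    return F.relu(margin - dist).pow(2).mean()

# Per minibatch the total loss would then be something like:
#   loss = reconstruction_loss + beta * kl_divergence \
#          + gamma * pairwise_dissimilarity_penalty(mu, pairs_in_batch)
# where gamma weighs the auxiliary constraints against the ELBO terms.
```

This is essentially the repulsive half of a contrastive/metric-learning loss; the open question from above remains how to mine the cannot-link pairs in a batch-friendly way.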
For Vamb as well as several other binners operating on phased long-read assemblies, there seems to be a tendency to create complete but highly contaminated bins. Post-processing bins based on the said auxiliary info is not straightforward, but there might be a chance to resolve this if done in the latent space. I'm not sure if this gut feeling is real or not.
Have you tried something along this line, or how would you feel about it?
Also, for clustering in the latent space, have you looked at tSNE or UMAP (I'm aware clustering with tSNE might be weird)? I ask because I took Vamb's TNF profile and ran it through a simple VAE with the modified loss function. To avoid evaluating all clusters with CheckM, I fed the latent space to tSNE to see if obvious clusters evaluate to near-complete bins. This did not work out well; in fact, the VAE's per-epoch loss quickly stagnated and did not drop much. The training was very barebones (no minibatch, just shuffle and iterate), so I wonder if it's bad training, or whether the latent space actually doesn't cluster well with tSNE.
Thank you very much.