RasmussenLab / vamb

Variational autoencoder for metagenomic binning
MIT License

Test on MetaHIT dataset #77

Closed: zpf0117b closed this issue 3 years ago

zpf0117b commented 3 years ago

Hi, we tried the VAMB code provided at https://codeocean.com/capsule/1017583/tree/v1 and found that the command for the MetaHIT experiment was a little different: it has no separator. The command is `vamb --outdir results/metahit --fasta data/metahit/contigs.fna.gz --rpkm data/metahit/abundance.npz --cuda`. The annotation of this command says that it runs VAMB without multi-split (as here we had pooled assemblies) on the MetaHIT dataset.

I wonder:

  1. Why didn't you add a separator to this command? I'm confused about the "pooled assemblies" you mentioned.
  2. How can we run VAMB on the MetaHIT dataset in single-sample mode and in multi-split mode, as you did in Supplementary Figures 18-19 of the paper "Improved metagenome binning and assembly using deep variational autoencoders" (https://www.nature.com/articles/s41587-020-00777-4)?
  3. In the same paper, you reported the number of genomes (at strain level), species, and genera reconstructed with a precision of at least 95% using VAMB on MetaHIT in Supplementary Tables 2, 4, and 5. Did you obtain these results by running the command I referred to above?

PS: We ran the command `vamb --outdir results/metahit --fasta data/metahit/contigs.fna.gz --rpkm data/metahit/abundance.npz -o ref --cuda` (i.e., with the argument `-o ref` added as the separator) on the MetaHIT dataset and got the following results:

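| Prec. \ Recall | 0.3 | 0.4 | 0.5 | 0.6 | 0.7 | 0.8 | 0.9 | 0.95 | 0.99 |
|---|---|---|---|---|---|---|---|---|---|
| 0.3 | 98 | 76 | 71 | 58 | 47 | 32 | 20 | 9 | 0 |
| 0.4 | 98 | 76 | 71 | 58 | 47 | 32 | 20 | 9 | 0 |
| 0.5 | 98 | 76 | 71 | 58 | 47 | 32 | 20 | 9 | 0 |
| 0.6 | 98 | 76 | 71 | 58 | 47 | 32 | 20 | 9 | 0 |
| 0.7 | 98 | 76 | 71 | 58 | 47 | 32 | 20 | 9 | 0 |
| 0.8 | 98 | 76 | 71 | 58 | 47 | 32 | 20 | 9 | 0 |
| 0.9 | 98 | 76 | 71 | 58 | 47 | 32 | 20 | 9 | 0 |
| 0.95 | 98 | 76 | 71 | 58 | 47 | 32 | 20 | 9 | 0 |
| 0.99 | 98 | 76 | 71 | 58 | 47 | 32 | 20 | 9 | 0 |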

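| Prec. \ Recall | 0.3 | 0.4 | 0.5 | 0.6 | 0.7 | 0.8 | 0.9 | 0.95 | 0.99 |
|---|---|---|---|---|---|---|---|---|---|
| 0.3 | 80 | 63 | 58 | 49 | 40 | 27 | 17 | 8 | 0 |
| 0.4 | 80 | 63 | 58 | 49 | 40 | 27 | 17 | 8 | 0 |
| 0.5 | 80 | 63 | 58 | 49 | 40 | 27 | 17 | 8 | 0 |
| 0.6 | 80 | 63 | 58 | 49 | 40 | 27 | 17 | 8 | 0 |
| 0.7 | 80 | 63 | 58 | 49 | 40 | 27 | 17 | 8 | 0 |
| 0.8 | 80 | 63 | 58 | 49 | 40 | 27 | 17 | 8 | 0 |
| 0.9 | 80 | 63 | 58 | 49 | 40 | 27 | 17 | 8 | 0 |
| 0.95 | 80 | 63 | 58 | 49 | 40 | 27 | 17 | 8 | 0 |
| 0.99 | 80 | 63 | 58 | 49 | 40 | 27 | 17 | 8 | 0 |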

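| Prec. \ Recall | 0.3 | 0.4 | 0.5 | 0.6 | 0.7 | 0.8 | 0.9 | 0.95 | 0.99 |
|---|---|---|---|---|---|---|---|---|---|
| 0.3 | 45 | 33 | 31 | 27 | 25 | 19 | 13 | 7 | 0 |
| 0.4 | 45 | 33 | 31 | 27 | 25 | 19 | 13 | 7 | 0 |
| 0.5 | 45 | 33 | 31 | 27 | 25 | 19 | 13 | 7 | 0 |
| 0.6 | 45 | 33 | 31 | 27 | 25 | 19 | 13 | 7 | 0 |
| 0.7 | 45 | 33 | 31 | 27 | 25 | 19 | 13 | 7 | 0 |
| 0.8 | 45 | 33 | 31 | 27 | 25 | 19 | 13 | 7 | 0 |
| 0.9 | 45 | 33 | 31 | 27 | 25 | 19 | 13 | 7 | 0 |
| 0.95 | 45 | 33 | 31 | 27 | 25 | 19 | 13 | 7 | 0 |
| 0.99 | 45 | 33 | 31 | 27 | 25 | 19 | 13 | 7 | 0 |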

jakobnissen commented 3 years ago

> Why didn't you add a separator to this command? I'm confused about the "pooled assemblies" you mentioned.

Separating the bins based on their sample of origin is only possible if the sample of a given contig can be known. In the case of the MetaHIT dataset, the contigs did not come from individual samples. I seem to recall the contigs were actually sampled from individually sequenced genomes.
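To make the dependence on contig names concrete, here is a minimal Python sketch of what splitting bins by sample of origin amounts to. It is an illustration only, not VAMB's actual implementation: the function name `split_bins_by_sample`, the contig naming scheme `<sample><separator><contig id>`, and the example names such as `S1C10` are all assumptions made for this example.

```python
from collections import defaultdict


def split_bins_by_sample(bins, separator):
    """Split each bin into one bin per sample of origin.

    Assumption (hypothetical naming scheme): every contig is named
    "<sample><separator><contig id>", e.g. "S1C10" with separator "C".
    """
    split = {}
    for bin_name, contigs in bins.items():
        per_sample = defaultdict(set)
        for contig in contigs:
            # Everything before the first separator is taken as the sample.
            sample, sep, rest = contig.partition(separator)
            if not sep or not sample or not rest:
                # Without a recognisable sample prefix the origin is unknown.
                raise ValueError(f"cannot infer sample of contig {contig!r}")
            per_sample[sample].add(contig)
        for sample, members in per_sample.items():
            split[f"{sample}{separator}{bin_name}"] = members
    return split


# Two bins from a hypothetical two-sample run; bin "1" mixes both samples.
bins = {"1": {"S1C10", "S1C11", "S2C7"}, "2": {"S2C8"}}
result = split_bins_by_sample(bins, "C")
```

In this sketch, bin `1` is split into `S1C1` and `S2C1` because its contigs come from two samples. A contig whose name does not encode a sample (for example a hypothetical `contig_42`) raises `ValueError`, which mirrors the point above: if the sample of a contig cannot be known, the split is not meaningful.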

> How can we run VAMB on the MetaHIT dataset in single-sample mode and in multi-split mode, as you did in Supplementary Figures 18-19 of the paper?

You should run VAMB 3.0.1 with default parameters. In single-sample mode, simply give it a single BAM file. In multi-split mode, give it multiple BAM files, and also a "binsplit separator".
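As a rough illustration of the difference between the two modes, the Python sketch below assembles the two command lines. The `--outdir`, `--fasta` and `-o` options appear in the commands quoted earlier in this thread; `--bamfiles` is assumed here to be the VAMB 3.x option for passing BAM files and should be checked against `vamb --help` for the installed version. The helper `vamb_command`, all file paths, and the separator `C` are hypothetical.

```python
import shlex


def vamb_command(outdir, fasta, bamfiles, separator=None):
    """Build a VAMB argument list (hypothetical helper).

    Passing `separator` adds the binsplit option `-o`, i.e. multi-split mode.
    """
    args = ["vamb", "--outdir", outdir, "--fasta", fasta,
            "--bamfiles", *bamfiles]  # --bamfiles: assumed VAMB 3.x option
    if separator is not None:
        args += ["-o", separator]
    return args


# Single-sample mode: one BAM file, no binsplit separator.
single = vamb_command("results/single", "contigs.fna.gz", ["sample1.bam"])

# Multi-split mode: several BAM files plus a binsplit separator.
multi = vamb_command(
    "results/multisplit",
    "catalogue.fna.gz",
    ["sample1.bam", "sample2.bam", "sample3.bam"],
    separator="C",
)

print(shlex.join(single))
print(shlex.join(multi))
```

The separator only makes sense if the contig names in the FASTA file encode the sample of origin, which is why it is left out in the single-sample case.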

> In the same paper, you reported the number of genomes (at strain level), species, and genera reconstructed with a precision of at least 95% using VAMB on MetaHIT in Supplementary Tables 2, 4, and 5. Did you obtain these results by running the command I referred to above?

Yes, precisely.