RasmussenLab / vamb

Variational autoencoder for metagenomic binning
MIT License
242 stars 44 forks source link

ValueError: Length of TNFs and length of RPKM does not match. Verify the inputs #65

Closed jolespin closed 3 years ago

jolespin commented 3 years ago
(vamb_env) -bash-4.1$ vamb --fasta mage_output/M-1507-133.A/intermediate/assembly_output/scaffolds.fasta --jgi coverage_output/coverage_metabat2.tsv  --outdir vamb_output
Traceback (most recent call last):
  File "/usr/local/devel/ANNOTATION/jespinoz/anaconda3/envs/vamb_env/bin/vamb", line 11, in <module>
    sys.exit(main())
  File "/usr/local/devel/ANNOTATION/jespinoz/anaconda3/envs/vamb_env/lib/python3.6/site-packages/vamb/__main__.py", line 528, in main
    logfile=logfile)
  File "/usr/local/devel/ANNOTATION/jespinoz/anaconda3/envs/vamb_env/lib/python3.6/site-packages/vamb/__main__.py", line 247, in run
    len(tnfs), minalignscore, minid, subprocesses, logfile)
  File "/usr/local/devel/ANNOTATION/jespinoz/anaconda3/envs/vamb_env/lib/python3.6/site-packages/vamb/__main__.py", line 121, in calc_rpkm
    raise ValueError("Length of TNFs and length of RPKM does not match. Verify the inputs")
ValueError: Length of TNFs and length of RPKM does not match. Verify the inputs

Here's what the output of jgi_summarize_bam_contig_depths looks like:

(vamb_env) -bash-4.1$ head coverage_output/coverage_metabat2.tsv
contigName  contigLen   totalAvgDepth   sorted.bam  sorted.bam-var
NODE_1_length_581408_cov_10.671907  581408  17.0497 17.0497 24.4313
NODE_2_length_212490_cov_11.140151  212490  17.7493 17.7493 26.3056
NODE_3_length_56611_cov_10.039571   56611   16.0747 16.0747 24.7309
NODE_4_length_52215_cov_10.245380   52215   16.4059 16.4059 20.9325
NODE_5_length_49788_cov_11.464963   49788   18.376  18.376  28.3959
NODE_6_length_44487_cov_9.390124    44487   15.069  15.069  20.5564
NODE_7_length_41442_cov_10.399425   41442   16.6383 16.6384 22.4833
NODE_8_length_37801_cov_9.536534    37801   15.3226 15.3226 25.3435
NODE_9_length_28654_cov_10.767824   28654   17.234  17.234  22.4427

It's the right number of rows too (n-1 for the headers)

(vamb_env) -bash-4.1$ grep -c "^>" mage_output/M-1507-133.A/intermediate/assembly_output/scaffolds.fasta
25728
(vamb_env) -bash-4.1$ wc -l coverage_output/coverage_metabat2.tsv
25729 coverage_output/coverage_metabat2.tsv

Here's the version:

(mage_env) -bash-4.1$ conda list | grep "vamb"
vamb                      3.0.2            py36hc5360cc_1    bioconda
simonrasmu commented 3 years ago

Hi Jolespin,

This is strange, what happens if you use use -m to set minimum contig length to 2000? Can you also paste the log-file?

Second, you are running on a single sample? If you have more samples you should assemble them one-by-one and then run Vamb all samples simultaneously (multi-split mode).

Best,

Simon

jolespin commented 3 years ago

After looking into the documentation a bit more I'm thinking this tool might not be the best for my current purpose since these are subsets of reads that I've assembled and there are far fewer contigs than 50k (~20k w/ about 500 long contigs).

However, the fact that you can assemble genomes separately and bin them w/ multi-split mode is absolutely amazing and will DEFINITELY be using this in the future for larger projects. Is there a tl;dr for the multi-split mode? For example, does it make consensus bins from all the samples and, if so, how does it reduce the redundancy?

Thanks again and I'm looking forward to integrating your package into the workflows at my institute.

jakobnissen commented 3 years ago

Thanks for using Vamb.

Multi-split is really dirt simple. After assembling individual samples, they are binned together. We then simply split each bin by sample - literally we just take all the contigs in bin 1 from sample 1 and put it in bin_1_1, contigs from bin 1 in sample 2 in bin_1_2, etc. So there is no reduction of redundancy, you get the same genomes duplicated if they are present in multiple samples.

simonrasmu commented 3 years ago

Ok sounds good!

I think Vamb could work with fewer samples and contigs (even though we only tested with 50,000 contigs). If you are interested we could have a go at your data.

Best,

Simon

jolespin commented 3 years ago

Thanks for using Vamb.

Multi-split is really dirt simple. After assembling individual samples, they are binned together. We then simply split each bin by sample - literally we just take all the contigs in bin 1 from sample 1 and put it in bin_1_1, contigs from bin 1 in sample 2 in bin_1_2, etc. So there is no reduction of redundancy, you get the same genomes duplicated if they are present in multiple samples.

Ah that makes even more sense now that I think about it. For a bit I was doing coassemblies of all my samples but what I realized was that the genome bins that are pulled out were not necessarily biologically true and more a representative of the population. The fact that the bins are maintained through out and binned together is a good idea because you can 1) have something to cross reference between samples although the actual bins are different; 2) have bins that are more true to the real biological sequence; and 3) if the user desired to create a coassembly (let's say they want to create an abundance table of contigs instead of otus), they could simply map to the multisample bin and then reassemble.

Are the intermediate bin assignments stored in multi sample mode?

jakobnissen commented 3 years ago

Sort of. It's not stored, but the final bin name is named e.g. sample1_1 - depending on the names of your contigs - e.g, given the name of an output bin, you can always get the sample and the original bin.

jolespin commented 3 years ago

I reran this on another dataset and it worked fine:

COV=../Assembly/883.abundance.metabat2.tsv
FASTA=../Assembly/883.fasta
OUT=883/vamb_output
qsub -cwd -P 0594 -N job_883_vamb -j y -o job_883_vamb.o "source activate binning_env && vamb --outdir $OUT --fasta $FASTA --jgi $COV  --minfasta 200000"

Not sure if this will help anyone but I lost access to my raw reads and bam files b/c I'm ressurecting an old project. To run this, I made my own JGI formatted file from the metaSPAdes coverage info:

(binning_env) -bash-4.1$ head ../Assembly/883.fasta
>NODE_1_length_850338_cov_9.99141_ID_1
TCTTGTTATTTTTCTTTACAACAGATGTGCTTATTTCACCCATCATTAGGGTAATTAACT
CCTTATCCTTTACAGATTCTGTTTTAAGGTCTTCAGCAACCATTTGACCATTTTTCATAA
CAAAGATCGTTTCAGTAAATTCTTTTATTTCTTTTAATTTGTGAGTGATAAATACGATTG
TTTTTTCATCTTCTTTTGAAAGAGTCTCAAGTGAATTTTTTAGCTGGTTTGTTTCTTGAG
GTGTTAAAACAGCAGTAGGCTCATCAAGCAACATAATGCTGTTGTTATTGAATATTAATT
TCATGATTTCTATTTTTTGTTTTTCTCCTACTGAAAGATCAGAGATAATAGAGTCATGAT
TTACTGATAAAGAAAAAATTTCAGAATATTTTTTATATGCATCTTCAAAATCATTAATGT
AGATATTTGCCTTTGATAGGGTCAAGTTATCTTTAATAGTGAAATTTTCAATTAATCGAA
ATTCCTGATGAACCATACCAATTCCAAGTTTTGCTGCTAATTCAGGATTAAGCTTTTTTT
(binning_env) -bash-4.1$ head ../Assembly/883.abundance.metabat2.tsv
contigName  contigLen   totalAvgDepth   metagenomics    metagenomics-var
NODE_1_length_850338_cov_9.99141_ID_1   850338  9.99141 9.99141 0
NODE_2_length_696952_cov_6.08719_ID_3   696952  6.08719 6.08719 0
NODE_3_length_645813_cov_1.38757_ID_5   645813  1.38757 1.38757 0
NODE_4_length_618535_cov_4.37412_ID_7   618535  4.37412 4.37412 0
NODE_5_length_518114_cov_4.40507_ID_9   518114  4.40507 4.40507 0
NODE_6_length_481955_cov_1.4344_ID_11   481955  1.4344  1.4344  0
NODE_7_length_435255_cov_2.77715_ID_13  435255  2.77715 2.77715 0
NODE_8_length_428820_cov_3.7035_ID_15   428820  3.7035  3.7035  0
NODE_9_length_425291_cov_2.13358_ID_17  425291  2.13358 2.13358 0

Closing this issue.

HaraldBrolin commented 3 years ago

Hi, I found your VAMB algorithm very interesting and I wanted to evaluate the algorithm on a current project of mine. But I'm running into the same error as jolespin, in his original post. The failure occurs roughly ~15 mins into the run.

    raise ValueError("Length of TNFs and length of RPKM does not match. Verify the inputs")
ValueError: Length of TNFs and length of RPKM does not match. Verify the inputs

I'm evaluating previous Canopy results vs Vamb and I want to use the same input data for the algorithm to minimize differences, thus I've created my own JGI formatted file from the previous analysis. I'm running on a small subset of the data, only 10 samples, to evaluate resource allocation for the larger run of ~1,200 samples. I'm also using a filtered version of my JGI-file where I have removed low abundant genes. The totalAvgDepth is identical to the first sample in my JGI-file.

The command that I'm running using the snakemake script:

vamb --outdir vamb --fasta contigs.flt.fna.gz --jgi jgi_matrix/jgi.abundance.dat -o C -m 2000 --minfasta 500000 --outdir vamb 2>log/vamb/vamb.log

zgrep "^>" contigs.flt.fna.gz | wc -l
15186403

wc -l jgi_matrix/jgi.abundance.dat
14143390 

This is the current VAMB version that I'm using: vamb 3.0.2 py37h73a75cf_1

his is how I've formatted the JGI input:

head jgi_matrix/jgi.abundance.dat 
contigName  contigLen   totalAvgDepth   s_4026  s_4027  s_4023  R0921   R0920   R0923   R0922   R0925   R0924   R0927
R0592_scaffold6566_3_gene27949:length_1518:complete 1518    21.26   21.26   290.98  177.74  32.04   26.76   72.38   6.18    65.84   131.28  13.58
R0592_scaffold6566_3_gene27957:length_489:complete  489 2.84    2.84    66.5    38.76   12.46   9.96    17.86   1.68    12.06   28.26   0.5
R0592_scaffold6566_3_gene27958:length_1560:complete 1560    14.74   14.74   190.18  73.28   41.24   20.78   36.44   7.4 37.98   81.96   9.24
R0592_scaffold6566_3_gene27959:length_609:complete  609 8.1 8.1 103.7   45.44   7.62    5.34    22.26   2.34    15.04   29.36   1.0

Here is what the --fasta input contigs.flt.fna.gz looks like:

zless contigs.flt.fna.gz | head -n 50
>R0592_scaffold6566_3_gene27949:length_1518:complete
ATGAAAAAATCGAAGGAAAAAATAAAAAAGATCCACAGTCTCCGCCATCAGATCACAGCA
TACTTTATCGGTCTGCTGTTTTTATCCATTCTGACTATTACCGTGATCAACGGAGCCTTT
CTTGAAAAATATTATGTGACAAAAAAGATCGATGTACTTCTGAATCTCCGAACAACTTTG
GAAAATACAGATATTGACAAGATGATGAACAGCAACAGTTCGAAAGGTATGGAGAGTATT
CCCGATGAGATCCAGAGAGCCTGCTCCAGAAACAATCTTTCCTGGGTTATCATTGACAGC
AGCAATACCAGCTGGCTTTCCTGGGGCGAAAATGAAAAAATGCTTCAGAGCAAGCTGTTC
GGATATGTCTATGATCTGGATGAGGACGGAGATAAGAGCCGGACCTTGAAGCAGGGTGAC
AATTATACCATCCAGCAGTCGCATGACCGTTTTGCGGGAATGGATTATGTGGAATGCTGG
GGACAGCTGGATGAGGAACATTATTTTATCATAAGGACACCCCTTGAGAGCATCCGGGAG
AGCGCTGATATTTCCAATAAGTTTTACTTTGCGGTAGGTCTGATGATCATAATCTTAAGT
GGTCTGGTCATTATGGTCGTGACCAGGAGGATCACCCGTCCCATCAGTGAGCTTACGGAA
CTTTCCAGAAAAATGTCAGATCTGGATTTTGAGGCAAAATACGAGAGCAAGGTCGGCAAT
GAGATCGATGTTCTGGGAGATAATTTCAATCGGATGTCCTCACAGCTGGAGACTACCATC
AGTGAACTGAAATCTGCAAATAATGAACTGCAGCGTGACATTGAGGACAAGATCAAGATC
GACAAGATGCGAAAGGAATTTCTGGATAACGTGTCCCATGAACTGAAAACTCCCATAGCC
CTGATCCAGGGCTATGCAGAAGGCTTAAAAGAAAATATTTCTGATGATCCTGAGAGCAGA
GAGTTCTATTGCGATGTCATTATGGATGAAGCCTCCAAGATGAATAAACTGGTAAAAAAT
CTTCTCACACTGAATCAGCTGGAATCCGGCAAAGATGCAGTTGTAATGGAGCGGTTTGAT
ATCGTATCCCTGATCCGCGGCGTGCTTCAGACCATGAATATCATGATCGGGCAGAAGAAT
GCAAAAGTGATCTTTGAGGCGGAGAAGCCGGTTTATGTATGGGCAGATGAGTTTAAGATC
GAAGAGGTTGTGACAAACTATGTAAGCAATGCCCTGAACCATCTGGAGGGAGAGAAAGAA
ATCGAGATCAAGCTGCAGGATGAGGGTAACCGTGTGAAGATCAGTGTGTTTAACACAGGG
ACTCCGATCCCTGAGGCAGATGTACCGAATCTCTGGAATAAATTTTATAAAGTTGATAAA
GCGAGAACCAGGGAATACGGCGGAAGCGGTATCGGTTTATCCATTGTAAAGGCCATCATG
GAAAGCATGAACCAGGAATATGGCGTACAGAACTTTGATAATGGCGTGGAATTCTGGTTT
ACACTGGACCGGAAATAA
>R0592_scaffold6566_3_gene27957:length_489:complete
ATGCTGGATCTGGAAGAACGAAACCGTGCAAGAAAGAAAGCCATGCGGCTTCTGGAACAT
ATGGACCGCACAGAGAAAGGCCTGACGGATAAACTCCGTCAGGCTGAATTTTCACCGGAG
GCAGTAGAAGATGCCATTGCTTATGTGAAGTCCTATGGGTATATCAACGATGCCCGTTAT
GCCAGAACCTATATCTCCTTCCGTATGGAATCCAGAAGTCGTCAGAAGATCCTCTCGGAG
CTTCAGTTAAAGGGTGTGGACCGTCAGACTGCCCTGGATGCCTGGAATGAGATGGAAGAG
CTTATGGAACCGGACGAGCGGGAGGTATTACGCAGGACGATCGAAAAGAAATATGCGCCG
GACACAGAGCTTGATGAAGGACAGATGCGAAGGCTGTATGGATATCTGGCCAGACGAGCA
TTCCGGTATGAGGATATTTCCCATACCCTGGAAGAAATGAACATCCGGGTGAACCGTGAT
ATGGAATAA

This is the log/vamb/vamb.log:

Traceback (most recent call last):
  File "path/bin/vamb/vamb_16_04_21/bin/vamb", line 11, in <module>
    sys.exit(main())
  File "path/bin/vamb/vamb_16_04_21/lib/python3.7/site-packages/vamb/__main__.py", line 528, in main
    logfile=logfile)
  File "path/b2014011/bin/vamb/vamb_16_04_21/lib/python3.7/site-packages/vamb/__main__.py", line 247, in run
    len(tnfs), minalignscore, minid, subprocesses, logfile)
  File "path/b2014011/bin/vamb/vamb_16_04_21/lib/python3.7/site-packages/vamb/__main__.py", line 121, in calc_rpkm
    raise ValueError("Length of TNFs and length of RPKM does not match. Verify the inputs")
ValueError: Length of TNFs and length of RPKM does not match. Verify the inputs

Best, Harald

jakobnissen commented 3 years ago

Dear @HaraldBrolin

The error comes because each contig needs both a TNF (which is obtained from the FASTA file), and an RPKM (which is obtained from the JGI input file). To fix the problem, you need to remove the sequences in the FASTA file for which you don't have entries in the JGI depths file.

The JGI file does not seem to be correctly formatted, either. It should look like this.

HaraldBrolin commented 3 years ago

Thanks for the quick reply @jakobnissen. I'll try to fix the issues and write an update.

HaraldBrolin commented 3 years ago

Hi @jakobnissen,

I've matched the genes/contigs in both the jgi.abundance.dat and the FASTA file (contigs.flt.fna.gz).

md5sum of genes/contig-names after removing the initial ">".

zgrep '^>' contigs.flt.fna.gz |  awk '{print substr($1,2); }'  | md5sum
29180c62cc3bea4879a62634c244b236  -

md5sum of contigNames after dropping the header.

tail -n +2 jgi.abundance.dat | awk {'print $1'} | md5sum
29180c62cc3bea4879a62634c244b236  -

Regarding the format of the jgi.abundance.dat, the only difference I can detect between my jgi-file and the file from this link: here, is the contig length and the variance-column. I can't do much about the lengths, but if I understood correctly Vamb can run on genes. I've added the variance column and as a test I've only kept one sampe.

contigName  contigLen   totalAvgDepth   sorted.bam  sorted.bam-var
R0592_scaffold6566_3_gene27949:length_1518:complete 1518    37  37  0
R0592_scaffold6566_3_gene27957:length_489:complete  489 5   5   0
R0592_scaffold6566_3_gene27958:length_1560:complete 1560    25  25  0

I'm running the same command as previously: vamb --outdir vamb --fasta contigs.flt.fna.gz --jgi jgi_matrix/jgi.abundance.dat -o C -m 2000 --minfasta 500000 --outdir vamb 2>log/vamb/vamb.log

And I'm getting the same error message as previously: ValueError: Length of TNFs and length of RPKM does not match. Verify the inputs

simonrasmu commented 3 years ago

Hi @HaraldBrolin and @jakobnissen

Chipping in on this.

  1. It is perfectly fine to run VAMB on genes - we did this in the preprint but replaced this with running on MAGs for the published paper. However, when you use the -o C flag you make it split the genes (or contigs for that matter) as if it was a combined assembly (which I assume it is not). E.g. -o C assumes that the contignames are as "SampleIDCcontigID", ie. it splits Samples and contigs (using the character "C") for when running multi-split mode. If your contig/gene names are not set up like this it will not work.

  2. You should run without -m 2000 as this removes contigs/genes below 2000bp (most of your genes will be).

Best,

Simon

HaraldBrolin commented 3 years ago

Hi @simonrasmu

I tried changing the settings in the Snakemake but I still end up with the same error message, I think it's time for me to try another approach. I wanted to avoid re-mapping my samples but I think I have to do it anyways.

Thanks for your support!

Best, Harald

jolespin commented 3 years ago

Complete side note, but I ended up running this on an ocean metagenome using the following hyper parameter grid:

I also used maxbin2 and metabat2. I combined all of these results as input into DAS_Tool and got great results. vamb was able to recover bins that weren't found via maxbin2 and metabat2.

Not sure if this will help anyone in the future.

simonrasmu commented 3 years ago

Sounds great - it is very nice that the methods are complementary and can be used to generate additional MAGs.