RasmussenLab / vamb

Variational autoencoder for metagenomic binning
MIT License

Error during RPKM loading for --jgi #28

Closed arghya1611 closed 4 years ago

arghya1611 commented 4 years ago

Hi, I am getting the following error when vamb is loading RPKM from a jgi depths file. Can you please let me know what the issue is?

```
(/pylon5/cc5piap/arghyam/vamb) [arghyam@login018 megahit-all-analysis]$ vamb --outdir vamb-genomes --fasta megahit1000.fa --jgi metabat1-genomes/megahit1000.fa.depth.txt -p 16
Traceback (most recent call last):
  File "/pylon5/cc5piap/arghyam/vamb/bin/vamb", line 11, in <module>
    sys.exit(main())
  File "/pylon5/cc5piap/arghyam/vamb/lib/python3.7/site-packages/vamb/main.py", line 476, in main
    logfile=logfile)
  File "/pylon5/cc5piap/arghyam/vamb/lib/python3.7/site-packages/vamb/main.py", line 219, in run
    len(tnfs), minalignscore, minid, subprocesses, logfile)
  File "/pylon5/cc5piap/arghyam/vamb/lib/python3.7/site-packages/vamb/main.py", line 125, in calc_rpkm
    shutil.rmtree(dumpdirectory)
UnboundLocalError: local variable 'dumpdirectory' referenced before assignment
```

Thanks.

jakobnissen commented 4 years ago

Dear @arghya1611

Looks like a silly mistake I made in the RPKM reading code. I've pushed a fix to the master branch. I'll test it tomorrow morning, but you can give it a spin now if you feel adventurous.
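For context, the traceback is the classic Python failure mode where cleanup code references a variable that is only assigned on some code paths. A minimal sketch of that pattern (illustrative only, not the actual vamb source):

```python
import shutil
import tempfile

def calc_rpkm_sketch(bampaths=None):
    # A dump directory is only created on the code path that reads BAM files ...
    if bampaths is not None:
        dumpdirectory = tempfile.mkdtemp()
        # ... depths would be computed from the BAM files here ...

    # ... but the cleanup runs unconditionally. With --jgi input there are no
    # BAM files, so this line raises:
    # UnboundLocalError: local variable 'dumpdirectory' referenced before assignment
    shutil.rmtree(dumpdirectory)
```

The fix is simply to only remove the dump directory on the code path that actually created it.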

arghya1611 commented 4 years ago

Dear Jakob, the problem seems to persist. I am getting the same error again, although only after the RPKM loading has progressed a bit.

```
vamb --outdir vamb-genomes --fasta megahit1000.fa --jgi metabat1-genomes/megahit1000.fa.depth.txt -p 24 -m 1000
Traceback (most recent call last):
  File "/pylon5/cc5piap/arghyam/vamb/bin/vamb", line 11, in <module>
    load_entry_point('vamb', 'console_scripts', 'vamb')()
  File "/pylon5/cc5piap/arghyam/vamb/vamb/vamb/main.py", line 476, in main
    logfile=logfile)
  File "/pylon5/cc5piap/arghyam/vamb/vamb/vamb/main.py", line 219, in run
    len(tnfs), minalignscore, minid, subprocesses, logfile)
  File "/pylon5/cc5piap/arghyam/vamb/vamb/vamb/main.py", line 125, in calc_rpkm
    shutil.rmtree(dumpdirectory)
UnboundLocalError: local variable 'dumpdirectory' referenced before assignment
```

Please let me know what you find on your end. Thanks.

jakobnissen commented 4 years ago

It looks like you are still running the version with the bug. To get the new version, re-download Vamb (or use git pull) and check out the master branch.

jakobnissen commented 4 years ago

Re: your initial comment (which I can't seem to find now; perhaps it was deleted?).

A conservative guess is that it will take 15 hours, but it really depends on the number of samples, your hardware, the number of CPUs you give it, the speed of your storage, the BLAS library your PyTorch uses, and so on. On our cluster with 8 slow-ish CPU cores, it takes about 3 hours for the CAMI2 Oral dataset (200k contigs). By far the most time is spent on training, which scales linearly with the number of contigs. From that I'd estimate 8 hours; double it to be more conservative and you get roughly 15 hours. Although see issue #27.
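For what it's worth, the arithmetic behind that estimate, assuming roughly linear scaling in the number of contigs (the contig count below is a placeholder; plug in your own):

```python
# Back-of-the-envelope runtime extrapolation, assuming training time scales
# roughly linearly with the number of contigs.
reference_hours = 3.0        # CAMI2 Oral dataset (200k contigs) on 8 slow-ish CPU cores
reference_contigs = 200_000

your_contigs = 500_000       # placeholder: replace with your own contig count
estimate = reference_hours * your_contigs / reference_contigs
print(f"estimate: {estimate:.1f} h, conservative: {2 * estimate:.0f} h")
# estimate: 7.5 h, conservative: 15 h
```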

Vamb itself outputs the bins in the clusters.tsv file. Yes, you need an extra step to generate FASTA files from these clusters. It would be very easy to make Vamb generate the FASTA bins automatically, but I actually don't think it's a good idea, for a few reasons:

  1. Not all people need these files, and they can take up a lot of disk space. For many applications, just knowing which contigs belong to which clusters is good enough. Creating a multi-gigabyte directory would just be a nuisance to these people.

  2. Unlike other binners, Vamb outputs all bins, even the single-contig ones. This means you could easily end up with 100,000 bins, the large majority of which are artifacts. Other binners produce these fragmented clusters, too, but usually don't output them. The choice then becomes to either not output small clusters, which would make Vamb useless for people fishing for plasmids and viruses, or else risk creating 100,000 files in someone's filesystem, which can cause distributed compute clusters to behave very poorly to the annoyance of everybody.

  3. This is less important than the other reasons, but the only efficient way of generating the bins would be to load the sequences into memory. This could quickly consume all available memory if you're not careful.

For these reasons, it's better if the user creates the bins themselves. Of course, I'd want that to be as smooth for the user as possible. Currently, they have to run the file src/create_fasta.py from the command line. That's pretty easy, but it can be annoying to have to request more time from a cluster after the initial Vamb run. Perhaps I could add an option to Vamb like this:

--minfasta
minimum bin size to output as fasta [None = no output]

The user can then specify, for instance, --minfasta 200000 to create a directory containing all bins of more than 200,000 nucleotides. This way, the user would have to actively set an option to create FASTA bins, and would likely not be surprised by 100,000 small files. Would that option be useful to you?
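In the meantime, creating the bins yourself only takes a little Python. A rough sketch, assuming clusters.tsv is tab-separated with a cluster name and a contig name per line (illustrative only; the bundled src/create_fasta.py is the supported way, and note that this loads all contigs into memory, cf. point 3 above):

```python
import os
from collections import defaultdict

def read_fasta(path):
    """Minimal FASTA reader: returns {contig name: sequence}."""
    sequences, name, chunks = {}, None, []
    with open(path) as file:
        for line in file:
            line = line.rstrip()
            if line.startswith(">"):
                if name is not None:
                    sequences[name] = "".join(chunks)
                name, chunks = line[1:].split()[0], []
            else:
                chunks.append(line)
    if name is not None:
        sequences[name] = "".join(chunks)
    return sequences

def write_bins(fastapath, clusterspath, outdir, minsize=200_000):
    """Write one FASTA file per cluster with at least `minsize` nucleotides."""
    os.makedirs(outdir, exist_ok=True)
    contigs = read_fasta(fastapath)   # loads every contig into memory

    clusters = defaultdict(list)
    with open(clusterspath) as file:
        for line in file:
            if line.startswith("#"):
                continue  # skip any header/comment lines
            clustername, contigname = line.rstrip("\n").split("\t")
            clusters[clustername].append(contigname)

    for clustername, members in clusters.items():
        if sum(len(contigs[c]) for c in members) < minsize:
            continue  # skip small bins, analogous to the proposed --minfasta
        with open(os.path.join(outdir, clustername + ".fna"), "w") as out:
            for contigname in members:
                out.write(f">{contigname}\n{contigs[contigname]}\n")

# Example use (paths from this thread):
# write_bins("megahit1000.fa", "vamb-genomes/clusters.tsv", "bins", minsize=200_000)
```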

arghya1611 commented 4 years ago

Yes, I deleted the comment to maintain continuity for the error. Sorry about that.

I used a 28-core (Intel Haswell, 2.3-3.3 GHz) node with 125 GB RAM to run my 2-gigabase-pair contig file with around 500,000 contigs. The run ended with the following output:

Clustered 224003 contigs in 180013 bins
Clustered contigs in 611.93 seconds

The total run took 14860.74 seconds, or a little over 4 hours, to complete. In my experience, only MetaBat2 runs faster than Vamb. For the same dataset, MetaBat1 took nearly 8 hours, and a MaxBin2 run had to be terminated because it was still running after 48 hours. I never tried CONCOCT, as it is known to have long runtimes with results equal or inferior to MaxBin2. I am thinking of trying abawaca and MetaGen, but unlike Vamb, their help and documentation are quite poor. So performance-wise, Vamb is quite competitive. I am waiting on my CheckM results for the binned genomes for a final comparison with the other binners.

All of the binners I have used have a minimum size cutoff for output bins (generally set at 200,000), which, like you said, is very handy, as accurate downstream annotation of the bins requires a larger genome size. Adding an argument under the output section of Vamb with options for the following would be very useful:

output bins = True/False (default: True)
bin directory = <PATH>
minimum bin size = <INT> (default: 200000)
output unbinned contigs = True/False (default: False)

It would be great if you could also add the minimum bin size and unbinned-contigs arguments to the create_fasta.py script, as that would allow it to be used on its own as a utility for parsing results and would keep the two interfaces consistent.
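For example, an interface along these lines for the utility script (purely a hypothetical sketch of the proposed options, not the existing create_fasta.py):

```python
import argparse

# Hypothetical command line covering the options suggested above;
# the real src/create_fasta.py may look different.
parser = argparse.ArgumentParser(
    description="Write FASTA bins from a Vamb clusters.tsv file")
parser.add_argument("fasta", help="contig FASTA file used for binning")
parser.add_argument("clusters", help="clusters.tsv produced by Vamb")
parser.add_argument("outdir", help="directory to write the bin FASTA files to")
parser.add_argument("--minsize", type=int, default=200_000,
                    help="minimum bin size in nucleotides [200000]")
parser.add_argument("--unbinned", action="store_true",
                    help="also write contigs from bins below --minsize to unbinned.fna")
args = parser.parse_args()
```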

Thanks for your support and for a great binner. I hope the input helps.

jakobnissen commented 4 years ago

Alright, I'll add an argument to output bins. The concept of "unbinned" contigs makes little sense, though, unless you define it as "all contigs in bins smaller than the specified cutoff size". Actually that's probably a good idea to add.

Re: the speed of Vamb, it's a matter of scaling. Vamb is very slow for small datasets but scales better than any other method I know of. For very large datasets (millions of contigs), it's the fastest binner available. Canopy is the fastest for small datasets; it can run in seconds, but the quality of its output is quite poor.

Thanks for the user inputs. I'll close this issue, but you're of course welcome to open another one (or re-open this if the original issue was not properly fixed).