RasmussenLab / vamb

Variational autoencoder for metagenomic binning
MIT License

vamb for PacBio long reads assembly strange results #132

Open jianshu93 opened 1 year ago

jianshu93 commented 1 year ago

Hello Vamb team,

Using the same parameters as for short-read binning, I also ran Vamb on a PacBio long-read sequencing project. It turns out that nearly every long contig was assigned its own bin by Vamb (several thousand bins), while GraphMB, which takes the assembly graph into account, generated only about 100 bins, consistent with Concoct+MaxBin2+MetaBAT2+DAS_Tool (also about 100). I am wondering what the problem is with much longer contigs.

Thanks,

Jianshu

jakobnissen commented 1 year ago

That's interesting. We have begun benchmarking against synthetic PacBio reads and find that Vamb performs well, so this is surprising. It could be overfitting of Vamb's network - but then it's strange that GraphMB does not overfit.

I doubt it's the long contigs, since Illumina assemblies can also produce long contigs, and those bin just fine.

In general, I would expect GraphMB to be superior to Vamb on long-read data. It's an extension of Vamb that is tuned for Nanopore reads and also includes the assembly graph information.

simonrasmu commented 1 year ago

It could also be a matter of depth, i.e. how many samples do you have, and what is the average depth of the contigs? If the number of PacBio reads is small, depth could be low and throw off the abundance estimation.
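One quick way to check average contig depth is `samtools coverage` on a sorted, indexed BAM, which prints one row per contig including a `meandepth` column. A minimal sketch parsing that output and averaging depths; the two rows embedded below are invented for illustration only:

```python
# Sketch: average per-contig depth from `samtools coverage` output.
# Columns: #rname startpos endpos numreads covbases coverage meandepth meanbaseq meanmapq
# The sample rows are made up; in practice, read the real TSV from
# `samtools coverage contigs.bam`.
coverage_tsv = """#rname\tstartpos\tendpos\tnumreads\tcovbases\tcoverage\tmeandepth\tmeanbaseq\tmeanmapq
contig_1\t1\t50000\t120\t49000\t98.0\t3.1\t35.2\t58.1
contig_2\t1\t80000\t900\t80000\t100.0\t14.6\t34.8\t59.0
"""

depths = []
for line in coverage_tsv.strip().splitlines():
    if line.startswith("#"):  # skip the header row
        continue
    fields = line.split("\t")
    depths.append(float(fields[6]))  # meandepth is the 7th column

mean_depth = sum(depths) / len(depths)
print(f"mean contig depth: {mean_depth:.2f}")  # (3.1 + 14.6) / 2 = 8.85
```

If the mean depth is very low (e.g. only a few reads per contig), Vamb's abundance signal carries little information, which could explain fragmented binning.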

echoduan commented 1 year ago

How do I run the vamb command for a PacBio long-read assembly?

I ran the command below:

sample=lichen2
threads=128
minimap2 -d ${sample}_rena.contigs.mmi ${sample}_rena.contigs.fasta
minimap2 -t 28 -N 5 -a --split-prefix mmsplit -t ${threads} ${sample}_rena.contigs.mmi ../${sample}.fasta.gz 2> 2-bam/${sample}_unsort.log | samtools view -F 3584 -b --threads ${threads} -o 2-bam/${sample}_unsort.bam
prefix=vamb_bin/${sample}
rm -rf vamb_bin
mkdir -p vamb_bin
vamb --outdir ${prefix} --fasta ${sample}_rena.contigs.fasta --bamfiles 2-bam/${sample}_unsort.bam -o C --minfasta 200000

But there is an error message:

[E::idx_find_and_load] Could not retrieve index file for '2-bam/lichen2_unsort.bam'
Traceback (most recent call last):
  File "/public/home/acq7wsloil/anaconda3/envs/busco/bin/vamb", line 11, in <module>
    sys.exit(main())
  File "/public/home/acq7wsloil/anaconda3/envs/busco/lib/python3.7/site-packages/vamb/main.py", line 528, in main
    logfile=logfile)
  File "/public/home/acq7wsloil/anaconda3/envs/busco/lib/python3.7/site-packages/vamb/main.py", line 251, in run
    dropout, cuda, batchsize, nepochs, lrate, batchsteps, logfile)
  File "/public/home/acq7wsloil/anaconda3/envs/busco/lib/python3.7/site-packages/vamb/main.py", line 150, in trainvae
    logfile=logfile, modelfile=modelpath)
  File "/public/home/acq7wsloil/anaconda3/envs/busco/lib/python3.7/site-packages/vamb/encode.py", line 466, in trainmodel
    raise ValueError('Last batch size exceeds dataset length')
ValueError: Last batch size exceeds dataset length

Could you help me fix it, or share your command with me, please?
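The ValueError above typically means the assembly has fewer contigs than Vamb's final (largest) batch. A hedged sketch of the arithmetic behind it; the starting batch size and doubling epochs below are assumed defaults, so check `vamb --help` on your installed version for the actual flags and values:

```python
# Sketch: why "Last batch size exceeds dataset length" can be raised.
# During training, Vamb doubles its batch size at each epoch listed in
# batchsteps. The values here are assumptions, not guaranteed defaults.
start_batchsize = 256            # assumed starting batch size
batchsteps = [25, 75, 150, 225]  # assumed epochs at which batch size doubles

final_batchsize = start_batchsize * 2 ** len(batchsteps)
print(final_batchsize)  # 256 * 2**4 = 4096

# If fewer contigs survive the minimum-length filter than the final batch
# size, the last batch cannot be filled, and Vamb raises the ValueError.
n_contigs = 1500  # hypothetical count for a small single-sample assembly
if final_batchsize > n_contigs:
    print("would raise: Last batch size exceeds dataset length")
```

In practice this tends to happen with small single-sample assemblies; lowering the starting batch size and/or the batch-doubling steps via the command-line options (see `vamb --help`) avoids the error.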