RasmussenLab / phamb

Downstream processing of VAMB binning for Viral Elucidation
MIT License

Binning question, how to use vamb? #54

Open ChaoXianSen opened 11 months ago

ChaoXianSen commented 11 months ago

Hi! Before running phamb, I used vamb for the binning step. The vamb command was: `vamb --outdir output63 --fasta R63.contigs.fa.gz --bamfiles R63_sort.bam -o C`

The err.log reports:

    Traceback (most recent call last):
      File "/public/home/bioinfo_wang/00_software/miniconda3/envs/avamb/bin/vamb", line 33, in <module>
        sys.exit(load_entry_point('vamb', 'console_scripts', 'vamb')())
      File "/public/home/bioinfo_wang/00_software/vamb/vamb/main.py", line 1395, in main
        run(
      File "/public/home/bioinfo_wang/00_software/vamb/vamb/main.py", line 834, in run
        cluster(
      File "/public/home/bioinfo_wang/00_software/vamb/vamb/main.py", line 665, in cluster
        clusternumber, ncontigs = vamb.vambtools.write_clusters(
      File "/public/home/bioinfo_wang/00_software/vamb/vamb/vambtools.py", line 440, in write_clusters
        for clustername, contigs in clusters:
      File "/public/home/bioinfo_wang/00_software/vamb/vamb/vambtools.py", line 701, in binsplit
        for newbinname, splitheaders in _split_bin(binname, headers, separator):
      File "/public/home/bioinfo_wang/00_software/vamb/vamb/vambtools.py", line 676, in _split_bin
        raise KeyError(f"Separator '{separator}' not in sequence label: '{header}'")
    KeyError: "Separator 'C' not in sequence label: 'k141_84347'"

But the result does contain 'k141_84347'. Running `less contignames | grep "k141_84347" -A2 -B2` gives: 'k141_512747 k141_170723 k141_84347 k141_170724 k141_512748'

The vamb output directory contains these files:

    0     Oct  9 23:52 vae_clusters.tsv   # why is this file empty?
    7.7M  Oct  9 23:52 contignames
    2.6M  Oct  9 23:52 lengths.npz
    41K   Oct  9 23:52 log.txt
    77M   Oct  9 23:52 latent.npz
    815K  Oct  9 23:51 model.pt
    894   Oct  9 14:40 mask.npz
    2.3M  Oct  9 14:40 abundance.npz
    252M  Oct  9 14:38 composition.npz

Thanks!

joacjo commented 11 months ago

Hi Chao

I see, this will not fly. It looks like there is an issue in the steps where you ran vamb: your contig names have to follow a strict format that makes it possible to identify the original sample of each contig. Once that works and you get a non-empty vae_clusters.tsv file, the downstream phamb step should be straightforward.

Maybe @enryH can help you identify what went wrong.
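
A quick way to check this yourself, as a rough sketch assuming the `-o C` separator and the file name from the command above, is to count contig names that lack the separator (vamb only looks at the part of the header before the first whitespace):

    # Count contig names without the 'C' separator passed via "-o C".
    # Any non-zero count will trigger the KeyError shown above.
    zcat R63.contigs.fa.gz | grep '^>' | awk '{print substr($1, 2)}' | grep -vc 'C'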

enryH commented 11 months ago

> KeyError: "Separator 'C' not in sequence label: 'k141_84347'"

So what is your separator? I guess you know VAMB better than me :)

ChaoXianSen commented 11 months ago

The file R63.contigs.fa.gz is formatted like this:

    >k141_514373 flag=1 multi=3.0000 len=356
    CCATAAATCTGATTTTAGTCAAAAAAATATGCAGTTTTTCAAAAAGGGTGTATAATTCTTTCGTTACATGAAATATTTTGGAGGTGCTATTTTTATGAAAA
    >k141_0 flag=1 multi=2.0000 len=345
    AGGTGAAGATGACCGAAGAGGAGATTAAGGCCCGTGAGTTTGCCAAAGCGGCGCAGAAGGAGAAGGAGGACCGTGAGGCCAAGAAAGCGCTGG
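
(These are raw MEGAHIT headers, so the contig name before the first whitespace contains no sample separator. For illustration only, and if I recall the default behaviour of vamb's concatenate.py correctly, the renamed headers in the concatenated catalogue look roughly like this, with the sample recoverable by splitting on the 'C':)

    >S1C1
    >S1C2
    >S2C1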

joacjo commented 10 months ago

Hi Chao - did you manage to run VAMB as described in the example on the VAMB GitHub repo? I think that will solve many of the issues you have posted here. I also hope that you are processing more than just one sample, of course.

  1. Concatenate assemblies from different samples

         python concatenate.py /path/to/catalogue.fna.gz /path/to/assemblies/sample1/contigs.fasta /path/to/assemblies/sample2/contigs.fasta [ ... ]

  2. Do the mapping (one BAM per sample; see the loop sketch further down)

         minimap2 -d catalogue.mmi /path/to/catalogue.fna.gz; # make index
         minimap2 -t 8 -N 5 -ax sr catalogue.mmi --split-prefix mmsplit /path/to/reads/sample1.fw.fq.gz /path/to/reads/sample1.rv.fq.gz | samtools view -F 3584 -b --threads 8 > /path/to/bam/sample1.bam

  3. Run VAMB

         vamb --outdir path/to/outdir --fasta /path/to/catalogue.fna.gz --bamfiles /path/to/bam/*.bam -o C

It will be hard to help if the naming of the contigs in the fasta file and in the `clusters.tsv` file is not correctly formatted.
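
To spell out step 2 for several samples, here is a minimal loop sketch (the sample names and paths are placeholders; adjust them to your data):

    # map each sample separately against the shared catalogue
    for s in sample1 sample2 sample3; do
        minimap2 -t 8 -N 5 -ax sr catalogue.mmi --split-prefix mmsplit_${s} \
            /path/to/reads/${s}.fw.fq.gz /path/to/reads/${s}.rv.fq.gz |
            samtools view -F 3584 -b --threads 8 > /path/to/bam/${s}.bam
    done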
ChaoXianSen commented 10 months ago


Dear @joacjo, thanks for your reply! I have solved this problem. But I have another question:
After running the RF model, the results include the file 'resultdir/vamb_bins/vamb_bins.1.fna'. How should I cluster these sequences to get vOTUs? Which tools and databases can we use to get the vOTUs?

Thanks a lot !

joacjo commented 10 months ago

Hi Chao

The most straightforward way to get vOTUs is to follow the approach described by the CheckV people.

Copied from https://bitbucket.org/berkeleylab/checkv/src/master/

Rapid genome clustering based on pairwise ANI

First, create a blast+ database:

    makeblastdb -in <my_seqs.fna> -dbtype nucl -out <my_db>

Next, use megablast from the blast+ package to perform an all-vs-all blastn of the sequences:

    blastn -query <my_seqs.fna> -db <my_db> -outfmt '6 std qlen slen' -max_target_seqs 10000 -out <my_blast.tsv> -num_threads 32

Next, calculate pairwise ANI by combining local alignments between sequence pairs:

    anicalc.py -i <my_blast.tsv> -o <my_ani.tsv>

Finally, perform UCLUST-like clustering using the MIUVIG recommended parameters (95% ANI + 85% AF):

    aniclust.py --fna <my_seqs.fna> --ani <my_ani.tsv> --out <my_clusters.tsv> --min_ani 95 --min_tcov 85 --min_qcov 0

The relevant ANI scripts (anicalc.py and aniclust.py) can be found in the CheckV repo.
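
To go from the clusters to a vOTU fasta, one common follow-up step (a sketch, assuming the first column of my_clusters.tsv holds each cluster's representative, as in the CheckV scripts, and that seqkit is installed) is:

    # keep the representative sequence of each cluster as the vOTU set
    cut -f1 my_clusters.tsv > votu_representatives.txt
    seqkit grep -f votu_representatives.txt my_seqs.fna > vOTUs.fna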

ChaoXianSen commented 10 months ago

Dear @joacjo, thanks a lot! Thanks for the reminder, I'll go over the CheckV instructions in detail.

But I have another problem with this step:

    hmmsearch --cpu 16 \
        -E 1.0e-05 \
        -o R1_output.txt \
        --tblout ./R1_annotations/all.hmmVOG.tbl \
        AllVOG.hmm ./R1_proteins.faa

This step runs very slowly. Is there anything I can do to speed it up?

I provide 16 CPUs and 240G memory (#SBATCH -n 16, #SBATCH --mem=240G). The file 'AllVOG.hmm' is about 3.2G and 'R1_proteins.faa' is about 630M.
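
One commonly used workaround, sketched here assuming seqkit is available (chunk count and file names are placeholders), is to split the protein file into chunks and run several smaller hmmsearch jobs in parallel, since a single hmmsearch process rarely scales well to 16 CPUs; the per-chunk --tblout tables can be concatenated afterwards:

    # split the proteins into 8 chunks
    seqkit split -p 8 R1_proteins.faa -O R1_chunks

    # run one hmmsearch per chunk in the background, with fewer threads each
    for f in R1_chunks/*.faa; do
        hmmsearch --cpu 2 -E 1.0e-05 -o /dev/null \
            --tblout ${f}.hmmVOG.tbl AllVOG.hmm ${f} &
    done
    wait

    # combine the per-chunk tables (each keeps its own '#' comment lines)
    cat R1_chunks/*.hmmVOG.tbl > ./R1_annotations/all.hmmVOG.tbl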