Open ChaoXianSen opened 1 year ago
Hi Chao
I see, this will not fly. It looks like there is an issue in the steps where you run vamb, your contignames has to be named with a strict format that makes you able to identify the original sample origin of each sample. When that works and you get a non zero vae_clusters.tsv file, the downstream step using phamb should be straight forward.
Maybe @enryH - Can help you identify what went wrong.
KeyError: "Separator 'C' not in sequence label: 'k141_84347'"
So what is your separator? I guess you know VAMB better than me:)
the file R63.contigs.fa.gz format like this:
k141_514373 flag=1 multi=3.0000 len=356 CCATAAATCTGATTTTAGTCAAAAAAATATGCAGTTTTTCAAAAAGGGTGTATAATTCTTTCGTTACATGAAATATTTTGGAGGTGCTATTTTTATGAAAA k141_0 flag=1 multi=2.0000 len=345 AGGTGAAGATGACCGAAGAGGAGATTAAGGCCCGTGAGTTTGCCAAAGCGGCGCAGAAGGAGAAGGAGGACCGTGAGGCCAAGAAAGCGCTGG
Hi Chao - did you manage to run VAMB as described in the example on the VAMB Github repo? I think that will solve many of your issues posted here. I hope that you are trying to process more than just one sample of course.
python concatenate.py /path/to/catalogue.fna.gz /path/to/assemblies/sample1/contigs.fasta /path/to/assemblies/sample2/contigs.fasta [ ... ]
minimap2 -d catalogue.mmi /path/to/catalogue.fna.gz; # make index
minimap2 -t 8 -N 5 -ax sr catalogue.mmi --split-prefix mmsplit /path/to/reads/sample1.fw.fq.gz /path/to/reads/sample1.rv.fq.gz | samtools view -F 3584 -b --threads 8 > /path/to/bam/sample1.bam
3. run VAMB
vamb --outdir path/to/outdir --fasta /path/to/catalogue.fna.gz --bamfiles /path/to/bam/*.bam -o C
It will be hard to help if the naming of the contigs in the fasta file and in the `clusters.tsv` file are not correctly formatted.
Hi Chao - did you manage to run VAMB as described in the example on the VAMB Github repo? I think that will solve many of your issues posted here. I hope that you are trying to process more than just one sample of course.
- Concatenate assemblies from different samples
python concatenate.py /path/to/catalogue.fna.gz /path/to/assemblies/sample1/contigs.fasta /path/to/assemblies/sample2/contigs.fasta [ ... ]
- Do the mapping
minimap2 -d catalogue.mmi /path/to/catalogue.fna.gz; # make index minimap2 -t 8 -N 5 -ax sr catalogue.mmi --split-prefix mmsplit /path/to/reads/sample1.fw.fq.gz /path/to/reads/sample1.rv.fq.gz | samtools view -F 3584 -b --threads 8 > /path/to/bam/sample1.bam
- run VAMB
vamb --outdir path/to/outdir --fasta /path/to/catalogue.fna.gz --bamfiles /path/to/bam/*.bam -o C
It will be hard to help if the naming of the contigs in the fasta file and in the
clusters.tsv
file are not correctly formatted.
Dear@joacjo
Thanks for your reply! I have solved this problem.
But, I have another question:
Run the RF model, the result file 'resultdir/vamb_bins/vamb_bins.1.fna' , how to cluster to get the vOTUs ?
which tools and databse can we use to get the vOTUs ?
Thanks a lot !
Hi Chao
The most straight forward to way to get vOTUs is following the approach described by the CheckV people
Copied from https://bitbucket.org/berkeleylab/checkv/src/master/
Rapid genome clustering based on pairwise ANI
First, create a blast+ database:
makeblastdb -in <my_seqs.fna> -dbtype nucl -out <my_db>
Next, use megablast from blast+ package to perform all-vs-all blastn of sequences:
blastn -query <my_seqs.fna> -db <my_db> -outfmt '6 std qlen slen' -max_target_seqs 10000 -o <my_blast.tsv> -num_threads 32
Next, calculate pairwise ANI by combining local alignments between sequence pairs:
anicalc.py -i <my_blast.tsv> -o <my_ani.tsv>
Finally, perform UCLUST-like clustering using the MIUVIG recommended-parameters (95% ANI + 85% AF):
aniclust.py --fna <my_seqs.fna> --ani <my_ani.tsv> --out <my_clusters.tsv> --min_ani 95 --min_tcov 85 --min_qcov 0
Relevant ani-scripts can be found the checkv repo.
Dear@joacjo Thanks a lot ! Thanks for your reminding. I'll go over the instructions of checkV in detail.
But i have another problem: ' hmmsearch --cpu 16 \ -E 1.0e-05 \ -o R1_output.txt \ --tblout ./R1_annotations/all.hmmVOG.tbl \ AllVOG.hmm ./R1_proteins.faa '
This step runs very slowly ,is there anything I can do to speed up ?
I provid 16CPUs and 240G memeoy. ( #SBATCH -n 16 #SBATCH --mem=240G ) the file 'AllVOG.hmm' is about 3.2G, file 'R1_proteins.faa' is about 630M;
Hi ! Before operating phamb, i use vamb process binning, Vamb rum mode: vamb --outdir output63 --fasta R63.contigs.fa.gz --bamfiles R63_sort.bam -o C
report err.log : Traceback (most recent call last): File "/public/home/bioinfo_wang/00_software/miniconda3/envs/avamb/bin/vamb", line 33, in sys.exit(load_entry_point('vamb', 'console_scripts', 'vamb')()) File "/public/home/bioinfo_wang/00_software/vamb/vamb/main.py", line 1395, in main run( File "/public/home/bioinfo_wang/00_software/vamb/vamb/main.py", line 834, in run cluster( File "/public/home/bioinfo_wang/00_software/vamb/vamb/main.py", line 665, in cluster clusternumber, ncontigs = vamb.vambtools.write_clusters( File "/public/home/bioinfo_wang/00_software/vamb/vamb/vambtools.py", line 440, in write_clusters for clustername, contigs in clusters: File "/public/home/bioinfo_wang/00_software/vamb/vamb/vambtools.py", line 701, in binsplit for newbinname, splitheaders in _split_bin(binname, headers, separator): File "/public/home/bioinfo_wang/00_software/vamb/vamb/vambtools.py", line 676, in _split_bin raise KeyError(f"Separator '{separator}' not in sequence label: '{header}'") KeyError: "Separator 'C' not in sequence label: 'k141_84347'"
But, the reuslt contain ‘k141_84347 ’ : ‘ less contignames |grep "k141_84347" -A2 -B2 ' --> 'k141_512747 k141_170723 k141_84347 k141_170724 k141_512748'
the vamb operation result file contain : '0 Oct 9 23:52 vae_clusters.tsv # why the file is empty? 7.7M Oct 9 23:52 contignames 2.6M Oct 9 23:52 lengths.npz 41K Oct 9 23:52 log.txt 77M Oct 9 23:52 latent.npz 815K Oct 9 23:51 model.pt 894 Oct 9 14:40 mask.npz 2.3M Oct 9 14:40 abundance.npz 252M Oct 9 14:38 composition.npz'
Thanks!