RasmussenLab / vCentenarian

Repository with supporting code and information for virome analysis of Centenarian metagenomes
3 stars 3 forks source link

Getting a vOTU query #2

Open AndAvia opened 2 months ago

AndAvia commented 2 months ago

hi, I download the vOTU.fna (There are 5251 vOTUs in there.)data in the text and used the MGV pipeline(https://github.com/snayfach/MGV/tree/master/viral_detection_pipeline), like this

conda activate hmmer
cd ~/virus1/temp/MGV/mgv/viral_detection_pipeline
cp ~/virus/viwrap_result/paper/vOTUs.fna ./input/
#Call viral genes
./prodigal.linux -i input/vOTUs.fna -a input/vOTUs.faa -d input/vOTUs.ffn -p meta -f gff > input/vOTUs.gff #5251
#Output files include proteins (.faa), genes (.ffn), and gene coordinates (.gff). Genes called in metagenomic mode (-p meta).

#Run HMMER on imgvr and pfam databases
hmmsearch -Z 1 --cpu 16 --noali --tblout output/imgvr.out input/imgvr.hmm input/vOTUs.faa
hmmsearch -Z 1 --cut_tc --cpu 16 --noali  --tblout output/pfam.out input/pfam.hmm input/vOTUs.faa
#Here the -Z 1 flag is specified to make E-values comparable between databases and between samples. The IMG/VR database contains viral genes while the Pfam database contains non-viral genes.

#Count genes hitting viral and microbial marker genes
python count_hmm_hits.py input/vOTUs.fna input/vOTUs.faa output/imgvr.out output/pfam.out > output/hmm_hits.tsv
#Each gene assigned according to its best hit with E-value <1e-10. We found that several of the HMMs from IMG/VR HMMs are commonly found in non-viral genomes and several of the HMMs from Pfam are commonly found in viral genomes. Those HMMs are excluded from the analysis.

#Run VirFinder
Rscript virfinder.R input/vOTUs.fna output/virfinder.tsv
#VirFinder scores each contig using a machine-learning algorithm based on kmers https://microbiomejournal.biomedcentral.com/articles/10.1186/s40168-017-0283-5

#Quantify strand switch rate of genes
python strand_switch.py input/vOTUs.fna input/vOTUs.faa > output/strand_switch.tsv
#Here the code scans the proteins from each genome in genomic order. It counts the nuber of strand switches (+ to - or - to +) and divides by the total number of genes.

#Create master table of sequence features
python master_table.py output/hmm_hits.tsv output/virfinder.tsv output/strand_switch.tsv > output/master_table.tsv

#Predict viral contigs
python viral_classify.py \
--features output/master_table.tsv \
--in_base input/vOTUs \
--out_base output/vOTUs #vOTUs.tsv  #最终得到2897个病毒

During the run no error was reported, all steps were 5251, after running the last step python viral_classify.py, finally only got 2897 viruses, in the literature the result was 4422, where is the problem, I see viral_classify.py used to a rules file classification_rules.tsv, is that file different from what you guys were using. Thanks for answering the questions!

joacjo commented 2 months ago

Hi Chuang Ma

Your result sounds quite different. You could run a tool like Vibrant to get closer to the set of 4422 vOTUs used in the analysis (see article "Furthermore, we also parsed vOTUs using VIBRANT (v.1.2.1, default settings) and found a 4240/4422 (96%) viral prediction agreement.").