apcamargo / genomad

geNomad: Identification of mobile genetic elements
https://portal.nersc.gov/genomad/
Other
168 stars 17 forks source link

Finding chimeric sequences #47

Closed bhimbbiswa closed 8 months ago

bhimbbiswa commented 8 months ago

Thank you very much for making this amazing tool. This is becoming very useful in my research.

I performed a metaSpades assembly on pairend 150bp Illumina reads and subsequently employed VirSorter2 to isolate viral sequences. Currently, I am utilizing geNomad to further refine my dataset by excluding non-viral sequences, determining taxonomy, and identifying potential chimeric viral sequences.

I ran geNomad using the following command. My fasta file contains 143117 sequences.

genomad end-to-end --enable-score-calibration --cleanup --threads 10 --composition metagenome VirSorter_combined.fasta VirSorter genomad_db

Regarding the chimeric sequences, those that contain genes from two distinct groups, such as "S03-NODE_15397_length_3234_cov_0.690698||full," I am considering as chimeric.

File: VirSorter_combined_virus_summary.tsv

seq_name    length  topology    coordinates n_genes genetic_code    virus_score fdr n_hallmarks marker_enrichment   taxonomy
S03-NODE_15397_length_3234_cov_0.690698||full   3234    No terminal repeats NA  4   11  0.9507  0.005   1   3.4366  Viruses
S05-NODE_6353_length_8129_cov_0.841540||full    8129    No terminal repeats NA  12  11  0.8916  0.0092  0   2.9931  Viruses;Bicaudaviridae

File: VirSorter_combined_genes.tsv

gene    start   end length  strand  gc_content  genetic_code    rbs_motif   marker  evalue  bitscore    uscg    plasmid_hallmark    virus_hallmark  taxid   taxname annotation_conjscan annotation_amr  annotation_accessions   annotation_description
S03-NODE_15397_length_3234_cov_0.690698||full_1 3   89  87  -1  0.379   11  AGxAGG/AGGxGG   NA  NA  NA  0   0   0   1   NA  NA  NA  NA  NA
S03-NODE_15397_length_3234_cov_0.690698||full_2 304 1230    927 -1  0.372   11  AGGAG   GENOMAD.044416.VV   3.03E-05    48  0   0   1   2561    Caudoviricetes  NA  NA  TIGR01537   NA
S03-NODE_15397_length_3234_cov_0.690698||full_3 1507    2265    759 1   0.461   11  AGGAG   NA  NA  NA  0   0   0   1   NA  NA  NA  NA  NA
S03-NODE_15397_length_3234_cov_0.690698||full_4 2622    3233    612 1   0.451   11  AGGAG   GENOMAD.096491.VV   5.21E-13    70  0   0   0   352 Marseilleviridae    NA  NA  NA  NA

S05-NODE_6353_length_8129_cov_0.841540||full_1  96  419 324 -1  0.358   11  GGA/GAG/AGG NA  NA  NA  0   0   0   1   NA  NA  NA  NA  NA
S05-NODE_6353_length_8129_cov_0.841540||full_2  601 1605    1005    1   0.506   11  AGxAGG/AGGxGG   NA  NA  NA  0   0   0   1   NA  NA  NA  NA  NA
S05-NODE_6353_length_8129_cov_0.841540||full_3  1620    2165    546 1   0.579   11  AGGAG   GENOMAD.167727.VP   3.31E-05    46  0   0   0   2561    Caudoviricetes  NA  NA  PF18306;COG4474;TIGR00725   Uncharacterized SPBc2 prophage-derived protein YoqJ
S05-NODE_6353_length_8129_cov_0.841540||full_4  2237    2680    444 -1  0.563   11  AGGAG/GGAGG NA  NA  NA  0   0   0   1   NA  NA  NA  NA  NA
S05-NODE_6353_length_8129_cov_0.841540||full_5  2673    3134    462 -1  0.552   11  GGAGG   NA  NA  NA  0   0   0   1   NA  NA  NA  NA  NA
S05-NODE_6353_length_8129_cov_0.841540||full_6  3197    4777    1581    -1  0.593   11  GGAGG   GENOMAD.184062.VV   1.80E-05    50  0   0   0   3654    Bicaudaviridae  NA  NA  COG4245;K16630  Uncharacterized conserved protein YegL, contains vWA domain of TerY type
S05-NODE_6353_length_8129_cov_0.841540||full_7  4901    5332    432 -1  0.528   11  AGGAG   NA  NA  NA  0   0   0   1   NA  NA  NA  NA  NA
S05-NODE_6353_length_8129_cov_0.841540||full_8  5450    5896    447 -1  0.597   11  GGAGG   NA  NA  NA  0   0   0   1   NA  NA  NA  NA  NA
S05-NODE_6353_length_8129_cov_0.841540||full_9  5915    7267    1353    -1  0.582   11  AGxAGG/AGGxGG   NA  NA  NA  0   0   0   1   NA  NA  NA  NA  NA
S05-NODE_6353_length_8129_cov_0.841540||full_10 7281    7631    351 -1  0.553   11  GGA/GAG/AGG NA  NA  NA  0   0   0   1   NA  NA  NA  NA  NA
S05-NODE_6353_length_8129_cov_0.841540||full_11 7631    7813    183 -1  0.563   11  GGAGG   NA  NA  NA  0   0   0   1   NA  NA  NA  NA  NA
S05-NODE_6353_length_8129_cov_0.841540||full_12 7810    8127    318 -1  0.513   11  None    NA  NA  NA  0   0   0   1   NA  NA  NA  NA  NA

Now, I have a question about the "Viruses;Bicaudaviridae" taxonomy. Upon examining the "genes.tsv" file, I noticed the presence of two different genes from different families. Should I consider this sequence as chimeric as well?

Thanks and regards,

Bhim

apcamargo commented 8 months ago

Hi @bhimbbiswa

bhimbbiswa commented 8 months ago

Dear Antônio Camargo

Thank you for your kind reply.

  • It might be a good idea to run geNomad on the whole assembly to get additional viruses. The intersection between VirSorter2 and geNomad might be too conservative. I've never really benchmarked this, but it's my expectation. You could take the union of the VirSorter2 and geNomad predictions and process it with CheckV.

Thank you very much for your suggestions. To clarify, I'm utilizing phamb, VirSorter2, and geNomad, and then consolidating their results in CheckV.

  • What do you mean by chimeric, biologically? If you mean a virus that has genes from two or more lineages, derived from HGT, I don't think that's a good approach to identify them. geNomad's taxonomy works well when it aggregates the taxonomy of multiple markers into a consensus. The taxonomy of individual genes can be noisy.

Actually not biologically, I have an additional step in my process where I use vRhyme for binning after running VirSorter2. However, I'm concerned that this step may result in a few bins containing sequences from two different taxonomically distinct viruses. So, I thought I can use geNomad to filter out those.

Thank and regards,

Bhim

apcamargo commented 8 months ago

Ohh, I see.

I've never benchmarked the performance of the taxonomy module on bins. Your logic makes sense, but I'd be careful calling those bins chimeras if the taxonomy of the contig was determined by a single gene. The taxonomy is much more reliable when it is derived from the consensus of multiple genes.

bhimbbiswa commented 8 months ago

Thank you very much for your kind suggestions. If any more doubts I will post again.

Regards,

Bhim