The integrase on prophages and other viruses issue

ChaoLab commented 6 months ago

Hi, I have run the GCF_009025895.1.fna as the input fasta file. When I looked at the GCF_009025895.1_find_proviruses/GCF_009025895.1_provirus.tsv,

seq_name        source_seq      start   end     length  n_genes v_vs_c_score    in_seq_edge     integrases
NZ_CP045015.1|provirus_1724523_1762986  NZ_CP045015.1   1724523 1762986 38464   58      77.7034 False   NA
NZ_CP045015.1|provirus_2885510_2934610  NZ_CP045015.1   2885510 2934610 49101   69      86.8599 False   NA
NZ_CP045015.1|provirus_3062427_3101502  NZ_CP045015.1   3062427 3101502 39076   42      54.4621 False   NA
NZ_CP045015.1|provirus_3855947_3906705  NZ_CP045015.1   3855947 3906705 50759   79      90.2594 False   NA
NZ_CP045015.1|provirus_4122492_4133364  NZ_CP045015.1   4122492 4133364 10873   13      12.6591 False   NA

the last column, I guess it shows the integrase hit numbers. But they are NA. Are they correct? Since I do find integrase results by aligning to the integrase db:

NZ_CP045015.1|provirus_1724523_1762986_1605 # 1725616 # 1726608 # -1 # ID=1_1605;partial=00;start_type=ATG;rbs_motif=GGA/GAG/AGG;rbs_spacer=5-10bp;genetic_code=11;gc_cont=0.448        PRK00236        5.098E-12       57
NZ_CP045015.1|provirus_4122492_4133364_3951 # 4122492 # 4123706 # 1 # ID=1_3951;partial=00;start_type=ATG;rbs_motif=AGGAG;rbs_spacer=5-10bp;genetic_code=11;gc_cont=0.503       PRK09692        2.917E-85       278
NZ_CP045015.1|provirus_3062427_3101502_2980 # 3099241 # 3100224 # 1 # ID=1_2980;partial=00;start_type=ATG;rbs_motif=GGAGG;rbs_spacer=5-10bp;genetic_code=11;gc_cont=0.477       PHA02601        7.151E-134      415
NZ_CP045015.1|provirus_4122492_4133364_3955 # 4125446 # 4126321 # 1 # ID=1_3955;partial=00;start_type=ATG;rbs_motif=GGAGG;rbs_spacer=5-10bp;genetic_code=11;gc_cont=0.568       PF00589.RP35    1.947E-07       42
NZ_CP045015.1|provirus_3855947_3906705_3743 # 3905542 # 3906705 # 1 # ID=1_3743;partial=00;start_type=ATG;rbs_motif=AGGAG;rbs_spacer=5-10bp;genetic_code=11;gc_cont=0.543       PF00589.seed    6.062E-38       138

It is a mistake or I mistakenly understand it?

My initial idea is to summarize all the integrases on prophages and other viruses, so I search all integrases on GCF_009025895.1_summary/GCF_009025895.1_virus_proteins.faa

apcamargo commented 6 months ago

This is weird indeed. How did you generate these results for the integrases? Can you share de command?

ChaoLab commented 6 months ago

Here is the command:

import sys
from pathlib import Path

from genomad import database, mmseqs2, utils
from genomad._paths import GenomadOutputs

def identify_integrases(input_path, output_path, database_path, threads, sensitivity, evalue):
    """
    This function identifies integrases in genomic sequences using the MMseqs2 tool and the geNomad database.

    Args:
    - input_path (Path): Path to the annotated proteins output file.
    - output_path (Path): Path to the directory where the MMseqs2 outputs will be saved.
    - database_path (Path): Path to the geNomad database.
    - threads (int): Number of threads to use for MMseqs2 execution.
    - sensitivity (float): Sensitivity setting for MMseqs2.
    - evalue (float): E-value threshold for MMseqs2.

    Returns:
    None. The function writes outputs to files in the specified directory.
    """

    # Define the outputs object for managing output file paths
    outputs = GenomadOutputs(input_path.stem, output_path)

    # Create a console for logging (always verbose)
    console = utils.HybridConsole(output_file=None, verbose=True)

    # Initialize MMseqs2 with the geNomad database for integrase identification
    mmseqs2_obj = mmseqs2.MMseqs2(outputs.find_proviruses_mmseqs2_output, outputs.find_proviruses_mmseqs2_dir, input_path, database.Database(database_path), use_integrase_db=True)

    # Ensure the MMseqs2 directory and its parents are created
    if not outputs.find_proviruses_mmseqs2_dir.exists():
        outputs.find_proviruses_mmseqs2_dir.mkdir(parents=True, exist_ok=True)

    # Run MMseqs2 for integrase identification
    mmseqs2_obj.run_mmseqs2(threads, sensitivity, evalue, 0)
    console.log(f"Integrases identified and written to {outputs.find_proviruses_mmseqs2_output}")

# Example usage
if __name__ == "__main__":
    # Define the input, output, and database paths
    input_path = Path("genomad_output/GCF_009025895.1_summary/GCF_009025895.1_virus_proteins.faa")
    output_path = Path("genomad_output/integrase_result_dir")
    database_path = Path("/storage1/data11/ViWrap/ViWrap_db/genomad_db")
    threads = 10  # Number of threads for MMseqs2
    sensitivity = 8.2  # Sensitivity for MMseqs2
    evalue = 0.001  # E-value threshold for MMseqs2

    # Call the function to identify integrases
    identify_integrases(input_path, output_path, database_path, threads, sensitivity, evalue)

ChaoLab commented 6 months ago

It seems that it is not related to my script for identifying integrases, since the provirus integrase searching result in GCF_009025895.1_find_proviruses/GCF_009025895.1_provirus_mmseqs2.tsv holds the same results at my first glance

apcamargo commented 6 months ago

I think you are right. Let me investigate this

apcamargo commented 6 months ago

I just fixed this in https://github.com/apcamargo/genomad/commit/698b669e1a3c863096adb27f2a2908d6538140fd. Thanks for reporting the issue!

ChaoLab commented 6 months ago

Hi, Many thanks for your quick response! I re-run the reference with the new find_provirus.py. I found that in the provirus result file GCF_009025895.1_find_proviruses/GCF_009025895.1_provirus.tsv, you have 6 hits.

seq_name        source_seq      start   end     length  n_genes v_vs_c_score    in_seq_edge     integrases
NZ_CP045015.1|provirus_1724523_1762986  NZ_CP045015.1   1724523 1762986 38464   58      77.7034 False   NZ_CP045015.1|provirus_1724523_1762986_1605
NZ_CP045015.1|provirus_2885510_2934610  NZ_CP045015.1   2885510 2934610 49101   69      86.8599 False   NA
NZ_CP045015.1|provirus_3062427_3101502  NZ_CP045015.1   3062427 3101502 39076   42      54.4621 False   NZ_CP045015.1|provirus_3062427_3101502_2980
NZ_CP045015.1|provirus_3855947_3906705  NZ_CP045015.1   3855947 3906705 50759   79      90.2594 False   NZ_CP045015.1|provirus_3855947_3906705_3743
NZ_CP045015.1|provirus_4122492_4133364  NZ_CP045015.1   4122492 4133364 10873   13      12.6591 False   NZ_CP045015.1|provirus_4122492_4133364_3951;NZ_CP045015.1|provirus_4122492_4133364_3955
NZ_CP045017.1|provirus_9572_33781       NZ_CP045017.1   9572    33781   24210   38      8.3280  False   NZ_CP045017.1|provirus_9572_33781_32

But in the final virus summary file GCF_009025895.1_summary/GCF_009025895.1_virus_summary.tsv, you have 5 provirus hits:

seq_name        length  topology        coordinates     n_genes genetic_code    virus_score     fdr     n_hallmarks     marker_enrichment       taxonomy
NZ_CP045015.1|provirus_2885510_2934610  49101   Provirus        2885510-2934610 69      11      0.9776  NA      14      76.0892 Viruses;Duplodnaviria;Heunggongvirae;Uroviricota;Caudoviricetes
NZ_CP045015.1|provirus_3855947_3906705  50759   Provirus        3855947-3906705 79      11      0.9774  NA      16      75.1552 Viruses;Duplodnaviria;Heunggongvirae;Uroviricota;Caudoviricetes
NZ_CP045018.1   51887   No terminal repeats     NA      57      11      0.9774  NA      14      67.7749 Viruses;Duplodnaviria;Heunggongvirae;Uroviricota;Caudoviricetes
NZ_CP045015.1|provirus_1724523_1762986  38464   Provirus        1724523-1762986 58      11      0.9771  NA      17      67.4772 Viruses;Duplodnaviria;Heunggongvirae;Uroviricota;Caudoviricetes
NZ_CP045015.1|provirus_3062427_3101502  39076   Provirus        3062427-3101502 42      11      0.9698  NA      17      46.0772 Viruses;Duplodnaviria;Heunggongvirae;Uroviricota;Caudoviricetes
NZ_CP045015.1|provirus_4122492_4133364  10873   Provirus        4122492-4133364 13      11      0.9657  NA      3       10.2417 Viruses;Duplodnaviria;Heunggongvirae;Uroviricota;Caudoviricetes

Is it because the last provirus was filtered out through your downstream pipeline? Just want to make sure.

apcamargo commented 6 months ago

Yes, that's probably what happened. You can run again with the --relaxed parameter and check if that provirus is included. The --relaxed flag disables a couple of filters.

Which parameters did you use to run geNomad? The list of proviruses I get when running with default parameters is different.

ChaoLab commented 6 months ago

Yes, it should be. That one virus should be filtered out. I did not use any additional parameters in my command line. Mine is also default: genomad end-to-end GCF_009025895.1.fna.gz genomad_output /storage1/data11/ViWrap/ViWrap_db/genomad_db -t 10

apcamargo / genomad

The integrase on prophages and other viruses issue #58