Closed ChaoLab closed 6 months ago
This is weird indeed. How did you generate these results for the integrases? Can you share the command?
Here is the command:
import sys
from pathlib import Path

from genomad import database, mmseqs2, utils
from genomad._paths import GenomadOutputs


def identify_integrases(input_path, output_path, database_path, threads, sensitivity, evalue):
    """
    This function identifies integrases in genomic sequences using the MMseqs2 tool and the geNomad database.

    Args:
    - input_path (Path): Path to the annotated proteins output file.
    - output_path (Path): Path to the directory where the MMseqs2 outputs will be saved.
    - database_path (Path): Path to the geNomad database.
    - threads (int): Number of threads to use for MMseqs2 execution.
    - sensitivity (float): Sensitivity setting for MMseqs2.
    - evalue (float): E-value threshold for MMseqs2.

    Returns:
    None. The function writes outputs to files in the specified directory.
    """
    # Define the outputs object for managing output file paths
    outputs = GenomadOutputs(input_path.stem, output_path)
    # Create a console for logging (always verbose)
    console = utils.HybridConsole(output_file=None, verbose=True)
    # Initialize MMseqs2 with the geNomad database for integrase identification
    mmseqs2_obj = mmseqs2.MMseqs2(
        outputs.find_proviruses_mmseqs2_output,
        outputs.find_proviruses_mmseqs2_dir,
        input_path,
        database.Database(database_path),
        use_integrase_db=True,
    )
    # Ensure the MMseqs2 directory and its parents are created
    if not outputs.find_proviruses_mmseqs2_dir.exists():
        outputs.find_proviruses_mmseqs2_dir.mkdir(parents=True, exist_ok=True)
    # Run MMseqs2 for integrase identification
    mmseqs2_obj.run_mmseqs2(threads, sensitivity, evalue, 0)
    console.log(f"Integrases identified and written to {outputs.find_proviruses_mmseqs2_output}")


# Example usage
if __name__ == "__main__":
    # Define the input, output, and database paths
    input_path = Path("genomad_output/GCF_009025895.1_summary/GCF_009025895.1_virus_proteins.faa")
    output_path = Path("genomad_output/integrase_result_dir")
    database_path = Path("/storage1/data11/ViWrap/ViWrap_db/genomad_db")
    threads = 10  # Number of threads for MMseqs2
    sensitivity = 8.2  # Sensitivity for MMseqs2
    evalue = 0.001  # E-value threshold for MMseqs2
    # Call the function to identify integrases
    identify_integrases(input_path, output_path, database_path, threads, sensitivity, evalue)
It seems this is not related to my script for identifying integrases, since the provirus integrase search result in GCF_009025895.1_find_proviruses/GCF_009025895.1_provirus_mmseqs2.tsv holds the same results at first glance.
I think you are right. Let me investigate this
I just fixed this in https://github.com/apcamargo/genomad/commit/698b669e1a3c863096adb27f2a2908d6538140fd. Thanks for reporting the issue!
Hi,
Many thanks for your quick response!
I re-ran the reference with the new find_provirus.py.
I found that the provirus result file GCF_009025895.1_find_proviruses/GCF_009025895.1_provirus.tsv has 6 hits:
seq_name source_seq start end length n_genes v_vs_c_score in_seq_edge integrases
NZ_CP045015.1|provirus_1724523_1762986 NZ_CP045015.1 1724523 1762986 38464 58 77.7034 False NZ_CP045015.1|provirus_1724523_1762986_1605
NZ_CP045015.1|provirus_2885510_2934610 NZ_CP045015.1 2885510 2934610 49101 69 86.8599 False NA
NZ_CP045015.1|provirus_3062427_3101502 NZ_CP045015.1 3062427 3101502 39076 42 54.4621 False NZ_CP045015.1|provirus_3062427_3101502_2980
NZ_CP045015.1|provirus_3855947_3906705 NZ_CP045015.1 3855947 3906705 50759 79 90.2594 False NZ_CP045015.1|provirus_3855947_3906705_3743
NZ_CP045015.1|provirus_4122492_4133364 NZ_CP045015.1 4122492 4133364 10873 13 12.6591 False NZ_CP045015.1|provirus_4122492_4133364_3951;NZ_CP045015.1|provirus_4122492_4133364_3955
NZ_CP045017.1|provirus_9572_33781 NZ_CP045017.1 9572 33781 24210 38 8.3280 False NZ_CP045017.1|provirus_9572_33781_32
But the final virus summary file GCF_009025895.1_summary/GCF_009025895.1_virus_summary.tsv has 5 provirus hits:
seq_name length topology coordinates n_genes genetic_code virus_score fdr n_hallmarks marker_enrichment taxonomy
NZ_CP045015.1|provirus_2885510_2934610 49101 Provirus 2885510-2934610 69 11 0.9776 NA 14 76.0892 Viruses;Duplodnaviria;Heunggongvirae;Uroviricota;Caudoviricetes
NZ_CP045015.1|provirus_3855947_3906705 50759 Provirus 3855947-3906705 79 11 0.9774 NA 16 75.1552 Viruses;Duplodnaviria;Heunggongvirae;Uroviricota;Caudoviricetes
NZ_CP045018.1 51887 No terminal repeats NA 57 11 0.9774 NA 14 67.7749 Viruses;Duplodnaviria;Heunggongvirae;Uroviricota;Caudoviricetes
NZ_CP045015.1|provirus_1724523_1762986 38464 Provirus 1724523-1762986 58 11 0.9771 NA 17 67.4772 Viruses;Duplodnaviria;Heunggongvirae;Uroviricota;Caudoviricetes
NZ_CP045015.1|provirus_3062427_3101502 39076 Provirus 3062427-3101502 42 11 0.9698 NA 17 46.0772 Viruses;Duplodnaviria;Heunggongvirae;Uroviricota;Caudoviricetes
NZ_CP045015.1|provirus_4122492_4133364 10873 Provirus 4122492-4133364 13 11 0.9657 NA 3 10.2417 Viruses;Duplodnaviria;Heunggongvirae;Uroviricota;Caudoviricetes
Is it because the last provirus was filtered out through your downstream pipeline? Just want to make sure.
Yes, that's probably what happened. You can run again with the --relaxed parameter and check if that provirus is included. The --relaxed flag disables a couple of filters.
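A quick way to see which provirus gets dropped between the two files is to compare their seq_name columns. The snippet below is only a sketch: it assumes both files are tab-separated with a header row, and the helper names (read_seq_names, filtered_proviruses) are made up for illustration and are not part of geNomad. The sample data contains only the seq_name values copied from the two tables in this thread.

```python
import csv
import io


def read_seq_names(handle):
    """Return the set of seq_name values from a tab-separated table."""
    return {row["seq_name"] for row in csv.DictReader(handle, delimiter="\t")}


def filtered_proviruses(provirus_handle, summary_handle):
    """Return seq_names present in the provirus table but absent from the summary."""
    return sorted(read_seq_names(provirus_handle) - read_seq_names(summary_handle))


# seq_name values copied from the tables in this thread. The other columns are
# left out for brevity; extra columns in the real files do not affect the result.
PROVIRUS_TSV = "\n".join([
    "seq_name",
    "NZ_CP045015.1|provirus_1724523_1762986",
    "NZ_CP045015.1|provirus_2885510_2934610",
    "NZ_CP045015.1|provirus_3062427_3101502",
    "NZ_CP045015.1|provirus_3855947_3906705",
    "NZ_CP045015.1|provirus_4122492_4133364",
    "NZ_CP045017.1|provirus_9572_33781",
]) + "\n"

SUMMARY_TSV = "\n".join([
    "seq_name",
    "NZ_CP045015.1|provirus_2885510_2934610",
    "NZ_CP045015.1|provirus_3855947_3906705",
    "NZ_CP045018.1",
    "NZ_CP045015.1|provirus_1724523_1762986",
    "NZ_CP045015.1|provirus_3062427_3101502",
    "NZ_CP045015.1|provirus_4122492_4133364",
]) + "\n"

missing = filtered_proviruses(io.StringIO(PROVIRUS_TSV), io.StringIO(SUMMARY_TSV))
print(missing)  # ['NZ_CP045017.1|provirus_9572_33781']
```

With real output files, the same function can be called with handles from open(path, newline="") for GCF_009025895.1_provirus.tsv and GCF_009025895.1_virus_summary.tsv.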
Which parameters did you use to run geNomad? The list of proviruses I get when running with default parameters is different.
Yes, it should be. That one virus should be filtered out.
I did not use any additional parameters in my command line. Mine is also default:
genomad end-to-end GCF_009025895.1.fna.gz genomad_output /storage1/data11/ViWrap/ViWrap_db/genomad_db -t 10
Hi, I have run GCF_009025895.1.fna as the input FASTA file. When I looked at GCF_009025895.1_find_proviruses/GCF_009025895.1_provirus.tsv, I guessed that the last column shows the integrase hits, but the values are NA. Are they correct? I do find integrase results by aligning to the integrase db. Is it a mistake, or am I misunderstanding it?
My initial idea is to summarize all the integrases on prophages and other viruses, so I searched for integrases in GCF_009025895.1_summary/GCF_009025895.1_virus_proteins.faa.
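For the proviruses, a summary like this can also be read from the integrases column of the provirus table. The sketch below rests on assumptions taken from the table shown in this thread: the file is tab-separated, NA means no integrase, and multiple integrases are separated by a semicolon. The function name integrases_per_provirus is hypothetical, not a geNomad API, and the sample rows keep only two of the columns.

```python
import csv
import io


def integrases_per_provirus(handle):
    """Map each provirus seq_name to the list of integrase gene IDs in its row."""
    result = {}
    for row in csv.DictReader(handle, delimiter="\t"):
        field = row["integrases"]
        # "NA" marks a provirus without any integrase hit (assumed from the thread).
        result[row["seq_name"]] = [] if field == "NA" else field.split(";")
    return result


# Three rows copied from the provirus table in this thread (two columns only).
SAMPLE_TSV = "\n".join([
    "seq_name\tintegrases",
    "NZ_CP045015.1|provirus_2885510_2934610\tNA",
    "NZ_CP045015.1|provirus_4122492_4133364\t"
    "NZ_CP045015.1|provirus_4122492_4133364_3951;"
    "NZ_CP045015.1|provirus_4122492_4133364_3955",
    "NZ_CP045017.1|provirus_9572_33781\tNZ_CP045017.1|provirus_9572_33781_32",
]) + "\n"

hits = integrases_per_provirus(io.StringIO(SAMPLE_TSV))
for seq_name, genes in hits.items():
    print(seq_name, len(genes))
```

This only covers proviruses reported in the provirus table; integrases on other viral sequences would still need a separate search such as the one described above.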