Open duqiyao opened 1 year ago
Hello,
The error message indicates that MeTAfisher is not able to map the gene IDs from the hmmsearch results (originally sourced from the faa file) to the CDS entries of the gff file.
MeTAfisher expects to find the gene ID either in the attribute protein_id=
of the CDS line or, if that attribute is absent, it uses the ID found in the ID=
attribute of the line. Each of these IDs should correspond to a protein sequence present in the faa file.
To diagnose this issue, can you verify whether the protein_id
or ID
attributes in your GFF file match the sequence IDs in your FAA file?
For instance, if you have the following sequence in the faa file:
>WP_003806928.1
MKVMASVKRICRNCKVIKRHGVVRVICTDPRHKQRQG
It should have a corresponding line in the gff file with the same protein_id=WP_003806903.1
:
NC_010645.1 Protein Homology CDS 31171 31482 . + 0 gene=rpsJ;protein_id=WP_003806903.1;transl_table=11
or ID=WP_003806903.1
NC_010645.1 Protein Homology CDS 31171 31482 . + 0 ID=WP_003806903.1;Parent=gene-BAV_RS00140;Dbxref=Genbank:WP_003806903.1,GeneID:41391953
Best regards,
Jean
Thank you for your quick reply, I used the protein prediction made by prodigal to get the.FAA and.GFF at the same time.
Unfortunately, the gff and faa files generated by Prodigal do not seem to be compatible with metafisher's expectations. This is something that could be enhanced in the code for better compatibility.
As a quick fix, you could process the gff file to include the "protein_id" attribute that metafisher requires. To accomplish this, you can use an AWK command:
awk 'BEGIN {OFS="\t"} $1 != prev_first {counter = 1; prev_first = $1} $1 !~ /^#/ {print $0"protein_id="$1"_"counter++";"}' your_initial.gff > your_new.gff
This appends a new "protein_id" attribute at the end of each line, composed of the contig's name and a counter.
After applying this command, you can try to run metafisher using the newly generated gff file.
Best, Jean
Hello, teacher, I'm sorry to bother you again. After I modified awk according to you, there are still problems in operation. I would like to ask what software you used to make protein prediction.
cat nohup.out
INFO: Mode verbose ON
INFO: command line: metafisher/metafisher.py --gff data_test/GCF_000070465.1/vOTU-gens_new.gff.gz --faa data_test/GCF_000070465.1/vOTU-protein.faa.gz --outdir metafisher_results-1 --diamond_db TA_data/type_II_TA.dmnd -v
INFO: Running hmmsearch to identify TA genes.
INFO: Protein sequence file data_test/GCF_000070465.1/vOTU-protein.faa.gz ends with gz. It will be unzipped first to be used in hmmsearch.
INFO: zcat command: zcat data_test/GCF_000070465.1/vOTU-protein.faa.gz > metafisher_results-1/vOTU-protein.faa
INFO: hmmsearch command: hmmsearch -E 0.0005 --domtblout metafisher_results-1/vOTU-protein.faa.hmmsearch /data/home-lusy/software/MeTAfisher/TADB_stat/TA_domains.hmm metafisher_results-1/vOTU-protein.faa > metafisher_results-1/vOTU-protein.faa.hmmsearch.LOG
INFO: diamond command: diamond blastp -q data_test/GCF_000070465.1/vOTU-protein.faa.gz -d TA_data/type_II_TA.dmnd -o metafisher_results-1/vOTU-protein.faa.diamond --outfmt 6 qseqid sseqid qstart qend sstart send pident qcovhsp evalue bitscore stitle > /dev/null
diamond v2.1.8.162 (C) Max Planck Society for the Advancement of Science, Benjamin Buchfink, University of Tuebingen
Documentation, support and updates available at http://www.diamondsearch.org
Please cite: http://dx.doi.org/10.1038/s41592-021-01101-x Nature Methods (2021)
#CPU threads: 64
Scoring parameters: (Matrix=BLOSUM62 Lambda=0.267 K=0.041 Penalties=11/1)
Temporary directory: metafisher_results-1
#Target sequences to report alignments for: 25
Opening the database... [0.003s]
Database: TA_data/type_II_TA.dmnd (type: Diamond database, sequences: 12358, letters: 1525882)
Block size = 2000000000
Opening the input file... [0.409s]
Opening the output file... [0s]
Loading query sequences... [7.874s]
Masking queries... [1.118s]
Algorithm: Double-indexed
Building query histograms... [0.741s]
Seeking in database... [0s]
Loading reference sequences... [0.005s]
Masking reference... [0.009s]
Initializing temporary storage... [0s]
Building reference histograms... [0.008s]
Allocating buffers... [0s]
Processing query block 1, reference block 1/1, shape 1/2, index chunk 1/4.
Building reference seed array... [0.009s]
Building query seed array... [0.27s]
Computing hash join... [0.111s]
Masking low complexity seeds... [0.002s]
Searching alignments... [0.004s]
Deallocating memory... [0s]
Processing query block 1, reference block 1/1, shape 1/2, index chunk 2/4.
Building reference seed array... [0.006s]
Building query seed array... [0.323s]
Computing hash join... [0.125s]
Masking low complexity seeds... [0.004s]
Searching alignments... [0.011s]
Deallocating memory... [0s]
Processing query block 1, reference block 1/1, shape 1/2, index chunk 3/4.
Building reference seed array... [0.017s]
Building query seed array... [0.358s]
Computing hash join... [0.102s]
Masking low complexity seeds... [0.002s]
Searching alignments... [0.018s]
Deallocating memory... [0s]
Processing query block 1, reference block 1/1, shape 1/2, index chunk 4/4.
Building reference seed array... [0.006s]
Building query seed array... [0.264s]
Computing hash join... [0.085s]
Masking low complexity seeds... [0.003s]
Searching alignments... [0.006s]
Deallocating memory... [0s]
Processing query block 1, reference block 1/1, shape 2/2, index chunk 1/4.
Building reference seed array... [0.008s]
Building query seed array... [0.285s]
Computing hash join... [0.113s]
Masking low complexity seeds... [0.004s]
Searching alignments... [0.011s]
Deallocating memory... [0s]
Processing query block 1, reference block 1/1, shape 2/2, index chunk 2/4.
Building reference seed array... [0.012s]
Building query seed array... [0.357s]
Computing hash join... [0.081s]
Masking low complexity seeds... [0.002s]
Searching alignments... [0.004s]
Deallocating memory... [0s]
Processing query block 1, reference block 1/1, shape 2/2, index chunk 3/4.
Building reference seed array... [0.01s]
Building query seed array... [0.348s]
Computing hash join... [0.093s]
Masking low complexity seeds... [0.003s]
Searching alignments... [0.007s]
Deallocating memory... [0s]
Processing query block 1, reference block 1/1, shape 2/2, index chunk 4/4.
Building reference seed array... [0.008s]
Building query seed array... [0.265s]
Computing hash join... [0.08s]
Masking low complexity seeds... [0.004s]
Searching alignments... [0.006s]
Deallocating memory... [0s]
Deallocating buffers... [0.048s]
Clearing query masking... [0.128s]
Computing alignments... Loading trace points... [0.004s]
Sorting trace points... [0.005s]
Computing alignments... [3.822s]
Deallocating buffers... [0s]
Loading trace points... [0s]
[3.833s]
Deallocating reference... [0s]
Loading reference sequences... [0s]
Deallocating buffers... [0s]
Deallocating queries... [0.032s]
Loading query sequences... [0s]
Closing the input file... [0s]
Closing the output file... [0.004s]
Closing the database... [0s]
Cleaning up... [0s]
Total time = 17.673s
Reported 30902 pairwise alignments, 30902 HSPs.
7168 queries aligned.
Traceback (most recent call last):
File "/data/home-lusy/software/MeTAfisher/metafisher/metafisher.py", line 309, in <module>
main()
File "/data/home-lusy/software/MeTAfisher/metafisher/metafisher.py", line 290, in main
out.write_result(genes, dict_output, contig)
File "/data/home-lusy/software/MeTAfisher/metafisher/OutputFct_MetaF.py", line 175, in write_result
write_table_pairs_result(gene, g_post, dict_output['result_TA_pairs'])
File "/data/home-lusy/software/MeTAfisher/metafisher/OutputFct_MetaF.py", line 225, in write_table_pairs_result
best_shared_families = get_best_shared_families(gene_prev, gene_post)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/data/home-lusy/software/MeTAfisher/metafisher/OutputFct_MetaF.py", line 266, in get_best_shared_families
best_sum_score = max(shared_families2sum_score.values())
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ValueError: max() arg is an empty sequence
Hello, teacher! I followed your github operation steps under the installation, I used prodigal to predict the virus protein .faa and.gff files. Then run the code as follows: nohup ./metafisher/metafisher.py --gff data_test/GCF_000070465.1/vOTU-gene.gff.gz \ --faa data_test/GCF_000070465.1/vOTU-protein.faa.gz\ --outdir metafisher_results1 \ --diamond_db TA_data/type_II_TA.dmnd -v &
The error is as follows:
ask for your help