DessimozLab / OMArk

GNU Lesser General Public License v3.0
53 stars 6 forks source link

Problem during writing step of OMArK #33

Open Enorya opened 1 month ago

Enorya commented 1 month ago

Dear,

I'm trying to use OMArK to assess the quality of a genome annotation and to be able to compare multiple ones later. However, whenever I try to run OMArK with the command omark -f killi_lifton_annot_correct3.omamer -d LUCA.h5 -o killi_lifton -t 105023 -of ../GCF_001465895.1_Nfu_20140520_genomic.fna -v on my dataset I get the following error:

INFO: Starting OMArk
INFO: Input parameters passed validity check
INFO: Extracting data from input file: killi_lifton_annot_correct3.omamer
INFO: Determinating species composition from HOG placements
INFO: A taxid was provided. The query taxon is Nothobranchius furzeri
INFO: Ancestral lineage is Cyprinodontiformes
INFO: Estimating ancestral and conserved HOG content
INFO: 29752 HOGs are associated to the query's lineage and will be used for consistency assesment
INFO: 16815 conserved ancestral HOGs will be used from completeness assesment
INFO: Comparing the query gene repertoire to lineage-associated HOGs
INFO: Comparing the query gene repertoire to conserved ancestral HOGs
INFO: Writing OMArk output files
Traceback (most recent call last):
  File "/data/leuven/340/vsc34088/miniconda3/envs/omark/bin/omark", line 51, in <module>
    omark.launcher(arg)
  File "/data/leuven/340/vsc34088/miniconda3/envs/omark/lib/python3.12/site-packages/omark/omark.py", line 265, in launcher
    get_omamer_qscore(omamerfile, dbpath, outdir, taxid, original_FASTA_file = original_fasta, isoform_file=isoform_file, taxonomic_rank=taxonomic_rank)
  File "/data/leuven/340/vsc34088/miniconda3/envs/omark/lib/python3.12/site-packages/omark/omark.py", line 149, in get_omamer_qscore
    io.store_contaminant_FASTA(stordir, basefile, prot_clade, original_FASTA_file)
  File "/data/leuven/340/vsc34088/miniconda3/envs/omark/lib/python3.12/site-packages/omark/files.py", line 307, in store_contaminant_FASTA
    seq = seqs_by_id[prot_data[1]]
          ~~~~~~~~~~^^^^^^^^^^^^^^
KeyError: 'XM_054735811'

I tried to change the headers of my protein sequences in the omamer file by removing special characters or removing parts of the headers multiple times but I always get the same issue. However, when I remove the specific sequence mentioned in the error the writing continue a little bit and then stop to other sequences. The only link I found between these problematic sequences is the hoglevel (Nothobranchius furzeri in this case) but I don't understand why it's causing an issue. Here is a little part of my omamer file in which sequences XM_054735811 and XM_054735812 are problematic:

XM_015944585    HOG:D0639138.2m.4a      Neopterygii     2399.996013387185       415     0.9975890680094172      0.9941509286027648      170     422     422     0.997624703087886
XM_054739016    HOG:D0577675.4c Atherinomorphae 967.1020560800204       147     0.5503948086850903      0.5061727553615216      123     279     267     0.6330935251798561
XM_054735811    HOG:D0663830.1a.2b.3c.5a        Nothobranchius furzeri  772.5277723683847       109     0.7170118081595099      0.6496349414709253      89      158     339     0.8853503184713376
XM_054735812    HOG:D0663830.1a.2b.3c.5a        Nothobranchius furzeri  772.5277723683847       109     0.7170118081595099      0.6496349414709253      89      158     339     0.8853503184713376
XM_015944606    HOG:D0579197    Percomorphaceae 6395.46596045576        835     0.9988032766690962      0.9988009040872586      835     842     803     0.9988109393579072
XM_015944625    HOG:D0680509.8b.6b.13a.27b.26a.6a       Euteleosteomorpha       2497.435360199245       553     0.9981756987901452      0.9874996605198554      79      560     560     0.998211091234347
XM_015944629    HOG:D0680509.8b.6b.13a.27b.26a.6a       Euteleosteomorpha       2497.435360199245       553     0.9981756987901452      0.9874996605198554      79      560     560     0.998211091234347
XM_054739017    HOG:D0680509.8b.6b.13a.27b.26a.6a       Euteleosteomorpha       2497.435360199245       553     0.9981756987901452      0.9874996605198554      79      560     560     0.998211091234347
XM_015944642    HOG:D0651108.2d.9b      Neopterygii     4197.127911957727       1119    0.5871173087195832      0.4508677375656414      647     2029    1139    0.997534516765286
XM_015944653    HOG:D0651108.2d.9b      Neopterygii     4153.4110205187735      1109    0.5849281109927518      0.4466258709094147      636     2019    1139    0.9975222993062438

Do you have any idea why I'm experiencing this issue?

Thank you in advance for your help, Enora

YanNevers commented 1 month ago

Dear Enora,

Thank you for using OMArk and reporting this issue.

From a cursory look at this issue, it seems like this may be due to the protein identifiers in the FASTA file provided through the '-of' option (GCF_001465895.1_Nfu_20140520_genomic.fna) and the one in the OMAmer output being different. Is this FASTA file the original proteome FASTA file that was first provided to OMAmer? From the file name, it looks like it may be instead a FASTA file of the genomic sequence. If it is the case, please try changing this file to the proteome FASTA file that was used as input to OMAmer.

Let me know if I am mistaken or if changing this file do not solve the issue. If it does not, could you write down a few examples of the headers from this file?

If it does, I'll keep this issue open as an incentive to work on better parameter checking.

Best wishes, Yannis

Enorya commented 1 month ago

Dear Yannis,

Thank you for your quick reply. I indeed used the FASTA file of the genomic sequence and when I changed this file to the proteome FASTA file the OMArK tool worked.

Best regards, Enora