Kofamscan format problem

David-Madrigal commented 1 year ago

Hi,

Thank you for developing KEMET. I'm very interested in making use of the three main modules included in this package. Nonetheless, I'm facing a couple of issues and I kindly request some assistance.

For testing purposes, I'm currently working with a high-quality MAG with filename KEMET/genomes/SB_biofilm_MAG_1_.fa. KEGG annotations were performed with KofamKOALA, and were included as a tsv file (KEMET/KEGG_annotations/SB_biofilm_MAG_2_.tsv). I'm running the --hmm_mode modules option with "M00001" as the only input for the "module_file.instruction" file . The "genomes.instruction" file contains the following (tab separated):

id      taxonomy        universe
SB_biofilm_MAG_1_.fa    Bacteroidetes   gramneg

Currently I'm running the following command in the KEMET directory: ./kemet.py genomes/SB_biofilm_MAG_1_.fa -a kofamkoala --log --hmm_mode modules --skip_gsmm

Issues:

For the KEGG modules completeness evaluation, I'm getting unexpected results compared to the output from KEGG mapper. While in KEGG mapper I'm having multiple complete modules (e.g., M00001), both outputs from KEMET (.tsv and .txt) display that every module is incomplete (with 0% completeness). Here is an example of how the output .txt file looks:

M00001.kk       M00001_Glycolysis (Embden-Meyerhof pathway), glucose => pyruvate
%       0.0     0__10   INCOMPLETE
1.      K00844, K12407, K00845, K00886, K08074, K00918
2.      K01810, K06859, K13810, K15916
3.      K00850, K16370, K00918
4.      K01623, K01624, K11645, K16305, K16306
5.      K01803
6.      K00134, K00150, K11389
7.      K00927, K11389
8.      K01834, K15633, K15634, K15635
9.      K01689
10.     K00873, K12406

M00002.kk       M00002_Glycolysis, core module involving three-carbon compounds
%       0.0     0__6    INCOMPLETE
1.      K01803
2.      K00134, K00150, K11389
3.      K00927, K11389
4.      K01834, K15633, K15634, K15635
...

Despite the previous issue, I tried to run the --hmm_mode modules option with "M00001". While running the HMM search function, the following is printed:

Alignment input open failed.
   couldn't guess alphabet (maybe try --dna/--rna/--amino if available)
   while reading file K12406.msa
   while parsing for aligned FASTA format
Alignment input open failed.
   couldn't guess alphabet (maybe try --dna/--rna/--amino if available)
   while reading file K01624.msa
   while parsing for aligned FASTA format
...

I looked for the *.msa files at their respective directories, and it seems that the files are blank. Consequently, the following is printed on screen:

Error: File existence/permissions problem in trying to open query file K12406.h$
HMM file K12406.hmm not found (nor an .h3m binary of it)

Error: File existence/permissions problem in trying to open query file K01624.h$
HMM file K01624.hmm not found (nor an .h3m binary of it)
...

After the completion of nhmer significant hints, the following traceback is printed on screen:

Traceback (most recent call last):
  File "./kemet.py", line 2536, in <module>
    HMM_hits_longestTRANSLATED_dict = HMM_hits_longest_translated_sequences(HMM$
  File "./kemet.py", line 1290, in HMM_hits_longest_translated_sequences
    max_len_dict[fasta_nf].append(seq_max) # add the longest to list
KeyError: '>SB_biofilm_MAG_1'

If necessary, I would gladly share via e-mail the original nucleotide fasta and KEGG annotations files.

Thank you so much in advance.

Best,

David

Matteopaluh commented 1 year ago

Hi David, I guess all the errors are connected and have a few ideas on which could be the problem.

FIrst of all, you're possibly be using KoFamKOALA annotation in a format different from the one accepted by KEMET (have a look at the example files for reference), which I believe is the cause of point 1 on your errors list. EDIT: the files you're using are called in two different ways, i.e. "MAG1" and "MAG2,", I assume this is a typo, otherwise that could also point to another error

Second, I would ask if you're using the later versions of KEMET (i.e. you've cloned the repository after the commit cef9acb performed on July 8th - here) KEGG developers tweaked with the url to access their API. This could have interefered with downloading Bacteroidetes gene sequences and resulting in the empty .msa files. Furthermore I'm assuming you've rightfully added a line about which module to check for via HMMs in the module_file.instruction, otherwise that could also be a source of error. Related to the third point, not finding any sequences, the nhmmer command results in an unspecified error, which is something I could look into and handle better (by adding error flags) for further users.

All that being said I would suggest to clone into the latest repository, if not previously done, and starting anew with empty report and HMM folders, as well as using a suited annotation format (have a look at the toy folder). If the problem persist, feel free to send the files via mail and I'll look into it thoroughly!

Best, Matteo

David-Madrigal commented 1 year ago

Hi Matteo,

Thank you for your response.

I apologize, the KEGG annotation file I used was called "MAG1". the "2" That was a typo while writing the previous comment.

For the following, I cloned the latest repository.

I changed the KEGG annotation file extension from .tsv to .txt, as suggested. Additionally, I replaced tab characters for whitespace characters, just in case. Yet, I'm getting the same output:

M00001.kk       M00001_Glycolysis (Embden-Meyerhof pathway), glucose => pyruvate
%       0.0     0__10   INCOMPLETE
1.      K00844, K12407, K00845, K00886, K08074, K00918
2.      K01810, K06859, K13810, K15916
3.      K00850, K16370, K00918
4.      K01623, K01624, K11645, K16305, K16306
5.      K01803
6.      K00134, K00150, K11389
7.      K00927, K11389
8.      K01834, K15633, K15634, K15635
9.      K01689
10.     K00873, K12406

M00002.kk       M00002_Glycolysis, core module involving three-carbon compounds
%       0.0     0__6    INCOMPLETE
1.      K01803
2.      K00134, K00150, K11389
3.      K00927, K11389
4.      K01834, K15633, K15634, K15635
...

Alternatively, I used the input to KEGG mapper file and the output is the correct one:

M00001.kk       M00001_Glycolysis (Embden-Meyerhof pathway), glucose => pyruvate
%       100.0   10__10  COMPLETE

M00002.kk       M00002_Glycolysis, core module involving three-carbon compounds
%       100.0   6__6    COMPLETE

M00003.kk       M00003_Gluconeogenesis, oxaloacetate => fructose-6P
%       100.0   8__8    COMPLETE

M00004.kk       M00004_Pentose phosphate pathway (Pentose phosphate cycle)
%       62.5    5__8    INCOMPLETE
1.      K13937, K00036, K19243
2.      K13937, K01057, K07404
3.      K00033

M00005.kk       M00005_PRPP biosynthesis, ribose 5P => PRPP
%       100.0   1__1    COMPLETE
...

I believe this has something to do with the original detail file in .tsv format (which I got from command-line KofamKOALA). Due to this, I end up using the mapper format instead.

For the HMM search module, I kept having the same traceback error as before. I realized that the traceback error shows " KeyError: '>SB_biofilm_MAG_1' ", but my files were named as "SB_biofilm_MAG1.fa" and "SB_biofilm_MAG1.txt" respectively (i.e., with an additional underscore). Changing the file names to "SB_biofilm_MAG01" did the trick. Maybe that ending underscore has something to do with it.

Although everything is running smoothly now, several .msa files are still empty, resulting in non-exsistent .hmm files. I inspected several of the KEGG IDs (e.g., K22896) on the KEGG website and found that no entries were associated to the input phylum (in this case, Bacteroidetes). Hence, it makes sense that the .msa files were empty.

Thank you very much once again!

Best, David

Matteopaluh commented 1 year ago

Great! Now that I see it

Since the command you launched did not have the "quiet" argument (-q), the soft errors from MAFFT and HMMER were still printed on screen, as in your point 2! I didn't see it because I'm used to the quiet setting myself..

The genes that are not present even once in the KEGG genes of the phylum of choice result in blank .msa, I get it is somehow redundant but without any priors it was the choice I made to have the full spectrum of genes. KEGG database becomes bigger by the months and gets reorganised in terms of new orthologs being split from general to specifics to some phylogenetic groups.

I'm not sure if the file name ending with an underscore was the real deal, as the key error traceback refers to the names of the contig fasta headers, but anyhow glad it worked!

For the sake of good practice and to avoid other similar issues, could I still ask you the two files, to sort out the initial problem with the annotation file? This will likely help KoFamKOALA users, as that program really speeds up functional annotation matters.

Best, Matteo

David-Madrigal commented 1 year ago

Sure!

I'm attaching both original files here. "SB_biofilm_MAG1.txt" is the KEGG annotation file and "SB_biofilm_MAG_1_FASTA.txt" is the .fa file (I changed the name and extension since GitHub does not accept .fa files).

SB_biofilm_MAG1.txt SB_biofilm_MAG_1_FASTA.txt

Best, David

Matteopaluh commented 1 year ago

Ok, looking thoroughly into the KEGG annotation file format I got this was a two-sided problem.

First, the file you shared encodes an erroneous information in line#2 with respect to how that is handled in KEMET i.e. dashes do not represent how following lines fields are structured, and kemet.py uses that as info to parse the annotation and recover KOs.

Second, the code handles wrongly the spacings, so I'm patching that asap. And thanks a lot for this "test", I only used the two Kofam files obtained from the web-server so I was missing this use case!

Just the last question before I'm resolving this, which --format option did you use for kofam_scan? Secondly, have you edited in any way the file obtained from the command-line to obtain the one you shared?

Best, Matteo

David-Madrigal commented 1 year ago

Yes! That file was edited, since I got .tsv format from KofamKOALA (command-line), and I was trying to adjust to the .txt format by making a whitespace substitution to the tab values.

I'm sharing with you the original .tsv file. This one was not edited.

SB_biofilm_MAG_1_TSV.txt

Matteopaluh commented 1 year ago

I'll close this issue after the fix performed on July.

If you'd need some other help please reopen. Best, Matteo

Matteopaluh / KEMET

Kofamscan format problem #10