Closed David-Madrigal closed 1 year ago
Hi David, I guess all the errors are connected and have a few ideas on which could be the problem.
FIrst of all, you're possibly be using KoFamKOALA annotation in a format different from the one accepted by KEMET (have a look at the example files for reference), which I believe is the cause of point 1 on your errors list. EDIT: the files you're using are called in two different ways, i.e. "MAG1" and "MAG2,", I assume this is a typo, otherwise that could also point to another error
Second, I would ask if you're using the later versions of KEMET (i.e. you've cloned the repository after the commit cef9acb
performed on July 8th - here) KEGG developers tweaked with the url to access their API. This could have interefered with downloading Bacteroidetes gene sequences and resulting in the empty .msa
files.
Furthermore I'm assuming you've rightfully added a line about which module to check for via HMMs in the module_file.instruction
, otherwise that could also be a source of error.
Related to the third point, not finding any sequences, the nhmmer
command results in an unspecified error, which is something I could look into and handle better (by adding error flags) for further users.
All that being said I would suggest to clone into the latest repository, if not previously done, and starting anew with empty report and HMM folders, as well as using a suited annotation format (have a look at the toy folder). If the problem persist, feel free to send the files via mail and I'll look into it thoroughly!
Best, Matteo
Hi Matteo,
Thank you for your response.
I apologize, the KEGG annotation file I used was called "MAG1". the "2" That was a typo while writing the previous comment.
For the following, I cloned the latest repository.
M00001.kk M00001_Glycolysis (Embden-Meyerhof pathway), glucose => pyruvate
% 0.0 0__10 INCOMPLETE
1. K00844, K12407, K00845, K00886, K08074, K00918
2. K01810, K06859, K13810, K15916
3. K00850, K16370, K00918
4. K01623, K01624, K11645, K16305, K16306
5. K01803
6. K00134, K00150, K11389
7. K00927, K11389
8. K01834, K15633, K15634, K15635
9. K01689
10. K00873, K12406
M00002.kk M00002_Glycolysis, core module involving three-carbon compounds
% 0.0 0__6 INCOMPLETE
1. K01803
2. K00134, K00150, K11389
3. K00927, K11389
4. K01834, K15633, K15634, K15635
...
Alternatively, I used the input to KEGG mapper file and the output is the correct one:
M00001.kk M00001_Glycolysis (Embden-Meyerhof pathway), glucose => pyruvate
% 100.0 10__10 COMPLETE
M00002.kk M00002_Glycolysis, core module involving three-carbon compounds
% 100.0 6__6 COMPLETE
M00003.kk M00003_Gluconeogenesis, oxaloacetate => fructose-6P
% 100.0 8__8 COMPLETE
M00004.kk M00004_Pentose phosphate pathway (Pentose phosphate cycle)
% 62.5 5__8 INCOMPLETE
1. K13937, K00036, K19243
2. K13937, K01057, K07404
3. K00033
M00005.kk M00005_PRPP biosynthesis, ribose 5P => PRPP
% 100.0 1__1 COMPLETE
...
I believe this has something to do with the original detail file in .tsv format (which I got from command-line KofamKOALA). Due to this, I end up using the mapper format instead.
Although everything is running smoothly now, several .msa files are still empty, resulting in non-exsistent .hmm files. I inspected several of the KEGG IDs (e.g., K22896) on the KEGG website and found that no entries were associated to the input phylum (in this case, Bacteroidetes). Hence, it makes sense that the .msa files were empty.
Thank you very much once again!
Best, David
Great! Now that I see it
Since the command you launched did not have the "quiet" argument (-q), the soft errors from MAFFT and HMMER were still printed on screen, as in your point 2! I didn't see it because I'm used to the quiet setting myself..
The genes that are not present even once in the KEGG genes of the phylum of choice result in blank .msa, I get it is somehow redundant but without any priors it was the choice I made to have the full spectrum of genes. KEGG database becomes bigger by the months and gets reorganised in terms of new orthologs being split from general to specifics to some phylogenetic groups.
I'm not sure if the file name ending with an underscore was the real deal, as the key error traceback refers to the names of the contig fasta headers, but anyhow glad it worked!
For the sake of good practice and to avoid other similar issues, could I still ask you the two files, to sort out the initial problem with the annotation file? This will likely help KoFamKOALA users, as that program really speeds up functional annotation matters.
Best, Matteo
Sure!
I'm attaching both original files here. "SB_biofilm_MAG1.txt" is the KEGG annotation file and "SB_biofilm_MAG_1_FASTA.txt" is the .fa file (I changed the name and extension since GitHub does not accept .fa files).
SB_biofilm_MAG1.txt SB_biofilm_MAG_1_FASTA.txt
Best, David
Ok, looking thoroughly into the KEGG annotation file format I got this was a two-sided problem.
First, the file you shared encodes an erroneous information in line#2 with respect to how that is handled in KEMET
i.e. dashes do not represent how following lines fields are structured, and kemet.py
uses that as info to parse the annotation and recover KOs.
Second, the code handles wrongly the spacings, so I'm patching that asap. And thanks a lot for this "test", I only used the two Kofam files obtained from the web-server so I was missing this use case!
Just the last question before I'm resolving this, which --format
option did you use for kofam_scan? Secondly, have you edited in any way the file obtained from the command-line to obtain the one you shared?
Best, Matteo
Yes! That file was edited, since I got .tsv format from KofamKOALA (command-line), and I was trying to adjust to the .txt format by making a whitespace substitution to the tab values.
I'm sharing with you the original .tsv file. This one was not edited.
I'll close this issue after the fix performed on July.
If you'd need some other help please reopen. Best, Matteo
Hi,
Thank you for developing KEMET. I'm very interested in making use of the three main modules included in this package. Nonetheless, I'm facing a couple of issues and I kindly request some assistance.
For testing purposes, I'm currently working with a high-quality MAG with filename
KEMET/genomes/SB_biofilm_MAG_1_.fa
. KEGG annotations were performed with KofamKOALA, and were included as a tsv file (KEMET/KEGG_annotations/SB_biofilm_MAG_2_.tsv
). I'm running the--hmm_mode modules
option with "M00001" as the only input for the "module_file.instruction" file . The "genomes.instruction" file contains the following (tab separated):Currently I'm running the following command in the KEMET directory:
./kemet.py genomes/SB_biofilm_MAG_1_.fa -a kofamkoala --log --hmm_mode modules --skip_gsmm
Issues:
--hmm_mode modules
option with "M00001". While running the HMM search function, the following is printed:I looked for the *.msa files at their respective directories, and it seems that the files are blank. Consequently, the following is printed on screen:
If necessary, I would gladly share via e-mail the original nucleotide fasta and KEGG annotations files.
Thank you so much in advance.
Best,
David