Matteopaluh / KEMET

KEGG Module Evaluation Tool

Equivocal README & file-naming problems #13

Closed ttubb closed 1 year ago

ttubb commented 1 year ago

Hello and thanks for creating this software,

I have gene-to-ko annotations for all my MAGs. I would like to use KEMET to calculate the completeness of KEGG modules for these MAGs.

Unfortunately, I have not yet managed to do so. I think the instructions in README.md are not up to date. The file setup.py is mentioned in multiple places but seems to be missing from the repository. It is also unclear to me why I cannot run the tool without providing a FASTA file when I'm using --skip_hmm and --skip_gsmm. The help text references the genomes.instruction file in this context, but that file is not part of the repository either.

I'm also not sure if I am providing KO annotations in the right format. For each MAG, I created a tab-separated file with gene identifiers in the first column and KOs (e.g. K24042) in the second column. They are named bin1_ko.txt, bin2_ko.txt etc. If one gene has multiple KO annotations, the file contains one row for each of those annotations. Is this approach correct? What would I put for --annotation_format? If my approach is incorrect, can you give me an example of how I should format my input to match one of the valid annotation formats?

Thank you very much for any help.

Kind Regards, Tom

ttubb commented 1 year ago

I'll add an example of how I'm currently trying (and failing) to use KEMET. I have two folders set up. /mnt/meta/kemet/KEGG_annotations/ contains gene-to-ko annotations

bin.1_ko.txt
bin.2_ko.txt
...etc.

/mnt/meta/kemet/mag_fasta/ contains predicted proteins for each MAG

bin.1.faa
bin.2.faa
...etc

I navigate to /mnt/eph/software/KEMET/kemet.py and run the command:

python3.8 /mnt/eph/software/KEMET/kemet.py \
    --annotation_format kaas \
    -I /mnt/meta/kemet/KEGG_annotations/ \
    --path_output /mnt/meta/kemet/output \
    --skip_hmm \
    --skip_gsmm \
    /mnt/meta/kemet/mag_fasta/bin.1.faa

This fails with the following error:

Traceback (most recent call last):
  File "/mnt/eph/software/KEMET/kemet.py", line 2449, in <module>
    os.chdir(ktests_directory)
FileNotFoundError: [Errno 2] No such file or directory: '/mnt/meta/kemet/output//ktests/'

The program works when I specify /mnt/meta/kemet/mag_fasta/ instead of /mnt/meta/kemet/mag_fasta/bin.1.faa for the last argument. The whole command would look like this:

python3.8 /mnt/eph/software/KEMET/kemet.py \
    --annotation_format kaas \
    -I /mnt/meta/kemet/KEGG_annotations/ \
    --path_output /mnt/meta/kemet/output \
    --skip_hmm \
    --skip_gsmm \
    /mnt/meta/kemet/mag_fasta/

However, it only produces output for a single bin, which is always bin 12 (the last entry of the list returned by os.listdir('/mnt/meta/kemet/KEGG_annotations/')).

When run like this, the tool does create the ktests directory and populates it with files for every single bin. When I then run the initial command again (where I specify one single bin), it gives me the following error:

Traceback (most recent call last):
  File "/mnt/space/software/KEMET/kemet.py", line 2541, in <module>
    if ktest in sorted(os.listdir()):
NameError: name 'ktest' is not defined

I would be excited to get any help. This tool would be useful for me and my colleagues, and I would like to integrate it into our Nextflow pipelines.

Matteopaluh commented 1 year ago

First of all, thank you for the kind words Tom!

Indeed, README.md is lagging a little behind the actual commands, and I'll fix this as soon as I can, thanks! Nonetheless, I noticed that you worked your way around it anyway by using --annotation_format kaas; I hope it didn't take too long.

One thing I noticed in your second message, though, is that you're passing the predicted proteomes (.faa) as the input argument instead of MAGs or contig files of any sort (.fa, .fasta, .fna). I think this could interfere with file naming, due to pattern replacements. When the HMM and GSMM parts of the script are skipped, the MAG's filename is only used to get the naming pattern for the KEGG annotations. When I ran the script in a sort of "batch" fashion on a big dataset, I used it like this:

for f in genomes/*.fa; do ./kemet.py "$f" -a eggnog --skip_hmm -q --log; done
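
To illustrate how an extension such as .faa could interfere with name matching, here is a minimal sketch. It is not KEMET's actual code: the helper names and the extension list are assumptions made up for this example. It only shows how a plain substring replacement of ".fa" leaves a stray character behind when the input ends in ".faa", while a suffix check rejects such input outright.

```python
from pathlib import Path

# Hypothetical extension list for illustration; not taken from kemet.py.
NUCLEOTIDE_EXTENSIONS = (".fasta", ".fna", ".fa")


def naive_sample_name(fasta_path: str) -> str:
    """Strip extensions by substring replacement (error-prone)."""
    name = Path(fasta_path).name
    for ext in NUCLEOTIDE_EXTENSIONS:
        # replace() matches anywhere in the name, so ".faa" loses only ".fa"
        name = name.replace(ext, "")
    return name


def strict_sample_name(fasta_path: str) -> str:
    """Strip a known extension only when it is the real suffix."""
    name = Path(fasta_path).name
    for ext in NUCLEOTIDE_EXTENSIONS:
        if name.endswith(ext):
            return name[: -len(ext)]
    raise ValueError(f"unsupported FASTA extension: {name}")


print(naive_sample_name("mag_fasta/bin.1.fa"))   # bin.1
print(naive_sample_name("mag_fasta/bin.1.faa"))  # bin.1a
```

With the naive variant, a file named bin.1.faa would be looked up as "bin.1a", which matches no annotation file. The strict variant raises an error instead of silently producing a wrong name.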

I would gladly help with a more extensive debug if you could provide, even privately, a pair of files (bin and annotations) to test.

I don't know very well how Nextflow pipelines work, but if you or your colleagues can make it work, please do so! 😊

Best regards, Matteo

ttubb commented 1 year ago

Thank you for providing support.

I figured out the .faa issue after looking at the code of the main kemet.py script. After changing the file extensions everything seemed to work. I did this thinking KEMET needs the identifiers of predicted genes (to relate those to the information in the KO files).

I now assume this is not the case and will simply re-run using .fasta files with genomes. I will inspect the results and report back if there are still any issues.

My KO-annotations look like this:
C81DXA_C68Y_3 K22106
QHCES8_C68Y_4 K07079
... ...

Tab-separated, headerless, missing unannotated genes. And without a way to relate the gene identifiers to information in the fasta files. Hope that is alright and works with the --kaas flag.
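
As a rough sanity check for files in this layout, the following sketch reads a headerless, tab-separated gene-to-KO table and collects the set of KOs it contains. The function name read_ko_set is hypothetical and the third example row is made up to show a repeated KO; the only fixed assumption is that KO identifiers consist of "K" followed by five digits.

```python
import csv
import io
import re

# KEGG Orthology identifiers: "K" followed by five digits.
KO_PATTERN = re.compile(r"^K\d{5}$")


def read_ko_set(handle):
    """Collect the distinct KOs from a headerless gene<TAB>KO table."""
    kos = set()
    for row in csv.reader(handle, delimiter="\t"):
        if len(row) < 2 or not row[1].strip():
            continue  # blank line or gene without a KO annotation
        ko = row[1].strip()
        if not KO_PATTERN.match(ko):
            raise ValueError(f"unexpected KO identifier: {ko!r}")
        kos.add(ko)
    return kos


# Illustrative input; the last row is invented to show a duplicate KO.
example = io.StringIO(
    "C81DXA_C68Y_3\tK22106\n"
    "QHCES8_C68Y_4\tK07079\n"
    "QHCES8_C68Y_4\tK22106\n"
)
ko_set = read_ko_set(example)
print(sorted(ko_set))  # ['K07079', 'K22106']
```

Running a check like this over every annotation file before calling KEMET would catch stray headers or malformed identifiers early.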

Cheers!

Matteopaluh commented 1 year ago

No problem!

Great that this was the cause. Since the problem is probably solved, I'm closing this issue, but feel free to reopen it! (I'm also updating the README and wiki following this issue.)

> Tab-separated, headerless, missing unannotated genes. And without a way to relate the gene identifiers to information in the fasta files. Hope that is alright and works with the --kaas flag.

Yes, KEMET only takes into consideration which KO annotations are present or absent, without needing the gene identifiers. I assume you could map them back, though! Minor comment: the flag is -a kaas or --annotation_format kaas.

> This tool would be useful for me and my colleagues and i would like to integrate it into our nextflow pipelines.

Looking forward to your and your team's work on this, then!

Best, Matteo

ttubb commented 1 year ago

Again, thanks for your answer! I ran into another issue. After some investigation, I believe it is due to a small bug in KEMET.

My MAGs are numbered 1 through 300. I observed the problem for a small subset of these. Let's say I want to process bin.11.fa. I run:

kemet.py \
        --annotation_format kaas \
        -I annotations/ \
        --path_output kemet_out/bin11 \
        --skip_hmm \
        --skip_gsmm \
        --verbose \
        mag_fasta/bin.11.fa

KEMET then creates ktests files for MAGs 110 through 119 (not for the actual MAG 11) and builds a report based on bin.114_ko.tsv.

I presume the issue is the use of file.startswith() in line 2438 of kemet.py. Though from a quick glance I'm not sure why this would not cause bin.11.fa to be processed alongside 110 through 119.
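
The prefix collision suspected here can be reproduced in isolation. The sketch below is not taken from kemet.py; the function names and the "_ko.tsv" suffix handling are assumptions for illustration. It shows that a startswith() test on "bin.11" also accepts bin.110 through bin.119, whereas comparing against the full expected filename does not. It does not explain why bin.11 itself was skipped in the run described above.

```python
def prefix_matches(sample, filenames):
    """Select files whose name merely starts with the sample name."""
    return [name for name in filenames if name.startswith(sample)]


def exact_matches(sample, filenames, suffix="_ko.tsv"):
    """Select only the file named exactly <sample><suffix>."""
    return [name for name in filenames if name == sample + suffix]


# Illustrative directory listing.
annotation_files = [
    "bin.11_ko.tsv",
    "bin.110_ko.tsv",
    "bin.114_ko.tsv",
    "bin.119_ko.tsv",
    "bin.12_ko.tsv",
]

print(prefix_matches("bin.11", annotation_files))
# ['bin.11_ko.tsv', 'bin.110_ko.tsv', 'bin.114_ko.tsv', 'bin.119_ko.tsv']
print(exact_matches("bin.11", annotation_files))
# ['bin.11_ko.tsv']
```

Including the separator in the prefix (for example, testing for "bin.11_") would be another way to avoid the collision while keeping a startswith() check.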

I circumvented the problem by adding leading zeroes to my filenames.
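
A possible way to script the zero-padding workaround is sketched below. The helper zero_pad_bin_name and the default width of 3 (enough for bins 1 through 300) are assumptions for this example, and the function only computes the new name; actual renaming, for instance with os.rename, is left to the caller.

```python
import re


def zero_pad_bin_name(filename, width=3):
    """Pad the number after the 'bin.' prefix with leading zeroes."""
    return re.sub(
        r"^bin\.(\d+)",
        lambda m: f"bin.{int(m.group(1)):0{width}d}",
        filename,
    )


print(zero_pad_bin_name("bin.11.fa"))       # bin.011.fa
print(zero_pad_bin_name("bin.114_ko.tsv"))  # bin.114_ko.tsv
```

After padding, "bin.011" is no longer a prefix of "bin.110" through "bin.119", so a prefix-based lookup can no longer confuse them. The FASTA files and the annotation files need to be renamed consistently.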

Cheers, Tom

(I cannot reopen the issue)