Matteopaluh / KEMET

KEGG Module Evaluation Tool

Module completeness as stand-alone package #19

Open Alxdu opened 8 months ago

Alxdu commented 8 months ago

First of all, thank you for putting together this really great package. I find the module completeness assessment really unique, with only a few other lesser options out there (e.g., KeggDecoder). I also like the way you break down the module definitions in .kk files for improved completeness assessment. I therefore look forward to seeing continued support and development for this function.

In my case, I use KO annotations made within a different pipeline to assess module completeness with KEMET. In theory I would only need the annotation .txt file, but I also have to provide the genome assembly .fasta file to run the script (which is not really needed when running with the --skip_hmm and --skip_gsmm arguments).

If I could make a feature request/suggestion, it would be to separate out the module completeness functionality so that it accepts just KO annotation files (either a path to a file or a path to a folder for batch operation).

It would also be great to have a stand-alone tool to create module definition .kk files from the official KEGG module .txt files, for situations where KEMET is not continuously supported and the current .kk files become obsolete.

Thank you for giving these some consideration.

jolespin commented 8 months ago

I would also like this feature.

@Alxdu have you found any alternatives?

Matteopaluh commented 8 months ago

Hello both of you,

Indeed, KEMET was originally conceived and structured as 3 different scripts, but at the time of the first manuscript submission to a journal, one reviewer suggested bundling all functions in a single package.

Because of this, the design of the main script was reworked into its present form, but lines 2444-2495 are remnants of the initial concept of module annotation alone.

I've briefly checked the code of kemet.py and found that a minor code rework would permit a workaround in Bash, allowing batch annotation of KOs without FASTA sequences, given the presence of suitable annotation files.

The script does not specifically require FASTA files as input; it uses the file names of those files to keep a constant flow for all operations connected to the same MAGs/genomes.

That is to say, if the --skip_hmm parameter is added, the mandatory FASTA_file argument can be a path to the annotation folder. A couple of lines of code would be sufficient to rename the variable file_name checked in lines 2452, 2458, and 2464. That way a simple

# Glob instead of parsing ls; basename keeps passing only the file name, as before
for f in PATH/TO/ANNOTATION-FILES/*;
do
    ./kemet.py "$(basename "$f")" -a ANNOTATION_FORMAT --skip_hmm;
done

should work for batch annotation.

In the meantime, I guess another workaround would be to truncate the names of the KO annotation files with code like:

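for f in PATH/TO/ANNOTATION-FILES/*;
do
    f1="$(basename "$f")";   # file name only, without the folder
    f1="${f1%%.*}";          # drop everything from the first dot onwards
    ./kemet.py "$f1" -a ANNOTATION_FORMAT --skip_hmm;
done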

For single-file annotation, instead of pointing to a FASTA file path, it is possible to point to an annotation file, provided the extension is left out.

I'll work on the solution I mentioned in this reply, covering both the single-file and batch use cases, as soon as I'm available!

@Alxdu regarding the tool to create module definitions: it could be made available, but it would take a while longer. I already have some code for that, which was used as the backbone for most of the .kk files, but it still needs some manual curation for a minority of them. I was therefore working out a way to eliminate this manual curation in the code, and in the meantime I have updated to the second-to-last KEGG version. I'll also try to do the same for the latest one in the near future.
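As a rough illustration of the name truncation above, here is a minimal Python sketch of what `${f%%.*}` does to each annotation file name, and of the command line that would result. The helper names `mag_name` and `kemet_commands` are hypothetical, and the sketch only builds the argument lists (using the `-a` and `--skip_hmm` options mentioned in this thread); it does not run kemet.py or reflect its internals.

```python
from pathlib import Path


def mag_name(filename):
    """Drop everything from the first dot onwards, like ${f%%.*} in Bash."""
    return filename.split(".", 1)[0]


def kemet_commands(annotation_dir, annotation_format):
    """Build one kemet.py argument list per annotation file (not executed)."""
    commands = []
    for path in sorted(Path(annotation_dir).iterdir()):
        # Skip sub-folders and hidden files, which a plain `ls` would not list.
        if not path.is_file() or path.name.startswith("."):
            continue
        commands.append(
            ["./kemet.py", mag_name(path.name), "-a", annotation_format, "--skip_hmm"]
        )
    return commands
```

Note that the truncation cuts at the first dot, so a file called `MAG1.emapper.annotations` becomes `MAG1`, and a genome name that itself contains a dot would be shortened as well.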

Best,

Matteo

jolespin commented 8 months ago

@Matteopaluh this is great news. Would it also be possible to include some functionality that takes in something like just a list of KO ids? Something like this:

for GENOME_ID in $(cat genomes.list);
do
    KO_IDS=kofam_results/${GENOME_ID}.ko_ids.list
    kemet.py $KO_IDS -a ko_list > kemet_results/${GENOME_ID}.mcr.tsv
done

If you're able to implement this functionality and add the module as a conda package I will incorporate it into my https://github.com/jolespin/veba package. I'm working on the v2 publication right now so your package of course would be cited and properly referenced.

What would be very useful is to give kemet a list of the KO ids present in a particular genome and get an output that reports each KEGG module and its module completion ratio (plus any extra data).

How difficult would it be on your end to make this type of update?

Alxdu commented 8 months ago

@Matteopaluh it is excellent to hear that you intend to revisit and improve upon the module completeness functionality. I will have a go at your suggested code modifications as a workaround, but I also look forward to your own implementation in upcoming updates. The same goes for the module definition tooling (i.e., rebuilding .kk files). What you have is original and a fairly unique offering for KEGG users. Nice work.