WrightonLabCSU / DRAM

Distilled and Refined Annotation of Metabolism: A tool for the annotation and curation of function for microbial and viral genomes
GNU General Public License v3.0
239 stars 50 forks source link

KEGG Database preparation for DRAM #305

Closed jrr-microbio closed 9 months ago

jrr-microbio commented 9 months ago

key_words: gene_ko_link_loc, kegg_loc, empty distill, prepare_database

I have been trying to install DRAM and ran into a few blunders and figured I could post a guide here in case it is of use to others.

If you are having issues where you have KEGG installed but your distillation comes up empty, or if you are having issues installing your KEGG databases, the workflow here may help. When DRAM was built, KEGG IDs looked differently than they currently look. DRAM depends on KEGG pretty heavily for pulling K0's to populate the product.html and distillation steps as has been documented across a few issues raised on GitHub.

When you have access to a KEGG license, you can download *.pep files for multiple organisms types, which are in the format file.pep (i.e., prokaryote.pep). In some versions of KEGG, these headers contain: 1) the KEGG Genes identifier (organism:gene), 2) a K0 definition that is comma separated (sometimes an E.C. number, sometimes just ID's), and 3) a semi colon that separates out gene id information.

In order for KEGG to work on DRAM, it needs KEGG.pep file headers to be in the following format: column 1 - KEGG GENES identifier (in the form of org:gene) followed by a space column 2 - A kegg id in the format (k\d\d\d\d\d) parentheses included followed by a space column 3 - KO definition with EC numbers that have format \[EC:\d*.\d*.\d*.\d*\] column 4 - a semi colon ';' a semi colon that separates out gene id information.

That means that for some KEGG versions, column 2 is not defined, and they do not link their KEGG GENE identifier to a K0 number. However, there is an additional file that can be downloaded from KEGG called "ko_genes_list". This file contains 2 tab delimited columns that have 1) Kegg ID in the format (k/d/d/d/d/d), and 2) the KEGG GENES identifier. For example: "ko:K14021 hsa:2579". To get DRAM to pull K0 values properly, you need to merge the header IDs from ko_genes_list into the sequence headers of the fasta file to follow the DRAM format. The following code below can help with this, however there are certainly other ways to solve it. For example, see an old issue thread from DRAM devs here (https://github.com/WrightonLabCSU/DRAM/issues/182). While this code worked for me, it may not work for you! Different kegg database versions may be formatted differently, so make sure that you check your output!

Example function that can help merge KEGG K0s and KEGG GENE IDs

def parse_mapping_file(mapping_file):
    identifier_mapping = {}
    with open(mapping_file, 'r') as mapping_fh:
        for line in mapping_fh:
            parts = line.strip().split('\t')
            if len(parts) == 2:
                identifier_2, identifier_1 = parts
                if identifier_1 not in identifier_mapping:
                    identifier_mapping[identifier_1] = [[identifier_2]]
                else:
                    identifier_mapping[identifier_1].append([identifier_2])
    return identifier_mapping

def update_fasta_file(fasta_file, identifier_mapping):
    updated_lines = []
    with open(fasta_file, 'r') as fasta_fh:
        line_buffer = ""
        for line in fasta_fh:
            if line.startswith('>'):
                if line_buffer:
                    # Process the buffered line before the current line
                    updated_lines.extend(process_buffered_line(line_buffer, identifier_mapping))
                line_buffer = line
            else:
                line_buffer += line
        if line_buffer:
            # Process the last buffered line
            updated_lines.extend(process_buffered_line(line_buffer, identifier_mapping))

    # Write the updated content to the output file "modified_kegg.fasta"
    output_file = "modified_kegg_final.fasta"
    with open(output_file, 'w') as output_fh:
        output_fh.writelines(updated_lines)

def process_buffered_line(line, identifier_mapping):
    parts = line.strip().split(' ')
    identifier_1 = parts[0][1:]
    updated_lines = []
    if identifier_1 in identifier_mapping:
        associations = identifier_mapping[identifier_1]
        for association in associations:
            identifier_2_text = ' ' + ' '.join(f'({id2})' for id2 in association)
            updated_line = f">{identifier_1}{identifier_2_text} {' '.join(parts[1:])}\n"
            updated_lines.append(updated_line)
    else:
        identifier_2_text = ' ()'  # If identifier_1 is not found, add empty parentheses
        updated_line = f">{identifier_1}{identifier_2_text} {' '.join(parts[1:])}\n"
        updated_lines.append(updated_line)

    return updated_lines

def main():
    fasta_file = "./KEGG_euk_prot_db.pep"
    mapping_file = "./ko_genes.list"

    identifier_mapping = parse_mapping_file(mapping_file)

    update_fasta_file(fasta_file, identifier_mapping)

    print("File 'modified_kegg_final.fasta' has been created with the updated content.")

if __name__ == "__main__":
    main()

Once you do your merge, confirm that your files now look like the following example header.

hsa:392133 (ko:K04257) OR10AC1, OR10AC1P; olfactory receptor family 10 subfamily AC member 1 (gene/pseudogene)

If you had genes that had a single kegg_gene_id that hit to multiple ko_ids, make sure that these genes are written out multiple times in your output file, once with each ko_id variation (thanks Mick Watson!!)

Note, not all genes in KEGG have K0s associated with them...and sometimes between versions KEGG identifiers are not linked to ko IDs. If you are having any specific issues with genes / functions, check your version and report that when you open an issue on GitHub so they can figure out if it's a dev issue or a database issue.

Now that you have the appropriately modified headers in your fasta.pep, you can feed that file to the DRAM prepare database command as follows:

DRAM-setup.py prepare_databases --output_dir ./DRAM_databases --kegg_loc=modified_kegg_pep.fasta

It will give a warning that you did not specify "--gene_ko_link_loc" which is fine because we just bypassed it by modifying it ourselves. That should successfully build your KEGG (and other) databases, and you should be ready to go!

jrr-microbio commented 9 months ago

Closing as we figured this out. Wanted to post for others to search within github issues.

mw55309 commented 8 months ago

What about genes with multiple KOs?

jrr-microbio commented 8 months ago

Hmm from what I can tell, while some genes can have multiple KO id's, I believe the goal of DRAM setup databases is to make a mmseqs database from a modified KEGG fasta (or pep file) and then subsequently ameliorate those into the DRAM KEGG database. I deduce this from this snippet here: https://github.com/WrightonLabCSU/DRAM/blob/3ae79321783d9603aad92178c2665c0a14298f4b/mag_annotator/database_processing.py#L233

So for example: In my ko-list file, vg:5602197 and vg:7750918 both have K0 ko:K00857. Those would be turned into unique entries: vg:7750918 (ko:K04257) XXX... vg:56021973 (ko:K04257) XXX...

When I look into my built database of kofam_list.tsv, I see a single entry for K00857 as a thymidine kinase [EC:2.7.1.21]

jrr-microbio commented 8 months ago

For what it's worth - I can confirm that multiple entries that might be a relevant hits get transfered over into the final annotations file, and you can tell them apart. See this example K0 ID (K07068) from one of my current annotations that used the KEGG database setup as above:

kegg_genes_id ko_id kegg_hit
smaz:LH19_26405 K07068 smaz:LH19_26405 (ko:K07068) hypothetical protein
rpj:N234_03205 K07068 rpj:N234_03205 (ko:K07068) hypothetical protein
ngv:CDO52_09515 K07068 ngv:CDO52_09515 (ko:K07068) hypothetical protein

It specifies what the best hit was for each of the suspected K0s that encode for that hypothetical protein.

mw55309 commented 8 months ago

Thanks for the reply!

I am more thinking where a single gene has multiple KOs.

Let's say hypothetical example where gene1 has two KO identifiers K12345 and K23456

Would your code modify the FASTA headers to be:

>gene1 (ko:K12345) (ko:K23456)

And the would DRAM pick these up?

jrr-microbio commented 8 months ago

Aaah I understand your question now. Apologies for the delay, and many thanks for the catch!! Was not aware of these KEGG cases.

The code I posted above will miss those and will just take the first instance it finds. I was not aware this was a possible use case since I tested on some subsets which missed those fringe-cases (seems to only be a case for 0.7% of total kegg_genes_ID in my list). This code below (thanks stackoverflow and chatgpt) should solve that, and I'll update it above as well.

Code should output the genes with multiple KOs as such:

>gene1 (ko:K12345)
AKHPSLKPSTEVED
>gene1 (ko:K23456)
AKHPSLKPSTEVED

Then, DRAM should just pull out whichever individual K0 is identified per line to run mmseqs on. Once this is built, if that gene gets a hit in the output, I believe it should output two K0 values similar to what it does for GHs (for example: GH16; GH16_21; GH16_25; GH16_27; GH16_3). I don't have the bandwidth to fully test the new database right now so if you would like to try and run code below through the database creation process and see if they hit back to both original hits, I'm sure others would appreciate! For what it's worth, I am not associated with DRAM anymore and am now a post-doc elsewhere / only worked on this because I was trying to get DRAM installed over here. lol

For code below, reminder for all to change these vars to whatever yours are called. fasta_file = "./KEGG_euk_prot_db.pep" mapping_file = "./ko_genes.list"

def parse_mapping_file(mapping_file):
    identifier_mapping = {}
    with open(mapping_file, 'r') as mapping_fh:
        for line in mapping_fh:
            parts = line.strip().split('\t')
            if len(parts) == 2:
                identifier_2, identifier_1 = parts
                if identifier_1 not in identifier_mapping:
                    identifier_mapping[identifier_1] = [[identifier_2]]
                else:
                    identifier_mapping[identifier_1].append([identifier_2])
    return identifier_mapping

def update_fasta_file(fasta_file, identifier_mapping):
    updated_lines = []
    with open(fasta_file, 'r') as fasta_fh:
        line_buffer = ""
        for line in fasta_fh:
            if line.startswith('>'):
                if line_buffer:
                    # Process the buffered line before the current line
                    updated_lines.extend(process_buffered_line(line_buffer, identifier_mapping))
                line_buffer = line
            else:
                line_buffer += line
        if line_buffer:
            # Process the last buffered line
            updated_lines.extend(process_buffered_line(line_buffer, identifier_mapping))

    # Write the updated content to the output file "modified_kegg.fasta"
    output_file = "modified_kegg_final.fasta"
    with open(output_file, 'w') as output_fh:
        output_fh.writelines(updated_lines)

def process_buffered_line(line, identifier_mapping):
    parts = line.strip().split(' ')
    identifier_1 = parts[0][1:]
    updated_lines = []
    if identifier_1 in identifier_mapping:
        associations = identifier_mapping[identifier_1]
        for association in associations:
            identifier_2_text = ' ' + ' '.join(f'({id2})' for id2 in association)
            updated_line = f">{identifier_1}{identifier_2_text} {' '.join(parts[1:])}\n"
            updated_lines.append(updated_line)
    else:
        identifier_2_text = ' ()'  # If identifier_1 is not found, add empty parentheses
        updated_line = f">{identifier_1}{identifier_2_text} {' '.join(parts[1:])}\n"
        updated_lines.append(updated_line)

    return updated_lines

def main():
    fasta_file = "./KEGG_euk_prot_db.pep"
    mapping_file = "./ko_genes.list"

    identifier_mapping = parse_mapping_file(mapping_file)

    update_fasta_file(fasta_file, identifier_mapping)

    print("File 'modified_kegg_final.fasta' has been created with the updated content.")

if __name__ == "__main__":
    main()

I will say, though, I am also not sure if the ko_genes.list file has changed significantly in the latest versions as I only had access to one of them. The code above is for my case specifically, I posted so folks would know where the "issue" was (i.e., it's a header issue). For what it's worth, I know they are actively working on this and getting DRAM2 ready to make this process easier. That being said, since the database is built off of multiples of each gene, I would not be "too" concerned with a few missing values.

jianshu93 commented 2 months ago

hello all,

In academia, I cannot believe I have to pay for kegg, I mean it should be free for academic use. The only software I know that we also need to pay for academic use is USEARCH and It is only $800, USEARCH is an amazing tool with 100,000 lines of C++ code (basically 10 years+ code for an experienced C++ programmer). We should fight this, make an open source version (like vsearch, the open source usearch) because I believe kegg use many of the resources and protein information from NCBI (which is free anyway) without paying for it, from this point of view, it is benefiting from the community without giving the benefit back, which is a shame and I think they should be kicked out of academia. As a strong supporter for open source (think about Linux kernel and Git, nothing of those software/database will be possible without the two amazing open source project), I cannot understand and agree the kegg academic policy.

Jianshu