cruizperez / MicrobeAnnotator

Pipeline for metabolic annotation of microbial genomes
Artistic License 2.0
137 stars 27 forks

How can I compute module completion ratios given a set of KEGG orthologs? #9

Open jolespin opened 3 years ago

jolespin commented 3 years ago

Hi Carlos,

First off, thank you so much for putting in all of the work to develop this tool. I have been looking for something similar to this for years and I finally need to publish some papers that rely on this methodology. I have my data in the following format in Python dictionaries:

organism_to_orthologs = {
"Species_A": {"K00855","K01601","K01602","K00927","K05298","K00150","K00134"},
...
}

keggmodule_to_definition = {
"M00357": "K00855 (K01601-K01602) K00927 (K05298,K00150,K00134) (K01623,K01624) (K03841,K02446,K11532,K01086) K00615 (K01623,K01624) (K01100,K11532,K01086) K00615 (K01807,K01808)"
...
}

Is there any way I can use any of your existing package or scripts to accomplish the following function:

def module_completion_ratio(definition: str, orthology_set: set):
    """
    definition: the KEGG module definition
    orthology_set: a set of KEGG orthologs for a particular grouping
    """
    # So much empty...
    mcr = None
    return mcr
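
For reference, the only thing I can do trivially right now is ignore the boolean structure of the definition entirely, which is obviously wrong for modules with alternatives and protein complexes (hence this issue). A naive placeholder:

import re

# Naive sketch: fraction of the definition's KOs that are present,
# ignoring AND/OR logic and complexes entirely
def naive_completion_ratio(definition: str, orthology_set: set) -> float:
    orthologs_in_definition = set(re.findall(r"K\d{5}", definition))
    if not orthologs_in_definition:
        return 0.0
    return len(orthologs_in_definition & orthology_set) / len(orthologs_in_definition)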

Of course, anything I use from your package will be cited and any help will be acknowledged.

Any help would be greatly appreciated.

Thank you, -Josh

cruizperez commented 3 years ago

Hi Josh, Let me ask you a couple of questions to see if I understand correctly.

Carlos

jolespin commented 3 years ago

Hi Carlos, I appreciate you getting back to me on this!

> Are you interested in a particular subset of modules?

Not necessarily. Our institute uses an older version of KEGG for some of its pipelines, and I realized that many of the KEGG orthologs and modules had been discontinued. Last week I reannotated using the newest database from KEGG's KOFAMSCAN. To get the modules associated with each ortholog, I wrote the following function, which returns the set of modules associated with a given ortholog (not very well documented, apologies):

import re
import sys
from urllib.error import HTTPError
from Bio.KEGG.REST import kegg_get

def get_module_from_ortholog(id_ko):
    try:
        text = kegg_get(f"ko:{id_ko}").read()
        module_lines = list(filter(lambda x:x.startswith("MODULE"), text.split("\n")))
        modules = set()
        for x in module_lines:
            match = re.search(r"M\d{5}", x)
            if match is not None:
                id_module = match.group()
                modules.add(id_module)
            else:
                print(x, file=sys.stderr)
        return modules
    except HTTPError:
        print(id_ko, file=sys.stderr)
        return None
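
To get the full set of modules touched by any of my organisms, I then aggregated across all orthologs, roughly like this:

# Rough sketch of how the aggregate module set below was built
all_orthologs = set.union(*organism_to_orthologs.values())
modules_associated_with_orthologs = set()
for id_ko in sorted(all_orthologs):
    modules = get_module_from_ortholog(id_ko)
    if modules is not None:
        modules_associated_with_orthologs |= modules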

Upon doing so, I checked the overlap between the modules associated with my orthologs and the modules in your databases:

modules_associated_with_orthologs = {'M00001', 'M00003', 'M00004', 'M00005', 'M00008', 'M00009', 'M00012', 'M00013', 'M00014', 'M00015', 'M00016', 'M00017', 'M00018', 'M00019', 'M00020', 'M00021', 'M00022', 'M00023', 'M00024', 'M00025', 'M00026', 'M00027', 'M00028', 'M00029', 'M00030', 'M00033', 'M00034', 'M00035', 'M00036', 'M00038', 'M00039', 'M00044', 'M00045', 'M00046', 'M00047', 'M00048', 'M00049', 'M00050', 'M00051', 'M00052', 'M00053', 'M00055', 'M00059', 'M00060', 'M00061', 'M00063', 'M00064', 'M00066', 'M00075', 'M00076', 'M00078', 'M00079', 'M00081', 'M00082', 'M00083', 'M00086', 'M00087', 'M00088', 'M00089', 'M00091', 'M00093', 'M00095', 'M00096', 'M00097', 'M00098', 'M00099', 'M00114', 'M00115', 'M00116', 'M00117', 'M00118', 'M00119', 'M00120', 'M00121', 'M00122', 'M00123', 'M00124', 'M00125', 'M00126', 'M00127', 'M00129', 'M00131', 'M00133', 'M00134', 'M00135', 'M00136', 'M00140', 'M00144', 'M00151', 'M00153', 'M00154', 'M00155', 'M00156', 'M00157', 'M00159', 'M00165', 'M00168', 'M00169', 'M00173', 'M00175', 'M00176', 'M00307', 'M00345', 'M00346', 'M00356', 'M00357', 'M00358', 'M00364', 'M00373', 'M00374', 'M00375', 'M00377', 'M00417', 'M00432', 'M00525', 'M00526', 'M00527', 'M00528', 'M00529', 'M00530', 'M00532', 'M00533', 'M00540', 'M00545', 'M00546', 'M00548', 'M00549', 'M00550', 'M00552', 'M00554', 'M00555', 'M00564', 'M00565', 'M00567', 'M00569', 'M00570', 'M00572', 'M00575', 'M00577', 'M00579', 'M00596', 'M00609', 'M00615', 'M00616', 'M00627', 'M00631', 'M00632', 'M00642', 'M00651', 'M00652', 'M00698', 'M00705', 'M00714', 'M00718', 'M00725', 'M00726', 'M00740', 'M00769', 'M00778', 'M00793', 'M00824', 'M00835', 'M00842', 'M00846', 'M00847', 'M00848', 'M00849', 'M00850', 'M00855', 'M00867', 'M00872', 'M00875', 'M00876', 'M00878', 'M00880', 'M00881', 'M00883', 'M00892', 'M00897', 'M00898', 'M00899', 'M00903', 'M00909', 'M00911', 'M00916', 'M00922', 'M00923', 'M00924', 'M00925', 'M00926', 'M00930', 'M00931', 'M00936'}

modules_associated_with_orthologs - (set(regular_modules) | set(bifurcation_modules) | set(structural_modules)) 

Those "missing" modules were the following which appear to be a mix of signature and pathway modules:

{'M00615','M00616','M00881','M00883','M00892','M00897','M00898','M00899','M00903','M00909','M00911','M00916','M00922','M00923','M00924','M00925','M00926','M00930','M00931','M00936'}

To answer your question, I'm not necessarily interested in particular modules; mostly I want to know which modules are represented by a particular set of orthologs and how complete those modules are, considering all of the orthologs in the grouping (e.g., a metagenome-assembled genome [MAG]). One tricky thing I've realized about KEGG is that its database is constantly changing, adding and removing orthologs and modules.

My initial interest was to parse the actual KEGG DEFINITION string because of this variability between versions, but upon finding your tool, that seems like an unnecessary extra step on my end since you've already done the work of parsing them.

That said, adding a definition parser function to your package would be pretty useful for the field in general because, other than you, I don't know of anyone or any package that actually implements this correctly!

> Do you need to access your organism_to_orthologs directly from the dictionary?

It's always nice to avoid the command line, but I can absolutely format the files so they can be used directly! How should the files be formatted? Is there a header? Is it just something like this?

[COMMAND] cat MAG_1.list
[OUTPUT]
K00134
K00150
K00844
...
K01623
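
If that's the format, I could dump the dictionary to per-organism files with something like this (the file naming is just my own choice):

from pathlib import Path

# Hypothetical export: one file per organism, one KO identifier per line
output_directory = Path("ko_lists")
output_directory.mkdir(exist_ok=True)
for organism, orthologs in organism_to_orthologs.items():
    with open(output_directory / f"{organism}.list", "w") as handle:
        for ko in sorted(orthologs):
            print(ko, file=handle)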

> If you can export your organism_to_orthologs to an individual file per organism with a KO identifier per line, you could directly use the ko_mapper.py script within MicrobeAnnotator. It will assume each file is an organism and will calculate and report the completeness of all modules.

This is great! Honestly, the only other tool I know of that did this was MAPLE, and that has been discontinued (since Feb 2019) AFAIK. I'm sure once your tool is published it will be very popular. I will definitely implement it in a few institution-wide pipelines that other labs will use (we do a lot of metagenomics).

> If you need to access them directly from the dictionaries, what I would do is take the set of KO identifiers for each key (organism) and pass them directly to the functions called by global_mapper() inside ko_mapper.py (lines 459-471). You would need to slightly modify the functions because they expect a file and you would be passing the set directly.

I'll probably give this a shot first and, if I'm successful, I'll post the code here so you can have it in case you ever need it or it would be useful for anyone else.

> I would recommend the first one because it is easier. However, if you need help with the second option let me know and I can include some changes in the code to accommodate that.

Will do! Again, thanks for your help in all of this. As I mentioned, this is such a key part of metagenomics and there are so few resources that have this capability.

Thanks, -Josh

jolespin commented 3 years ago

This looks like it did the trick:

"""---1.0 Define Functions---"""

def ko_match(string):
    """ Looks if string has the form K[0-9]{5}

    Arguments:
        string {string} -- String to test

    Returns:
        [bool] -- String has form or not
    """
    if re.search(r'^K[0-9]{5}$', string) is not None:
        return 1
    else:
        return 0

def split_compound(string, comp_type):
    """[summary]

    Arguments:
        string {[type]} -- [description]
        comp_type {[type]} -- [description]

    Returns:
        [type] -- [description]
    """
    if comp_type == 'compound':
        return re.split('[-+]', string)
    elif comp_type == 'and_comp':
        return string.split('_')
    elif comp_type == 'or_comp':
        return string.split(',')

def process_compounds(string, comp_type, genome_annotation):
    """Counts how many proteins are required by a compound/option string and
    how many of them are present in the genome annotation."""
    string = split_compound(string, comp_type)
    proteins_required = len(string)
    proteins_present = 0
    for option in string:
        if option == '':
            proteins_required -= 1
        elif '+' in option or '-' in option:
            compound = split_compound(option, 'compound')
            proteins_in_compound = len(compound)
            present_in_compound = 0
            for sub_option in compound:
                if ko_match(sub_option) > 0 and sub_option in genome_annotation:
                    present_in_compound += 1
            proteins_present += present_in_compound/proteins_in_compound
        else:
            if ko_match(option) > 0 and option in genome_annotation:
                proteins_present += 1
    return proteins_present, proteins_required

def import_module_dict(dictionary_location):
    with open(dictionary_location,"rb") as pickle_in:
        dictionary = pickle.load(pickle_in)
    return dictionary

def regular_module_mapper(module_dictionary, genome_annotation_file):
    if not isinstance(genome_annotation_file, str):
        assert hasattr(genome_annotation_file, "__iter__"), "If `genome_annotation_file` is not a filepath then it must be an iterable of orthologs"
        genome_annotation = set(genome_annotation_file)

    else:
        assert isinstance(genome_annotation_file, str), "If `genome_annotation_file` is not an iterable of orthologs then it must be a string filepath"
        genome_annotation = annotation_file_parser(genome_annotation_file)

    regular_module_completenes = []
    for module, final_steps in module_dictionary.items():
        complete_steps = 0
        total_module_steps = len(final_steps)
        for proteins in final_steps.values():
            score_for_step = 0
            steps_per_option = 100
            match = False
            for option in proteins:
                match = False
                # Search for single gene option
                if ko_match(option) > 0:
                    if option in genome_annotation:
                        score_for_step = 1
                        match = True
                    if match == True:
                        steps_per_option = 1
                    elif 1 < steps_per_option:
                        steps_per_option = 1
                elif '%' in option:
                    option = option.replace(')', '')
                    option = option.split('-%')
                    mandatory = split_compound(option[0], 'compound')
                    proteins_required = len(mandatory) + 1
                    proteins_present = 0
                    for prot in mandatory:
                        if ko_match(prot) > 0 and prot in genome_annotation:
                            proteins_present += 1
                    for prot in option[1].split(','):
                        if ko_match(prot) > 0 and prot in genome_annotation:
                            proteins_present += 1
                            break
                    if proteins_present/proteins_required > score_for_step:
                        score_for_step = proteins_present/proteins_required
                        match = True
                    if match == True:
                        steps_per_option = 1
                    elif 1 < steps_per_option:
                        steps_per_option = 1
                elif '_' in option and ',' in option:
                    option = sorted(split_compound(option, 'and_comp'), key=len)
                    highest_score = 0
                    for element in option:
                        if ko_match(element) > 0 and element in genome_annotation:
                            highest_score += 1
                        elif ',' in element:
                            element = sorted(element.split(","), key=len)
                            for sub_element in element:
                                if ko_match(sub_element) > 0 and sub_element in genome_annotation:
                                    highest_score += 1
                                    break
                                elif '+' in sub_element:
                                    proteins_present, proteins_required = process_compounds(sub_element, 
                                    'compound', genome_annotation)
                                    highest_score += proteins_present/proteins_required
                    if highest_score > score_for_step:
                        score_for_step = highest_score
                        match = True
                    if match == True:
                        steps_per_option = len(option)
                    elif len(option) < steps_per_option:
                        steps_per_option = len(option)
                elif len(option.split(",")) > 1:
                    match = False
                    option = sorted(option.split(","), key=len)
                    highest_score = 0
                    steps_to_add = 0
                    for sub_option in option:
                        if ko_match(sub_option) > 0 and sub_option in genome_annotation:
                            if 1 > highest_score:    
                                highest_score = 1
                                steps_to_add = 1
                        elif '+' in sub_option or '-' in sub_option:
                            proteins_present, proteins_required = process_compounds(sub_option, 'compound', genome_annotation)
                            if proteins_present/proteins_required > highest_score:
                                highest_score = proteins_present/proteins_required
                                steps_to_add = 1
                    if highest_score > score_for_step:
                        score_for_step = highest_score
                        match = True
                    if match == True:
                        steps_per_option = steps_to_add
                    elif steps_to_add < steps_per_option:
                        steps_per_option = steps_to_add
                elif '_' in option:
                    match = False
                    proteins_present, proteins_required = process_compounds(option, 'and_comp', genome_annotation)
                    if proteins_present/proteins_required > score_for_step:
                        score_for_step = proteins_present
                        match = True
                    if match == True:
                        steps_per_option = proteins_required
                    elif proteins_required < steps_per_option:
                        steps_per_option = proteins_required
                elif "+" in option or "-" in option:
                    match = False
                    highest_score = 0
                    proteins_present, proteins_required = process_compounds(option, 'compound', genome_annotation)
                    if proteins_present/proteins_required > score_for_step:
                        score_for_step = proteins_present/proteins_required
                        match = True
                    if match == True:
                        steps_per_option = 1
                    elif proteins_required < steps_per_option:
                        steps_per_option = 1
                else:
                    print("Unreccognized module {}. Check your database.".format(option))
            complete_steps += score_for_step
            if steps_per_option > 50:
                steps_per_option = 1
            if steps_per_option > 1:
                total_module_steps += steps_per_option - 1
        regular_module_completenes.append((module, round((complete_steps/total_module_steps)*100, 2)))
    return regular_module_completenes

def bifurcating_module_mapper(module_dictionary, genome_annotation_file):
    if not isinstance(genome_annotation_file, str):
        assert hasattr(genome_annotation_file, "__iter__"), "If `genome_annotation_file` is not a filepath then it must be an iterable of orthologs"
        genome_annotation = set(genome_annotation_file)

    else:
        assert isinstance(genome_annotation_file, str), "If `genome_annotation_file` is not an iterable of orthologs then it must be a string filepath"
        genome_annotation = annotation_file_parser(genome_annotation_file)

    bifurcating_module_completenes = []
    for module, versions in module_dictionary.items():
        module_highest = 0
        for version, total_steps in versions.items():
            completed_steps = 0
            total_version_steps = len(total_steps)
            for proteins in total_steps.values():
                score_for_step = 0
                steps_per_option = 100
                match = False
                if isinstance(proteins, (list)):
                    protein = sorted(proteins, key=len)
                    for option in protein:
                        match = False
                        if ko_match(option) > 0:
                            if option in genome_annotation:
                                score_for_step = 1
                                match = True
                            if match == True:
                                steps_per_option = 1
                            elif 1 < steps_per_option:
                                steps_per_option = 1
                        elif '_' in option and ',' in option:
                            option = sorted(split_compound(option, 'and_comp'), key=len)
                            highest_score = 0
                            for element in option:
                                if ko_match(element) > 0 and element in genome_annotation:
                                    highest_score += 1
                                elif ',' in element:
                                    element = sorted(element.split(","), key=len)
                                    score_sub_option = 0
                                    for sub_element in element:
                                        if ko_match(sub_element) > 0 and sub_element in genome_annotation:
                                            highest_score += 1
                                            break
                                        elif '+' in sub_element:
                                            proteins_present, proteins_required = process_compounds(sub_element, 
                                            'compound', genome_annotation)
                                            highest_score += proteins_present/proteins_required
                            if highest_score > score_for_step:
                                score_for_step = highest_score
                                match = True
                            if match == True:
                                steps_per_option = len(option)
                        elif '_' in option:
                            match = False
                            proteins_present, proteins_required = process_compounds(option, 'and_comp', genome_annotation)
                            if proteins_present/proteins_required > score_for_step:
                                score_for_step = proteins_present
                                match = True
                            if match == True:
                                steps_per_option = proteins_required
                            elif proteins_required < steps_per_option:
                                steps_per_option = proteins_required
                        elif '+' in option or '-' in option:
                            match = False
                            highest_score = 0
                            proteins_present, proteins_required = process_compounds(option, 'compound', genome_annotation)
                            if proteins_present/proteins_required > score_for_step:
                                score_for_step = proteins_present/proteins_required
                                match = True
                            if match == True:
                                steps_per_option = 1
                            elif proteins_required < steps_per_option:
                                steps_per_option = 1
                elif ko_match(proteins) > 0:
                    match = False
                    if proteins in genome_annotation:
                        score_for_step = 1
                        match = True
                    if match == True:
                        steps_per_option = 1
                    elif 1 < steps_per_option:
                        steps_per_option = 1
                elif ',' in proteins:
                    options = split_compound(proteins, 'or_comp')
                    for option in options:
                        if ko_match(option) > 0 and option in genome_annotation:
                            score_for_step = 1
                            match = True
                        if match == True:
                            steps_per_option = 1
                        elif 1 < steps_per_option:
                            steps_per_option = 1
                elif '+' in proteins or '-' in proteins:
                    match = False
                    highest_score = 0
                    proteins_present, proteins_required = process_compounds(proteins, 'compound', genome_annotation)
                    if proteins_present/proteins_required > score_for_step:
                        score_for_step = proteins_present/proteins_required
                        match = True
                    if match == True:
                        steps_per_option = 1
                    elif proteins_required < steps_per_option:
                        steps_per_option = 1
                else:
                    print("Unreccognized module {}. Check your database.".format(proteins))
                completed_steps += score_for_step
                if steps_per_option > 50:
                    steps_per_option = 1
                if steps_per_option > 1:
                    total_version_steps += steps_per_option - 1
            if completed_steps/total_version_steps > module_highest:
                module_highest = completed_steps/total_version_steps
        bifurcating_module_completenes.append((module, round(module_highest*100, 2)))
    return bifurcating_module_completenes

def structural_module_mapper(module_dictionary, genome_annotation_file):
    if not isinstance(genome_annotation_file, str):
        assert hasattr(genome_annotation_file, "__iter__"), "If `genome_annotation_file` is not a filepath then it must be an iterable of orthologs"
        genome_annotation = set(genome_annotation_file)

    else:
        assert isinstance(genome_annotation_file, str), "If `genome_annotation_file` is not an iterable of orthologs then it must be a string filepath"
        genome_annotation = annotation_file_parser(genome_annotation_file)

    structural_module_completenes = []
    for module, components in module_dictionary.items():
        score_for_components = 0
        module_proteins_present = 0
        module_proteins_required = 0
        for proteins in components:
            if isinstance(proteins, (list)):
                highest_score = 0
                proteins_to_add = 0
                steps_to_add = 100
                for option in proteins:
                    if '_' in option and ',' in option:
                        option = sorted(split_compound(option, 'and_comp'), key=len)
                        proteins_present_option = 0
                        proteins_required_option = 0
                        for element in option:
                            if ko_match(element) > 0:
                                if element in genome_annotation:
                                    proteins_present_option += 1
                                    proteins_required_option += 1
                                else:
                                    proteins_required_option += 1
                            elif ',' in element:
                                element = sorted(element.split(","), key=len)
                                score_sub_element = 0
                                proteins_present_sub_element = 0
                                proteins_required_sub_element = 0
                                for sub_element in element:
                                    if ko_match(sub_element) > 0:
                                        if sub_element in genome_annotation:
                                            score_sub_element = 1
                                            proteins_present_sub_element += 1
                                            proteins_required_sub_element += 1
                                        elif 1 < proteins_required_sub_element and score_sub_element == 0:
                                            proteins_required_sub_element = 1
                                    elif '+' in sub_element or '-' in sub_element:
                                        proteins_present, proteins_required = process_compounds(sub_element, 
                                        'compound', genome_annotation)
                                        if proteins_present/proteins_required > score_sub_element:
                                            score_sub_element = proteins_present/proteins_required
                                            proteins_present_sub_element = proteins_present
                                            proteins_required_sub_element = proteins_required
                                        elif proteins_required < proteins_required_sub_element and score_sub_element == 0:
                                            proteins_required_sub_element = proteins_required
                                proteins_present_option += proteins_present_sub_element
                                proteins_required_option += proteins_required_sub_element
                        if proteins_present_option/proteins_required_option > highest_score:
                            highest_score = proteins_present_option/proteins_required_option
                            proteins_to_add = proteins_present_option
                            steps_to_add = proteins_required_option
                        elif proteins_required_option < steps_to_add and highest_score == 0:
                            steps_to_add = proteins_required_option
                    elif '+' in option or '-' in option:
                        proteins_present, proteins_required = process_compounds(option, 'compound', genome_annotation)
                        if proteins_present/proteins_required > highest_score:
                            highest_score = proteins_present/proteins_required
                            steps_to_add = proteins_required
                            proteins_to_add = proteins_present
                        elif proteins_required < steps_to_add and highest_score == 0:
                            steps_to_add = proteins_required
                module_proteins_present += proteins_to_add
                module_proteins_required += steps_to_add
            elif ko_match(proteins) > 0:
                if proteins in genome_annotation:
                    module_proteins_present += 1
                    module_proteins_required += 1
                else:
                    module_proteins_required += 1
            elif ',' in proteins:
                proteins = sorted(proteins.split(","), key=len)
                score = 0
                proteins_present_option = 0
                proteins_required_option = 100
                for element in proteins:
                    if ko_match(element) > 0:
                        if element in genome_annotation:
                            proteins_required_option = 1
                            proteins_present_option = 1
                            break
                        elif 1 < proteins_required_option and score == 0:
                            proteins_required_option = 1
                    elif '+' in element or '-' in element:
                        proteins_present, proteins_required = process_compounds(element, 'compound', genome_annotation)
                        if proteins_present/proteins_required > score:
                            score = proteins_present/proteins_required
                            proteins_present_option = proteins_present
                            proteins_required_option = proteins_required
                        elif proteins_required < proteins_required_option and score == 0:
                            proteins_required_option = proteins_required
                module_proteins_present += proteins_present_option
                module_proteins_required += proteins_required_option
            elif '+' in proteins or '-' in proteins:
                proteins_present, proteins_required = process_compounds(proteins, 'compound', genome_annotation)
                module_proteins_present += proteins_present
                module_proteins_required += proteins_required  
            else:
                print("Unreccognized module {}. Check your database.".format(proteins))
        score_for_components = module_proteins_present/module_proteins_required
        structural_module_completenes.append((module, round(score_for_components*100, 2)))
    return structural_module_completenes

def global_mapper_from_dict(regular_modules, bifurcating_modules, structural_modules, grouping_to_orthologs):
    print("Mapping entries... ", end="")
    full_metabolic_completeness = {}

    for name, ortholog_set in grouping_to_orthologs.items():
#         genome_name = str(Path(file).name)
        regular_completeness = OrderedDict(regular_module_mapper(regular_modules, ortholog_set))
        bifurcating_completeness = OrderedDict(bifurcating_module_mapper(bifurcating_modules, ortholog_set))
        structural_completeness = OrderedDict(structural_module_mapper(structural_modules, ortholog_set))
        final_completeness = pd.concat([pd.Series(regular_completeness), pd.Series(bifurcating_completeness), pd.Series(structural_completeness)])
        full_metabolic_completeness[name] = final_completeness
    print("Done!\n")
    return pd.DataFrame(full_metabolic_completeness).T

And then I used site-packages to locate the data directory:

# Import module data
import site
from pathlib import Path

script_dir = site.getsitepackages()[0] #script_path.parent
data_folder = Path(script_dir) / "data"
# Import all modules from dictionaries
regular_modules = import_module_dict(data_folder / "01.KEGG_Regular_Module_Information.pickle")
bifurcation_modules = import_module_dict(data_folder / "02.KEGG_Bifurcating_Module_Information.pickle")
structural_modules = import_module_dict(data_folder / "03.KEGG_Structural_Module_Information.pickle")

Then I imported my pickled dictionary from another environment, which follows the format {"taxa_A": {"KEGG_ORTHOLOG_1", "KEGG_ORTHOLOG_2", ..., "KEGG_ORTHOLOG_N"}}:

import pickle
with open("../../Caries/metaT/Data/annotations/caries.gt300.orfs.f_661740.kofamscan.significant.taxonomy.pkl", "rb") as f:
    organism_to_orthologs = pickle.load(f)
output = global_mapper_from_dict(regular_modules, bifurcation_modules, structural_modules, organism_to_orthologs)

[image: preview of the resulting module completeness DataFrame]

Note that the above code is just a quick hack, but it appears to get the job done with the existing code while maintaining the previous usage as well (i.e., it should work with both files and ortholog sets).
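
And the resulting table can be written out for downstream use, e.g.:

# Organisms are rows, modules are columns in the output DataFrame
output.to_csv("module_completeness.tsv", sep="\t")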

cruizperez commented 3 years ago

This is awesome! I will try to implement this as well so it is easier to use with other pipelines directly. I am glad it worked out for you. I know these KEGG modules change a lot so I will probably need to update the modules I parsed.

jolespin commented 3 years ago

Thanks! I was trying to update the database with your 00.KEGG_Data_Scrapper.py but I didn't know how you made the 01.Bifurcating_List.txt and 02.Structural_List.txt files.

Did you pull those from a database or manually categorize them? Luckily your tool was available because I hadn't realized how complicated the modules are with the bifurcations and nested modules.

What category do "signature" modules fall under?

Would it be possible to add a small version file whenever a new database is created that records the date accessed? For example:

[COMMAND] cat 00.Version.txt
[OUTPUT]
Version: XYZ
Accessed: 2021.03.22
URL: ftp/http://kegg..
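
(The scrapper could just write something like this at download time; the exact release string and URL would come from wherever the database is pulled, so those are placeholders here.)

from datetime import date

# Hypothetical sketch: record when/where the KEGG data was pulled
with open("00.Version.txt", "w") as handle:
    print("Version: <KEGG release string>", file=handle)
    print(f"Accessed: {date.today():%Y.%m.%d}", file=handle)
    print("URL: <KEGG download URL>", file=handle)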

This would be really useful for cross-referencing between different tools that use some version of KEGG. For example, if I was using my KEGG mapping from 2016 it probably wouldn't have worked because so many of the orthologs/modules were deprecated.

Also, your webdriver and bs4 code is really clean. I didn't know they could be used together like that! Learned quite a bit when reading through your code.

cruizperez commented 3 years ago

Hi Josh! Actually, I think those bifurcating and structural lists were done manually because of the nested modules and bifurcations. I realized that most of the attempts I made to automate those broke others, so I decided to do them manually for now. I think many of the signature ones fall under the regular modules because they are just lists of KO identifiers, so they were easy to parse with the regular ones. I will definitely add the version information for the latest update. Thanks for the compliments! It was my first time using those libraries so I am glad what I did was somewhat decent! Btw, if you want, we can collaborate on the next release of the tool. I am thinking about including some changes to improve its speed and convenience.

jolespin commented 3 years ago

Yes, I'm definitely interested in collaborating. I've been doing some of this in random scripts here and there, but it would be good to have it a bit more formalized. Is the limiting factor to automating the database update getting the modules categorized? If so, I'm wondering if we could use a KEGG definition parser and create a networkx graph from each definition to check whether it is bifurcated. Are the structural modules parsed differently than the regular modules? If the category could be autodetected from a definition, that could make automating the whole data download much easier.
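
As a rough first pass (just a heuristic sketch of mine, not your actual regular/bifurcating/structural classification), even scanning each DEFINITION for nesting depth and top-level alternatives might help flag the tricky ones:

def definition_stats(definition):
    """Rough stats on a KEGG module DEFINITION string: maximum parenthesis
    nesting depth and whether alternatives (commas) occur at the top level.
    Heuristic only; not MicrobeAnnotator's actual module classification."""
    depth = 0
    max_depth = 0
    top_level_alternative = False
    for character in definition:
        if character == "(":
            depth += 1
            max_depth = max(max_depth, depth)
        elif character == ")":
            depth -= 1
        elif character == "," and depth == 0:
            top_level_alternative = True
    return max_depth, top_level_alternative

# Using the definition quoted at the top of this issue
definition = ("K00855 (K01601-K01602) K00927 (K05298,K00150,K00134) "
              "(K01623,K01624) (K03841,K02446,K11532,K01086) K00615 "
              "(K01623,K01624) (K01100,K11532,K01086) K00615 (K01807,K01808)")
print(definition_stats(definition))  # -> (1, False): shallow nesting, no top-level alternatives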

matrs commented 3 years ago

I was about to ask a couple of questions about ko_mapper.py. I've been using this package lately (we already cited it in an accepted paper) and I happened to use the *Information.pickle files this package creates for a couple of things, but I didn't understand how they were created. An upvote for the options mentioned here.

jolespin commented 3 years ago

@matrs I believe this is the script used: https://github.com/cruizperez/MicrobeAnnotator/blob/master/data/01.KEGG_DB/00.KEGG_Data_Scrapper.py (@cruizperez please correct me if I'm wrong). It looks like the limiting factor to automating the database update is that the following files are user-supplied:

https://github.com/cruizperez/MicrobeAnnotator/blob/master/data/01.KEGG_DB/00.Module_Names.txt
https://github.com/cruizperez/MicrobeAnnotator/blob/master/data/01.KEGG_DB/01.Bifurcating_List.txt
https://github.com/cruizperez/MicrobeAnnotator/blob/master/data/01.KEGG_DB/02.Structural_List.txt

The first can easily be acquired from the most current KEGG with the following using Biopython:

from Bio.KEGG.REST import kegg_list

# Probably a better way to do this but I'm not good with regex.
# Note: kegg_list is the call for listing a whole database; kegg_get only
# fetches specific entries, so it won't work here.
for line in kegg_list("module"):  # "module" (or "md") is the code for the module database
    line = line.strip()
    if not line:
        continue
    name, description = line.split("\t", 1)  # the list output is tab-separated
    id = name.replace("md:", "")  # strip the "md:" prefix if present
    print(id, description)

The latter 2 are manually curated AFAIK.

matrs commented 3 years ago

Hey @jolespin, thanks for your response. I do use Biopython and I already had some dictionaries with module and KO descriptions, for example. I was curious about the manually curated files (this thread informed me about those) and I also wanted to use ko_mapper.py with other tools.

For example, I used 01.KEGG_Regular_Module_Information.pickle to get the list of KOs for each module, information that I couldn't obtain easily with Biopython (of course, you can always parse it yourself).
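
Roughly, you can pull the KO list per module out of that pickle like this (a sketch based on the dictionary structure used by ko_mapper.py, as shown earlier in this thread):

import re
import pickle

# Sketch: extract the KO identifiers per module from the parsed pickle;
# str() flattens the nested structure so a regex can pick out the KOs
with open("01.KEGG_Regular_Module_Information.pickle", "rb") as handle:
    regular_modules = pickle.load(handle)

module_to_kos = {
    module: set(re.findall(r"K\d{5}", str(steps)))
    for module, steps in regular_modules.items()
}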

cruizperez commented 3 years ago

Hi @matrs and @jolespin, indeed the data scrapper was the script used to get most of the modules, except for those that I manually supplied. Those were a little bit more cumbersome to parse, so I decided to do them manually for the time being. This is on the list of improvements for future releases.

jolespin commented 3 years ago

Just wanted to congratulate you on your recent pub! https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-020-03940-5

This will make it much easier to cite now.

cruizperez commented 3 years ago

Thanks @jolespin! I added a couple of changes to the new version, but I still need to improve on the retrieval of KO information.

jolespin commented 3 years ago

Do you have any ideas on how that would even work? I looked into a few approaches but to no avail. The bifurcated modules are tricky.

cruizperez commented 3 years ago

I know! I have been trying several things but it is not straightforward. What's weirder is that some of the non-bifurcating modules, e.g., structural modules, appear to have some parentheses missing in their definitions, and this prevents me from writing a generalizable method to parse them. Still looking into it though. :/

jolespin commented 3 years ago

Oh man this is bringing back some traumatizing memories of when I tried to make a generalizable parser a few months back before I found this tool haha.

Have you thought about reaching out to the KEGG team about this? I would hope at least one of their team members has made a reliable parser for these definitions. Since MAPLE has been discontinued, your tool is, AFAIK, literally the only way to calculate module completion ratios. That would probably be really useful for them.