MoseleyBioinformaticsLab / GOcats

A tool for categorizing Gene Ontology into subgraphs of user-defined emergent concepts

remapping has_part for enrichment #20

Closed rmflight closed 1 year ago

rmflight commented 2 years ago

Every time I want to do a GO remapping that properly follows the has_part relationship, I end up reusing some code I found in the enrichment manuscript, a copy of which is here.

I've gone through the documentation for GOcats and the source, and I can't find anywhere where this functionality is just "built-in". Given that it's one of the more common use cases (at least on my end), should we have a function that just does this process?

Or is it actually mostly there already, and I just don't know which set of GOcats functions to put together to do this more easily?

ehinderer commented 2 years ago

By following has_part the proper way, you mean re-representing the relationship using the conventional scope directionality, as argued in the original publication, right? So A-has_part->B becomes B-some_part_of->A.

If that's the case, this is the default behavior of GOcats build_graph_interpreter() (https://github.com/MoseleyBioinformaticsLab/GOcats/blob/84ceb8b18b52e5e90d97d4998fb369b066d6cbba/gocats/gocats.py#L60). The standard workflow I've used for GOcats has been to call this function from the command line with the required arguments. Setting the --relationship_directionality parameter to anything other than "gocats" will cause the GOparser() initializer (https://github.com/MoseleyBioinformaticsLab/GOcats/blob/84ceb8b18b52e5e90d97d4998fb369b066d6cbba/gocats/ontologyparser.py#L44) to treat all relationships as pointing in a naïve, default direction. Of course, running with the default directions without omitting known problematic relationships (like has_part, positively_regulates, etc.) is not recommended, as these are known to cause incorrect semantic inferences.
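To make the re-representation concrete, here is a minimal sketch of reversing has_part edges on a toy edge list. It does not use GOcats at all; the `normalize_edges` helper, the `REVERSED` table, and the term IDs are made up for illustration, and only the A-has_part->B to B-some_part_of->A rule comes from the comment above.

```python
# Toy illustration only: edges are (source, relation, target) tuples.
# REVERSED maps a problematic relation to the name used after flipping it.
REVERSED = {"has_part": "some_part_of"}


def normalize_edges(edges):
    """Return a new edge list in which every has_part edge is flipped and renamed."""
    normalized = []
    for source, relation, target in edges:
        if relation in REVERSED:
            # A-has_part->B becomes B-some_part_of->A
            normalized.append((target, REVERSED[relation], source))
        else:
            normalized.append((source, relation, target))
    return normalized


edges = [
    ("GO:A", "has_part", "GO:B"),
    ("GO:C", "is_a", "GO:A"),
]
normalized = normalize_edges(edges)
print(normalized)
```

After normalization every edge points from the narrower term to the broader one, so a single upward traversal can collect ancestors without special-casing has_part.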

hunter-moseley commented 2 years ago

Eugene,

I think the point Robert is trying to make is that the current gocats command line interface does not provide a method to dump out a dictionary of node to ancestors. This likely requires more discussion of what should be added to GOcats CLI.

The quick fix is to provide new subcommands save_gene_ancestor_map and save_namespace_map based on the code example that Robert provided. (But I do not know how busy you are and do not want to ask you to do this unless you have the time and the implementation looks relatively simple to you.)

Warm regards, Hunter


-- Hunter Moseley, Ph.D. -- Univ. of Kentucky Associate Professor, Dept. of Molec. & Cell. Biochemistry / Markey Cancer Center / Institute for Biomedical Informatics / UK Superfund Research Center Not just a scientist, but a fencer as well. My foil is sharp, but my mind sharper still.

Email: @. (work) @. (personal) Phone: 859-218-2964 (office) 859-218-2965 (lab) 859-257-7715 (fax) Web: http://bioinformatics.cesb.uky.edu/ Address: CC434 Roach Building, 800 Rose Street, Lexington, KY 40536-0093

rmflight commented 2 years ago

Thanks Hunter!

That is exactly the point.

The code builds the graph according to the set of allowed relationships, which are passed into the function:

graph = gc.build_graph_interpreter(go_database, allowed_relationships=allowed_relationships)

Here is the full function definition, just for reference.

import csv
import json
from collections import defaultdict

from gocats import gocats as gc  # assumed import path; build_graph_interpreter is defined in gocats/gocats.py


def build_ancestor_list(go_database, goa_gaf, allowed_relationships, ancestor_filename, namespace_filename):
    graph = gc.build_graph_interpreter(go_database, allowed_relationships=allowed_relationships)
    # Build the annotation dictionary: DB object symbol keys, sets of GO terms as values.
    goa_gene_annotation_dict = defaultdict(set)
    with open(goa_gaf, 'r') as gaf:
        reader = csv.reader(gaf, delimiter='\t', quoting=csv.QUOTE_MINIMAL)
        for line in reader:
            if len(line) < 5 or line[0].startswith('!'):
                continue  # skip GAF header/comment lines and rows too short to index
            goa_gene_annotation_dict[line[2]].add(line[4])
    ancestor_dict = defaultdict(set)
    missing_go_terms = set()
    # Add the ancestors of every annotated term to each gene's set of terms.
    for gene_symbol, go_term_set in goa_gene_annotation_dict.items():
        ancestor_dict[gene_symbol].update(go_term_set)
        for go_term in go_term_set:
            if go_term in graph.id_index:
                ancestor_dict[gene_symbol].update([node.id for node in graph.id_index[go_term].ancestors])
            else:
                # NOTE: These are missing because they are deprecated IDs that are now ALT IDs of another term.
                # Need to incorporate alt ids in GOcats.
                missing_go_terms.add(go_term)
    # Sets are not JSON serializable, so convert them to lists.
    ancestor_dict = {gene_symbol: list(go_term_set) for gene_symbol, go_term_set in ancestor_dict.items()}
    # Write the gene-to-terms mapping.
    with open(ancestor_filename, "w") as output_file:
        json.dump(ancestor_dict, output_file)
    # Write the GO term-to-namespace mapping.
    with open(namespace_filename, "w") as output_file:
        namespace_translation = {node.id: node.namespace for node in graph.node_list}
        json.dump(namespace_translation, output_file)
    return None

build_ancestor_list("../go_data/go.obo", "../go_data/goa_human.gaf", ["is_a", "part_of", "has_part"],
  "../go_data/ancestors_list.json", "../go_data/namespace_translation.json")

I would be more than willing to attempt a pull request to add this functionality to GOcats. I just wanted to be sure it didn't already exist somewhere and that I had missed it.
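For anyone reading along, the GAF-reading step of that function can be tried in isolation. The sketch below is GOcats-independent and rests on a few assumptions: the `read_gene_annotations` name is hypothetical, the example rows are truncated to five columns (real GAF rows have more), and lines starting with `!` are treated as header comments. It relies on GAF column 3 being the DB Object Symbol and column 5 the GO ID, which matches the `line[2]` and `line[4]` indexing above.

```python
import csv
import io
from collections import defaultdict

SYMBOL_COLUMN = 2  # GAF column 3: DB Object Symbol
GO_ID_COLUMN = 4   # GAF column 5: GO ID


def read_gene_annotations(handle):
    """Collect {gene_symbol: set_of_go_ids} from a tab-delimited GAF handle."""
    annotations = defaultdict(set)
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        # Skip blank lines, '!' header lines, and rows too short to index.
        if not row or row[0].startswith("!") or len(row) <= GO_ID_COLUMN:
            continue
        annotations[row[SYMBOL_COLUMN]].add(row[GO_ID_COLUMN])
    return dict(annotations)


# Made-up, truncated rows used only to exercise the parser.
example_gaf = io.StringIO(
    "!gaf-version: 2.2\n"
    "UniProtKB\tP12345\tGENE1\t\tGO:0000001\n"
    "UniProtKB\tP12345\tGENE1\t\tGO:0000002\n"
    "UniProtKB\tQ67890\tGENE2\t\tGO:0000001\n"
)
gene_annotations = read_gene_annotations(example_gaf)
print(gene_annotations)
```

`csv.QUOTE_NONE` is used here so that stray quote characters in annotation fields are read literally; that is a choice of this sketch, not something the original function does.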

ehinderer commented 2 years ago

Oh okay, sorry, I was rushed yesterday and didn't fully read through your code example. Adding a property to OboGraph (maybe called "ancestor_mapping" or something) that returns the dict comprehension {node.id: id_index[node.id].ancestors for node in self.node_list} might do what you're asking here, right? Then you'd just need to call graph.ancestor_mapping to get the mapping. The same could be done for descendants. I think it would look like this:

class OboGraph(object):
    ...
    def __init__(...):
        ...
        self._ancestor_mapping = None
        ...

    @property
    def ancestor_mapping(self):
        """...
        """
        if self._modified or not self._ancestor_mapping:
            self._update_graph()
            self._ancestor_mapping = {node.id: self.id_index[node.id].ancestors for node in self.node_list}
        return self._ancestor_mapping

Please check my logic, it's been a minute since I've done any OOP! BTW, ancestors is an @property on the node object, so it also updates automatically if the node has been modified for any reason.
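The caching pattern in that sketch can be checked on a self-contained toy. `ToyNode` and `ToyGraph` below are hypothetical stand-ins, not the GOcats classes; the attribute names `id_index`, `_modified`, and `_ancestor_mapping` are borrowed from the sketch above, while the traversal, the `add_edge` helper, and the use of `is None` in place of `not` for the cache check are assumptions of this example.

```python
class ToyNode:
    """Hypothetical stand-in for a graph node with an ancestors property."""

    def __init__(self, node_id):
        self.id = node_id
        self.parents = set()

    @property
    def ancestors(self):
        # Walk parent links upward and collect every node reached.
        found = set()
        stack = list(self.parents)
        while stack:
            node = stack.pop()
            if node not in found:
                found.add(node)
                stack.extend(node.parents)
        return found


class ToyGraph:
    """Hypothetical stand-in for a graph that caches its ancestor mapping."""

    def __init__(self):
        self.id_index = {}
        self._modified = False
        self._ancestor_mapping = None

    def add_edge(self, child_id, parent_id):
        child = self.id_index.setdefault(child_id, ToyNode(child_id))
        parent = self.id_index.setdefault(parent_id, ToyNode(parent_id))
        child.parents.add(parent)
        self._modified = True  # invalidate the cached mapping

    @property
    def ancestor_mapping(self):
        # Rebuild only when the graph changed or nothing is cached yet.
        if self._modified or self._ancestor_mapping is None:
            self._ancestor_mapping = {
                node_id: {ancestor.id for ancestor in node.ancestors}
                for node_id, node in self.id_index.items()
            }
            self._modified = False
        return self._ancestor_mapping


graph = ToyGraph()
graph.add_edge("GO:C", "GO:B")
graph.add_edge("GO:B", "GO:A")
first = graph.ancestor_mapping
print(first)
```

Note that this toy stores ancestor IDs rather than node objects, which keeps the result directly comparable and easy to serialize later.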

ehinderer commented 2 years ago

@rmflight does this addition seem like it would be helpful? I'll assign myself as a reminder to add this in if so.

rmflight commented 2 years ago

I don't know, @ehinderer. Honestly, I'm not 100% sure how that maps onto the logic in the function I provided earlier.

ehinderer commented 2 years ago

Is the ultimate goal to have a dictionary of {go_term: [all ancestors of that go_term]}?

rmflight commented 2 years ago

Yes, that is the goal.

And then be able to export that as a JSON for use in categoryCompare, etc.

hunter-moseley commented 2 years ago

Robert,

Your example code generates {gene_id:current_and_ancestor_node_list}.

Isn't that what you need for category compare?

Warm regards, Hunter


ehinderer commented 2 years ago

Okay, I think that would actually do 90% of it, except that it would need to convert each set of ancestors into a list first for JSON. The return value of ancestor_mapping is a dictionary covering ALL nodes in the graph object. But perhaps the function could also write a JSON file directly for more convenience.
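A small sketch of the set-to-list step, using only the standard library. The mapping contents are made-up term IDs; sorting the lists is an extra choice made here so that the output is stable between runs.

```python
import json

# Made-up mapping of term ID to a set of ancestor IDs.
ancestor_mapping = {
    "GO:0000003": {"GO:0000002", "GO:0000001"},
    "GO:0000002": {"GO:0000001"},
    "GO:0000001": set(),
}

# json refuses sets outright, which is why the conversion is needed.
try:
    json.dumps(ancestor_mapping)
except TypeError as error:
    print(f"sets are rejected: {error}")

# Convert every set to a sorted list before serializing.
serializable = {term: sorted(ancestors) for term, ancestors in ancestor_mapping.items()}
encoded = json.dumps(serializable, indent=2, sort_keys=True)
print(encoded)
```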

rmflight commented 2 years ago

@hunter-moseley Yes, that is what I need for categoryCompare.

I was really just trying to double check that this code didn't already exist in GOcats, and that I hadn't missed it.

Given its utility as essentially a prime use case when we don't want to generate just new GO term groups, I think the basics of this code should be part of GOcats, instead of my having to write and run a separate function myself.

Ideally, any python code written by a user should be 3 - 4 lines for this use case.

# read in OBO
# remap based on allowed relationships
# generate ancestors for each term after remapping
# spit out JSON

That's what I'm after, instead of the several steps involved in the current function.

ehinderer commented 2 years ago

With what I suggested, this would still require that you apply each term+ancestor to the gene symbol in your dict. Alternatively, there could be an additional function to take as input a dictionary of {gene: [annotations]} and do the annotation mapping all at once. If that's what you were looking for, it is not in GOcats currently, but could be added.
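As a rough sketch of that additional function: given a {gene: [annotations]} dictionary and a {go_term: ancestors} mapping, each gene's terms are expanded with their ancestors in one pass. The `build_gene_ancestor_map` name and the example data are hypothetical; collecting unknown terms in a separate set mirrors what the function posted earlier in this thread does with missing GO terms.

```python
def build_gene_ancestor_map(gene_annotations, ancestor_mapping):
    """Expand each gene's annotated terms with all of their ancestors.

    Returns the expanded {gene: sorted_term_list} dictionary and the set of
    annotated terms that were not found in ancestor_mapping.
    """
    gene_ancestor_map = {}
    missing_terms = set()
    for gene, terms in gene_annotations.items():
        expanded = set(terms)  # keep the directly annotated terms
        for term in terms:
            if term in ancestor_mapping:
                expanded.update(ancestor_mapping[term])
            else:
                missing_terms.add(term)
        gene_ancestor_map[gene] = sorted(expanded)
    return gene_ancestor_map, missing_terms


# Made-up inputs for illustration.
ancestor_mapping = {
    "GO:0000003": {"GO:0000002", "GO:0000001"},
    "GO:0000002": {"GO:0000001"},
    "GO:0000001": set(),
}
gene_annotations = {
    "GENE1": ["GO:0000003"],
    "GENE2": ["GO:0000002", "GO:9999999"],
}
gene_ancestor_map, missing_terms = build_gene_ancestor_map(gene_annotations, ancestor_mapping)
print(gene_ancestor_map)
print(missing_terms)
```

Because the values are already sorted lists, the result can be passed to `json.dump` without further conversion.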

hunter-moseley commented 2 years ago

Eugene,

I think both functionalities would be useful: node_ancestor_map and gene_ancestor_map.

These could be implemented as CLI subcommands that take the appropriate parameters.

I would suggest taking the code that Robert provided and incorporating it into GOcats. That should save implementation effort.
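One possible shape for such subcommands, sketched with the standard library's argparse purely for illustration. This is not how the GOcats CLI is defined; the program name, argument names, and the `--allowed_relationships` option are assumptions, and only the two subcommand names come from this comment.

```python
import argparse


def build_parser():
    """Build a hypothetical parser with the two proposed subcommands."""
    parser = argparse.ArgumentParser(prog="gocats-sketch")
    subparsers = parser.add_subparsers(dest="command", required=True)

    node_cmd = subparsers.add_parser(
        "node_ancestor_map", help="write {go_term: ancestors} as JSON")
    node_cmd.add_argument("go_database")
    node_cmd.add_argument("output_json")
    node_cmd.add_argument("--allowed_relationships", default="is_a,part_of,has_part")

    gene_cmd = subparsers.add_parser(
        "gene_ancestor_map", help="write {gene: terms and ancestors} as JSON")
    gene_cmd.add_argument("go_database")
    gene_cmd.add_argument("goa_gaf")
    gene_cmd.add_argument("output_json")
    gene_cmd.add_argument("--allowed_relationships", default="is_a,part_of,has_part")
    return parser


# Parse an example command line instead of sys.argv.
args = build_parser().parse_args(
    ["gene_ancestor_map", "go.obo", "goa_human.gaf", "out.json",
     "--allowed_relationships", "is_a,part_of"]
)
allowed = args.allowed_relationships.split(",")
print(args.command, allowed)
```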

Thanks, Hunter


ehinderer commented 2 years ago

Creating a new branch for changes. Hopefully can complete this week!

hunter-moseley commented 1 year ago

Eugene,

That would be an ancestor_map, but what Robert needs is a gene_ancestor_map: {gene_id: list(nodes and their ancestors)}.

Warm regards, Hunter


hunter-moseley commented 1 year ago

But let's be clear.

The following two functionalities would be useful:

ancestor_map {go_term: ancestor_list }

gene_ancestor_map {gene_id: ancestor_list_and_current_node_list}


rmflight commented 1 year ago

This will be addressed by #23 when it is merged.

rmflight commented 1 year ago

closed by #23