MoseleyBioinformaticsLab / GOcats

A tool for categorizing Gene Ontology into subgraphs of user-defined emergent concepts

Regulation GAF #14

Open hunter-moseley opened 3 years ago

hunter-moseley commented 3 years ago

Would be nice if GOcats could generate a regulation GAF.

ehinderer commented 3 years ago

To document the planned changes:

Planning on identifying regulatory inferences in GO by incorporating inferred regulatory ancestors of regulates/negatively_regulates/positively_regulates edges into the list of annotations associated with a gene/gene product in a gene annotation file (GAF). This "regulatory GAF" (rGAF) should allow for enrichment of regulatory mechanisms when used as input for hypergeometric enrichment analyses.

Four types of rGAF will exist: one for each type of regulation edge in GO (regulates, positively_regulates, negatively_regulates) and one for all three combined.

The inference logic is as follows: If (A) -[regulates/positively_regulates/negatively_regulates]-> (B); and (A) -[is_a]-> (A') -[is_a]-> (A''); and (B) -[is_a/part_of/part_of_some*]-> (B') -[is_a/part_of/part_of_some*]-> (B'')

Then all instances of genes with annotations B, B', or B'' will--in the rGAF--instead be annotated to A, A', and A'' (the full ancestor set of hypernym relations).

The process should be accomplished with three nested loops:

  1. Iterate through gene annotations provided in the original GAF (with direct annotations).
  2. Iterate through all edges in GO, searching for regulates/positively_regulates/negatively_regulates/(any) edges
  3. When a regulation edge is found, if the object of the edge or any of its ancestors (B, B', or B'' in the example above) is in the set of annotations for the gene/gene product in the original GAF, create a new annotation set which includes A and its ancestors as described in the inference logic above. Replace the original annotation set for genes found with regulation edges with this new set in the rGAF.

In the rGAF, original gene annotations (and their ancestors) are not associated with the original gene; the rGAF contains regulatory annotations exclusively. However, we may enable a special case of rGAF which also includes the original annotations in future iterations.

* part_of_some is a logical approximation of the inverse of has_part, where the interpretation is that some instances of the ancestors of one concept are part of the other concept (non-universal; i.e. some but not all instances of B part_of B' if the original relation was B' has_part B). This logical approximation is appropriate in the context of gene annotation enrichment; see Hinderer et al. 2019.
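The inference logic and loops described above can be sketched roughly as follows. This is a minimal illustration, not the GOcats implementation: the edge-triple format, the function names `ancestors` and `regulatory_annotations`, and the choice to omit genes with no matching regulation edge are all assumptions made for the example.

```python
from collections import defaultdict

# Relation groups taken from the description above.
REGULATION = {"regulates", "positively_regulates", "negatively_regulates"}
HYPERNYM = {"is_a"}                               # used for A, A', A''
SCOPING = {"is_a", "part_of", "part_of_some"}     # used for B, B', B''


def ancestors(term, edges, relations):
    """Return `term` plus every ancestor reachable via the given relations.

    `edges` is a hypothetical list of (subject, relation, object) triples.
    """
    parents = defaultdict(set)
    for subj, rel, obj in edges:
        if rel in relations:
            parents[subj].add(obj)
    seen, stack = {term}, [term]
    while stack:
        for parent in parents[stack.pop()]:
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def regulatory_annotations(gene_annotations, edges, relations=REGULATION):
    """Map each gene to its inferred regulator terms (A plus is_a ancestors)."""
    rgaf = {}
    for gene, terms in gene_annotations.items():       # step 1: each gene
        inferred = set()
        for subj, rel, obj in edges:                    # step 2: each edge
            if rel not in relations:
                continue
            b_plus_ancestors = ancestors(obj, edges, SCOPING)
            if terms & b_plus_ancestors:                # step 3: match on B set
                inferred |= ancestors(subj, edges, HYPERNYM)
        if inferred:
            # Original annotations are replaced, not kept (assumed default).
            rgaf[gene] = inferred
    return rgaf
```

Passing a single relation, e.g. `relations={"positively_regulates"}`, would correspond to one of the per-edge-type rGAFs. The sketch recomputes ancestor sets on every pass for clarity; a real implementation would presumably cache them.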

hunter-moseley commented 3 years ago

In creating the rGAF, the original gene annotations must not be included, since they do not represent the regulation relationship.

There could be an option to include original gene annotations that match an A set, but this should be an option and not the default. Also, if this option is allowed, then the ancestors of any original gene annotation matching an A set would need to be included as well. The resulting rGAF would thus include the direct annotations of the regulator (A) and the regulation annotations based on matching B.

ehinderer commented 3 years ago

I've updated the description of the planned changes. Do they look correct now? I'll hopefully have some time to work on it this week as long as I'm understanding the intention properly.

hunter-moseley commented 3 years ago

Just to be clear, you need to check if a gene's specific annotation is a member of the B_plus_ancestors set. This was not explicitly stated.

ehinderer commented 3 years ago

Okay, check the italicized changes and hopefully I've captured it accurately now!

hunter-moseley commented 3 years ago

That clearly states what should be done. By the way, it would be good to have options that limit the A_plus_ancestors and B_plus_ancestors sets to just A and B, respectively. Something like --limit-regulator (for A) and --limit-regulatee (for B).
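One way the two proposed limits might behave, sketched as keyword arguments rather than command-line flags. The function name `build_term_sets` and the precomputed ancestor dictionaries are hypothetical and only illustrate the intended effect on the A and B sets.

```python
def build_term_sets(regulator, regulatee,
                    regulator_ancestors, regulatee_ancestors,
                    limit_regulator=False, limit_regulatee=False):
    """Return (A set, B set) for one regulation edge.

    `regulator_ancestors` and `regulatee_ancestors` are assumed to be
    dicts mapping a term to the set of its ancestors.
    """
    a_set = {regulator}
    b_set = {regulatee}
    if not limit_regulator:      # default: A plus its ancestors
        a_set |= regulator_ancestors.get(regulator, set())
    if not limit_regulatee:      # default: B plus its ancestors
        b_set |= regulatee_ancestors.get(regulatee, set())
    return a_set, b_set
```

With both limits enabled, only genes annotated directly to B would receive the single regulatory annotation A.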

ehinderer commented 3 years ago

I'm wondering if I should just add a new argument to gocats.categorize_dataset() for outputting the rGAF? The issue is that we aren't necessarily interested in categorizing the annotations in this use case.

Alternatively, I could write a new top-level function. This would mean that you could run it from the command line. That function would:

Also, I created a new branch for tracking these changes. I think it's best to work within GitHub for these changes, since we're already in release versions.

hunter-moseley commented 3 years ago

Would suggest a new top-level function. If done the right way, the rGAF could be later categorized.

ehinderer commented 3 years ago

Okay, working on it now!

ehinderer commented 3 years ago

@hunter-moseley When you get a chance could you double check my logic in the new commit on rGAF. Here's the permalink to the new create_regulatory_gaf() method.

I am running out of memory when doing this. I believe including all ancestors of each annotation is too permissive. It's leading to a lot of regulatory annotations being added. From the few I looked at, they looked reasonable. But I'd like to make sure I'm not doing anything silly before suggesting we move this to the computing cluster.

hunter-moseley commented 3 years ago

mapped_rgaf_array needs to be built from a set of node.id values; otherwise you are likely to have a lot of duplicates. This may be why you are running out of memory. Also, you need to make sure you are analyzing one gene's worth of annotations at a time. Otherwise, mapped_rgaf_array is going to accumulate duplicates across genes as well.
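A small sketch of that suggestion, assuming each mapped node exposes an `id` attribute. The `Node` tuple, the function name `mapped_ids_per_gene`, and the input layout are placeholders for illustration, not the code in the rGAF branch.

```python
from collections import namedtuple

# Stand-in for a graph node; only the `id` attribute matters here.
Node = namedtuple("Node", "id")


def mapped_ids_per_gene(gene_to_nodes):
    """Yield (gene, sorted unique node ids), one gene at a time.

    Collecting ids in a set removes repeats that arise when several
    annotations share ancestors, and yielding per gene avoids holding
    every gene's mapped terms in memory at once.
    """
    for gene, nodes in gene_to_nodes.items():
        unique_ids = {node.id for node in nodes}
        yield gene, sorted(unique_ids)
```

Each yielded pair could be written to the output file immediately, so memory use is bounded by the largest single gene rather than by the whole GAF.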

ehinderer commented 2 years ago

Changes are reflected in rGAF branch. Was this tested and is it safe to merge?

hunter-moseley commented 1 year ago

I think it is safe to merge.

We plan on using the rGAFs generated so far in the next few months, hopefully to generate a publishable result.

