Open hunter-moseley opened 3 years ago
To document the planned changes:
Planning on identifying regulatory inferences in GO by incorporating inferred regulatory ancestors of regulates/negatively_regulates/positively_regulates edges into the list of annotations associated with a gene/gene product in a gene annotation file (GAF). This "regulatory GAF" (rGAF) should allow for enrichment of regulatory mechanisms when used as an input for hypergeometric enrichment analyses.
4 types of rGAFs will exist, one for each type of regulation edge in GO, and one for all three:
The inference logic is as follows: If (A) -[regulates/positively_regulates/negatively_regulates]-> (B); and (A) -[is_a]-> (A') -[is_a]-> (A''); and (B) -[is_a/part_of/part_of_some]-> (B') -[is_a/part_of/part_of_some]-> (B'')
Then all instances of genes with annotations B, B', or B'' will--in the rGAF--instead be annotated to A, A', and A'' (the full ancestor set of hypernym relations).
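To make the rule above concrete, here is a minimal sketch on a toy graph. It is not the GOcats implementation: the edge-tuple representation, the term names, and the helper names `ancestors` and `regulatory_annotations` are all assumptions made for illustration.

```python
from collections import defaultdict

# Toy edges as (child, relation, parent); every term name here is hypothetical.
EDGES = [
    ("A", "regulates", "B"),
    ("A", "is_a", "A1"),
    ("A1", "is_a", "A2"),
    ("B", "is_a", "B1"),
    ("B1", "part_of", "B2"),
]
REGULATION = {"regulates", "positively_regulates", "negatively_regulates"}
REGULATOR_RELS = {"is_a"}                             # walked from A
REGULATEE_RELS = {"is_a", "part_of", "part_of_some"}  # walked from B


def ancestors(term, relations, edges):
    """Return the term plus every ancestor reachable through `relations`."""
    parents = defaultdict(set)
    for child, rel, parent in edges:
        if rel in relations:
            parents[child].add(parent)
    seen, stack = {term}, [term]
    while stack:
        for parent in parents[stack.pop()]:
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def regulatory_annotations(gene_terms, edges, relations=REGULATION):
    """Infer regulator terms (A and its is_a ancestors) for one gene."""
    inferred = set()
    for a, rel, b in edges:
        if rel not in relations:
            continue
        b_plus_ancestors = ancestors(b, REGULATEE_RELS, edges)
        # The gene qualifies if any of its annotations is B or an ancestor of B.
        if gene_terms & b_plus_ancestors:
            inferred |= ancestors(a, REGULATOR_RELS, edges)
    return inferred


result = regulatory_annotations({"B1"}, EDGES)
```

Passing a narrower `relations` set (for example only `positively_regulates`) would correspond to one of the per-edge-type rGAFs; the default covers all three.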
The process should be accomplished with three nested loops:
In the rGAF, original gene annotations (and their ancestors) are not associated with the original gene; the rGAF contains regulatory annotations only. However, we may enable a special case of rGAF which also includes the original annotations in future iterations.
* part_of_some is a logical approximation of the inverse of has_part, where the interpretation is that some instances of the ancestors of one concept are part of the other concept (non-universal; i.e. some but not all instances of B part_of B' if the original relation was B' has_part B). This logical approximation is appropriate in the context of gene annotation enrichment; see Hinderer et al. 2019.
In creating the rGAF, the original gene annotations must not be included, since they do not represent the regulation relationship.
There could be an option to include original gene annotations that match an A set, but this should be an option and not the default. Also, if this option is allowed, then the ancestors of any original gene annotation matching an A set would need to be included as well. The resulting rGAF would thus include the direct annotations of the regulator (A) and the regulation annotations based on matching B.
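A rough sketch of how that optional behaviour could look, assuming the inferred regulatory terms, the union of all A sets, and an ancestor lookup are already available. The function name `rgaf_terms`, its parameters, and the `include_matching_originals` switch are hypothetical, not part of GOcats.

```python
def rgaf_terms(inferred, originals, regulator_terms, ancestors_of,
               include_matching_originals=False):
    """Return the terms written to the rGAF for one gene.

    inferred        -- regulatory terms inferred from matching a B set
    originals       -- the gene's original annotations
    regulator_terms -- union of all A sets (hypothetical precomputed input)
    ancestors_of    -- dict mapping a term to its ancestors (hypothetical)
    """
    result = set(inferred)
    if include_matching_originals:  # off by default
        for term in originals:
            if term in regulator_terms:
                # Keep the direct regulator annotation and its ancestors.
                result.add(term)
                result |= set(ancestors_of.get(term, ()))
    return result


default_terms = rgaf_terms({"A"}, {"R", "B"}, {"A", "R"}, {"R": {"R1"}})
with_originals = rgaf_terms({"A"}, {"R", "B"}, {"A", "R"}, {"R": {"R1"}},
                            include_matching_originals=True)
```

Here the original annotation `B` is never carried over, because it does not match an A set.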
I've updated the description of the planned changes, do they look correct now? I'll hopefully have some time to work on it this week as long as I'm understanding the intention properly.
Just to be clear, you need to check if a gene's specific annotation is a member of the B_plus_ancestors set. This was not explicitly stated.
Okay, check the italicized changes and hopefully I've captured it accurately now!
That clearly states what should be done. By the way, it would be good to have options that limit the A_plus_ancestors and B_plus_ancestors sets to just A and B, respectively. Something like --limit-regulator (for A) and --limit-regulatee (for B).
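As an illustration of the two proposed flags, here is a small sketch. The flag names come from the suggestion above; the use of `argparse`, the program name, and the `select_sets` helper are assumptions for this example and do not reflect how the GOcats command line is actually built.

```python
import argparse


def build_parser():
    """Hypothetical parser exposing only the two proposed flags."""
    parser = argparse.ArgumentParser(prog="rgaf-sketch")
    parser.add_argument("--limit-regulator", action="store_true",
                        help="use only A instead of A_plus_ancestors")
    parser.add_argument("--limit-regulatee", action="store_true",
                        help="use only B instead of B_plus_ancestors")
    return parser


def select_sets(a, a_ancestors, b, b_ancestors,
                limit_regulator=False, limit_regulatee=False):
    """Pick the regulator and regulatee sets according to the flags."""
    a_set = {a} if limit_regulator else {a} | set(a_ancestors)
    b_set = {b} if limit_regulatee else {b} | set(b_ancestors)
    return a_set, b_set


args = build_parser().parse_args(["--limit-regulator"])
a_set, b_set = select_sets("A", {"A1", "A2"}, "B", {"B1"},
                           args.limit_regulator, args.limit_regulatee)
```

With only `--limit-regulator` given, matching still uses B and its ancestors, but only A itself is written as the regulatory annotation.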
I'm wondering if I should just add a new argument to gocats.categorize_dataset() for outputting the rGAF? The issue is that we aren't necessarily interested in categorizing the annotations in this use case.
Alternatively, I could write a new top-level function. This would mean that you could run it from the command line. That function would:
Also, I created a new branch for tracking these changes. I think it's best to work within GitHub for these changes, since we're already in release versions.
Would suggest a new top-level function. If done the right way, the rGAF could be later categorized.
Okay, working on it now!
@hunter-moseley When you get a chance, could you double-check my logic in the new commit on rGAF? Here's the permalink to the new create_regulatory_gaf() method.
I am running out of memory when doing this. I believe including all ancestors of each annotation is too permissive. It's leading to a lot of regulatory annotations being added. From the few I looked at, they looked reasonable. But I'd like to make sure I'm not doing anything silly before suggesting we move this to the computing cluster.
mapped_rgaf_array needs to be built from a set of node.id, otherwise you are likely to have a lot of duplicates. This may be why you are running out of memory. Also, you need to make sure you are analyzing one gene's worth of annotations at a time. Otherwise, mapped_rgaf_array is going to have a lot of duplicates.
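A minimal sketch of that per-gene, set-based approach follows. It assumes annotations arrive as (gene_id, go_id) pairs and that a lookup from a GO id to its inferred regulator ids already exists; `build_rgaf_rows` and `regulators_of` are hypothetical names, and this is not the code in create_regulatory_gaf().

```python
from collections import defaultdict


def build_rgaf_rows(annotations, regulators_of):
    """Build de-duplicated (gene_id, regulator_id) rows, one gene at a time.

    annotations   -- iterable of (gene_id, go_id) pairs
    regulators_of -- dict mapping a go_id to the regulator ids inferred
                     for it (hypothetical precomputed lookup)
    """
    terms_by_gene = defaultdict(set)
    for gene_id, go_id in annotations:
        terms_by_gene[gene_id].add(go_id)

    rows = []
    for gene_id in sorted(terms_by_gene):
        inferred = set()  # a set of ids, so repeated regulators collapse
        for go_id in terms_by_gene[gene_id]:
            inferred.update(regulators_of.get(go_id, ()))
        rows.extend((gene_id, term) for term in sorted(inferred))
    return rows


annotations = [("g1", "B"), ("g1", "B1"), ("g2", "B")]
regulators_of = {"B": {"A", "A1"}, "B1": {"A", "A1"}}
rows = build_rgaf_rows(annotations, regulators_of)
```

Gene `g1` reaches the same regulators through both `B` and `B1`, but each regulator appears only once in its rows.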
Changes are reflected in rGAF branch. Was this tested and is it safe to merge?
I think it is safe to merge.
We plan on using the rGAFs generated so far in the next few months, hopefully to generate a publishable result.
On Mon, May 23, 2022 at 11:37 AM Eugene Hinderer wrote:
> Changes are reflected in rGAF branch. Was this tested and is it safe to merge?
Would be nice if GOcats could generate a regulation GAF.