geneontology / gocamgen

Base repo for constructing GO-CAM model RDF

Update rules spreadsheet #68

Open dustine32 opened 4 years ago

dustine32 commented 4 years ago

Some more rule changes from the 2019-11-07 whole genome imports call:

dustine32 commented 4 years ago

I generated new bad_extensions spreadsheets with these changes.

@ukemi Here's the MGI sheet based on 2019-11-05 upstream GPAD.

@vanaukenk Here's the WB sheet still based on the 2019-10-07 GO release GPAD.

ukemi commented 4 years ago

Rats! Still over 100. Some of these should be cleaned up with the next MGI GPAD release and some still look like they should be passing to me.

dustine32 commented 4 years ago

@ukemi Here's this week's MGI report from the 2019-11-13 upstream GPAD with the validation rule changes from above: https://docs.google.com/spreadsheets/d/1Iw_9gFZTRvKyJ3RU8Q6X51wSE6FQGZ8B2Sb_mvs9DC4/edit#gid=0

dustine32 commented 4 years ago

@ukemi Here's this week's MGI report from the 2019-11-19 upstream GPAD with the validation rule changes from above: https://docs.google.com/spreadsheets/d/1reWzNHOb4E3rs2QWrVqDxOqyEb7OYB-H-TCCQanKnnc/edit#gid=0

ukemi commented 4 years ago

Happy dance! Less than 100. Let's look at the remainders tomorrow. @vanaukenk, woo hooo!

dustine32 commented 4 years ago

@ukemi @vanaukenk Just realized something with the adjacent_to rule for extracellular region (GO:0005576): https://github.com/geneontology/gocamgen/blob/c1d5724e52cc0efdcbea742bc4317ed6822581fd/resources/formatted_ext_patterns.tsv#L48

Regarding the MGI annotations to extracellular space (GO:0005615) and extracellular matrix (GO:0031012): both of these terms are descendants of extracellular region via the part_of relation. As it turns out, I'm only checking is_a descendants.

I don't know if we specifically discussed this, but should I include the part_of relation when checking primary term descendants? I could see this causing issues with the MF-part_of->BP bridge, though I would need to do some testing to confirm. Does ShEx follow part_of paths?
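To make the difference concrete, here is a minimal sketch of a descendant lookup that follows only is_a edges by default and can optionally follow part_of as well. It is not the gocamgen implementation: the `EDGES` list and the `descendants` helper are assumptions for illustration, the two part_of edges simply restate what is said above, and `EX:0000001` is a made-up placeholder term.

```python
from collections import defaultdict

# Toy edges as (child, relation, parent). The two part_of edges restate the
# claim made in this thread; EX:0000001 is a hypothetical is_a child.
EDGES = [
    ("GO:0005615", "part_of", "GO:0005576"),  # extracellular space
    ("GO:0031012", "part_of", "GO:0005576"),  # extracellular matrix
    ("EX:0000001", "is_a", "GO:0005576"),     # placeholder, not a real term
]


def descendants(term, relations=("is_a",)):
    """Return every term reachable downward from `term` over `relations`."""
    children = defaultdict(set)
    for child, rel, parent in EDGES:
        if rel in relations:
            children[parent].add(child)
    found, stack = set(), [term]
    while stack:
        for child in children[stack.pop()]:
            if child not in found:
                found.add(child)
                stack.append(child)
    return found


# is_a only: the two part_of descendants are not found.
is_a_only = descendants("GO:0005576")
# is_a plus part_of: all three terms are found.
with_part_of = descendants("GO:0005576", relations=("is_a", "part_of"))
```

With the is_a-only traversal, annotations to GO:0005615 and GO:0031012 would not match a rule keyed on GO:0005576, which is the behavior described above.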

ukemi commented 4 years ago

My gut feeling is that even if it works for the cases we have enumerated, it is not universally true and will open a can of worms. For example, let's hypothetically say that there is a cellular component that is a membrane-bound cytosolic vesicle and consists of a membrane that completely surrounds a lumen. Both the membrane and the lumen would be parts of the vesicle, and it would be true to say that the vesicle and its membrane are adjacent to the cytosol, but it would be false to say that the lumen is adjacent to the cytosol. I am very uncomfortable making rules that might not always be true. I'd rather be safe and assert only what we know. @vanaukenk ?

ukemi commented 4 years ago

It's early, but I've been thinking about this more. It seems like these types of issues would best be considered by thinking about rigid property chains. In this case, part_of-o-adjacent_to -> adjacent_to is not valid, so we wouldn't propagate.
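The hypothetical vesicle example can be written out as a small sketch showing why that chain is unsafe. The fact sets and the `apply_chain` helper are assumptions for illustration only; the names come from the hypothetical vesicle described earlier in this thread, not from the ontology.

```python
# Asserted facts from the hypothetical vesicle example, as (subject, object).
PART_OF = {("lumen", "vesicle"), ("membrane", "vesicle")}
ADJACENT_TO = {("vesicle", "cytosol"), ("membrane", "cytosol")}


def apply_chain(part_of, adjacent_to):
    """Naively apply part_of o adjacent_to -> adjacent_to, one step."""
    return {
        (part, neighbor)
        for (part, whole) in part_of
        for (subject, neighbor) in adjacent_to
        if whole == subject
    }


inferred = apply_chain(PART_OF, ADJACENT_TO)
# Anything inferred that was never asserted is a candidate false statement.
unsupported = inferred - ADJACENT_TO
```

Here `unsupported` contains only `("lumen", "cytosol")`, which is exactly the statement called out as false in the example, so propagating along this chain would assert something we do not know to be true.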

vanaukenk commented 4 years ago

Good catch, @dustine32. I agree with @ukemi: for now, we need to be conservative and just use the is_a hierarchy. The ShEx is only following part_of in BP, i.e. only a BP can be part of another BP. I'll take a look at the existing MGI and WB annotations to see what terms we need to add for 'adjacent to' for now, but we will need to flesh this out more in the future.

ukemi commented 4 years ago

I just fixed all the annotations that I think were problematic at the annotation-level. Will the next round yield a blank spreadsheet?

vanaukenk commented 4 years ago

@dustine32 @ukemi

For 'adjacent to', the only CC terms for which WB and MGI have direct annotations or annotations to an is_a child (according to AmiGO) are the following, so let's go with these for now:

extracellular region (GO:0005576)
extracellular space (GO:0005615)
extracellular matrix (GO:0031012)

I am switching the last three WB annotations that have 'part of' extensions with these terms (or is_a children) to 'adjacent to'.

vanaukenk commented 4 years ago

Actually, I just realized that one of these 'part of' extensions is coming from a GO-CAM model and the enables_o_occurs_in -> part_of property chain.

I will leave that annotation alone for now but we will want to make sure we have a rule in place to get the desired 'adjacent to' extensions back out of our models for the appropriate CC terms.

vanaukenk commented 4 years ago

@ukemi Working on the ShEx shapes, I have a question about the second 'results in specification of' entry in the tsv. The term associated with 'results in specification of' EMAPA,UBERON,WBbt is 'regulation of cell maturation', which I think might be a mistake. For this pair, I propose using 'pattern specification process' (GO:0007389). What do you think?

ukemi commented 4 years ago

Looks like in the ontology we have used it for 'specification of x organ identity' as well. I think we should revisit this. It was originally intended for cell fate.

ukemi commented 4 years ago

Cell maturation is definitely incorrect.

vanaukenk commented 4 years ago

Here's the definition of 'results in specification of':

"The relationship linking a cell and its participation in a process that results in the fate of the cell being specified. Once specification has taken place, a cell will be committed to differentiate down a specific pathway if left in its normal environment."

So, yes, we would either want to update the relation def or come up with a new relation for the 'specification of x organ identity' terms.

ukemi commented 4 years ago

I think we should keep the definition consistent with its original intent.

vanaukenk commented 4 years ago

Sounds good.
Looking through the MGI and WB annotations again, though, I'm not convinced we need that second line for 'results in specification of'.
And maybe we do want a check that this relation was only used with cell? @ukemi - if you agree, I'll delete that line from the tsv

ukemi commented 4 years ago

Maybe my spreadsheet won't be blank.

dustine32 commented 4 years ago

@vanaukenk @ukemi OK, I agree to just explicitly list the part_of-related terms in the adjacent_to rule rather than open up the code to globally traverse the part_of paths.

extracellular region (GO:0005576)
extracellular space (GO:0005615)
extracellular matrix (GO:0031012)

I can add the missing GO:0005615 and GO:0031012 to the TSV under branch issue-68-valid-exts: https://github.com/geneontology/gocamgen/blob/c1d5724e52cc0efdcbea742bc4317ed6822581fd/resources/formatted_ext_patterns.tsv#L48
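As a rough sketch of the agreed approach (explicit term list plus is_a only), the check could look something like the following. This does not reflect the actual TSV layout or the gocamgen code: `ADJACENT_TO_TERMS`, `IS_A_PARENTS`, and `primary_term_allowed` are hypothetical names, and `EX:0000002` is a made-up is_a child used only to exercise the traversal.

```python
# CC primary terms explicitly listed for the adjacent_to rule in this thread.
ADJACENT_TO_TERMS = {"GO:0005576", "GO:0005615", "GO:0031012"}

# Hypothetical is_a parents; real data would come from the ontology.
IS_A_PARENTS = {"EX:0000002": ["GO:0005615"]}


def primary_term_allowed(term):
    """True if `term` is a listed term or an is_a descendant of one.

    Only is_a edges are walked; part_of is deliberately not followed.
    """
    seen, stack = set(), [term]
    while stack:
        current = stack.pop()
        if current in ADJACENT_TO_TERMS:
            return True
        if current not in seen:
            seen.add(current)
            stack.extend(IS_A_PARENTS.get(current, []))
    return False
```

Listing GO:0005615 and GO:0031012 directly means they pass without any part_of traversal, and their is_a children still pass through the existing is_a logic.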

dustine32 commented 4 years ago

@vanaukenk @ukemi I made the above change for adjacent_to and merged this issue's branch into master. Since I have a new batch of rule changes to make from yesterday's call, I'm going to close this issue and make the changes under https://github.com/geneontology/gocamgen/issues/73.

But feel free to re-open this and/or continue the conversation here!