geneontology / go-site

A collection of metadata, tools, and files associated with the Gene Ontology public web presence.
http://geneontology.org
BSD 3-Clause "New" or "Revised" License

Update main pipeline output to produce usable GPAD/GPI 2.0 #2043

Closed kltm closed 7 months ago

kltm commented 1 year ago

This was created from a conversation @ukemi and @sierra-moxon , making explicit an implicit task.

geneontology/gopreprocess#9

ukemi commented 1 year ago

Location of @sierra-moxon's files

sierra-moxon commented 1 year ago

first pass of merged GPAD (all noctua MGI annotations from current.geneontology.org + all preprocessed/upstream annotations produced in this mini-pipeline in a GPAD 2.0 file): https://drive.google.com/drive/folders/1aZxvumsODSvXGbk_gMdFtuGhhAq4MKdL

I already see two issues: taxon isn't coming through the conversion for some of the rows, and some of the rows were labeled as provided_by MGI -- I think both of these issues come from the GAF->GPAD step and not as a result of the underlying GAF generation, but I am confirming.

leemdi commented 1 year ago

@sierra-moxon

I am seeing !gpa-version: 1.2 at the end of the merged_gpad_11_08_2023.txt. should this file just contain 2.0?

thanks. Lori
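A quick way to check whether a merged file mixes format versions is to scan it for every `!gpa-version:` header line, wherever it occurs. The following is a minimal sketch, not the pipeline's actual check; the function name and the placeholder rows are hypothetical.

```python
def find_version_headers(lines):
    """Return (line_number, version) for every '!gpa-version:' header line."""
    found = []
    for number, line in enumerate(lines, start=1):
        if line.startswith("!gpa-version:"):
            found.append((number, line.split(":", 1)[1].strip()))
    return found


# Placeholder content standing in for a merged file (rows are made up).
merged = [
    "!gpa-version: 2.0",
    "annotation row in GPAD 2.0 layout",
    "!gpa-version: 1.2",
    "annotation row in GPAD 1.2 layout",
]

headers = find_version_headers(merged)
versions = {version for _, version in headers}
if len(versions) > 1:
    print("mixed versions found:", headers)
```

A header that appears anywhere other than the top of the file, or more than one distinct version, suggests that two differently formatted files were concatenated.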

ukemi commented 1 year ago

I am looking at the errors that Lori's load threw:

  1. We were missing an RO identifier that I have added to MGI
  2. There are several GO-REFS that we don't have in MGI. I will look at those closely and most likely add them.
  3. We don't have any of the Reactome references. I need to figure out what to do with those. I will need to track down how we handle those annotations.
  4. Lori says the load is filtering out a lot of duplicates. This doesn't surprise me.

Today I will also just do a sanity check on @sierra-moxon's file.

ukemi commented 1 year ago

Notes from yesterday's group call.

ukemi commented 1 year ago

Hi @sierra-moxon. I have a couple of questions to reassure myself that I didn't just think things without actually putting them into the requirements. 1) We are not appending annotations to the mega-file when the UniProt identifier didn't map to an MGI gene identifier, but we would like a report of those that didn't. 2) We are processing the IEA annotations for mouse that would be in the non-isoform file consumed in pipeline #329. We filter out the IEAs in the ISO loads.
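The first requirement (skip annotations whose UniProt identifier has no MGI gene, but report them) could be sketched roughly as below. This is only an illustration: the function name, the tuple layout, and the MGI identifier in the lookup table are all made up, and the real mapping would come from MGI.

```python
def split_by_mapping(annotations, uniprot_to_mgi):
    """Partition (uniprot_id, go_id) pairs into mapped and unmapped lists."""
    mapped, unmapped = [], []
    for uniprot_id, go_id in annotations:
        mgi_id = uniprot_to_mgi.get(uniprot_id)
        if mgi_id is None:
            # Not appended to the mega-file; kept only for the report.
            unmapped.append((uniprot_id, go_id))
        else:
            mapped.append((mgi_id, go_id))
    return mapped, unmapped


# Hypothetical lookup table; the MGI identifier is a placeholder.
uniprot_to_mgi = {"UniProtKB:O35305": "MGI:MGI:0000001"}
annotations = [
    ("UniProtKB:O35305", "GO:0005515"),
    ("UniProtKB:P01723", "GO:0005886"),
]

mapped, unmapped = split_by_mapping(annotations, uniprot_to_mgi)
for uniprot_id, go_id in unmapped:
    print("no MGI gene for", uniprot_id, go_id)
```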

ukemi commented 1 year ago

After a bit of investigation:

  1. It looks like the Reactome annotations are being filtered at MGI. These lines are from @leemdi's unresolvedB.error:
    UniProtKB P01723 P01723 located_in GO:0005886 Reactome:R-MMU-983702 TAS C Ig lambda-1 chain V region protein taxon:10090 20120109 Reactome
    UniProtKB P01843 P01843 located_in GO:0005886 Reactome:R-MMU-983702 TAS C Ig lambda-1 chain C region protein taxon:10090 20120109 Reactome
    However, I see only 54 annotations in this error file, even though there are more than a thousand hits in the incoming GAF. @leemdi, can you figure out where the others are being filtered?

  2. It looks like the IEAs are included as the missing refs are for methods that we didn't previously run at MGI. See my analysis here: https://docs.google.com/spreadsheets/d/1LwwN3RgyGsDQfdggczJ34Qu78XV-WtkB1JtdBOHPZw4/edit#gid=0

ukemi commented 1 year ago

@leemdi and @sierra-moxon It looks like the GPAD1.2 annotations are the ones from the Noctua output.

deustp01 commented 1 year ago
  1. Ig lambda-1 chain V region

@ukemi Could the nature of this immunoglobulin UniProt be causing its own specific problem? UniProt has separate instances for the constant region and the variable region of what occurs in the body, mouse or human, as a single polypeptide encoded by a gene that is not present in the germline but created somatically. The annotation is thus a hack in two ways (both unavoidable, as far as I can tell). First, it represents the full length immunoglobulin chain as a complex of a UniProt C protein and a separate UniProt V protein. Second, that V protein is an arbitrarily chosen single instance because there's no way to represent the diversity of possible V regions in this annotation.

ukemi commented 1 year ago

Yep! I think you are spot on and this may be the case with the 54 annotations in that report! When I look at the report now that you point this out, all the genes are things like this (immunoglobulin regions, histones). So the one above might be a red herring wrt why all the annotations are failing our load. I also think that we are filtering on our end because we don't have Reactome reactions/pathways as references in MGI. It was in that report that I first detected this. eg:

Invalid Reference/either no pubmed id or no jnum (5): Reactome:R-MMU-1008243
Invalid Reference/either no pubmed id or no jnum (5): Reactome:R-MMU-1013867
Invalid Reference/either no pubmed id or no jnum (5): Reactome:R-MMU-1013873
Invalid Reference/either no pubmed id or no jnum (5): Reactome:R-MMU-111519
Invalid Reference/either no pubmed id or no jnum (5): Reactome:R-MMU-1168790
Invalid Reference/either no pubmed id or no jnum (5): Reactome:R-MMU-1168910

leemdi commented 1 year ago

when we process the GOA/Mouse, we save all of the Reactome rows (for which we do not have the reference in MGI) to a goamouse.gaf file. Then we append the goamouse.gaf to the end of our mgi/gaf.

ukemi commented 1 year ago

Thanks @leemdi ! Yes, I saw that file. @deustp01 we should revisit this at a GO@MGI lab meeting.

leemdi commented 1 year ago

@sierra-moxon you mean: the !gpa-version: 1.2 at the end of the merged_gpad_11_08_2023?

leemdi commented 1 year ago

@sierra-moxon

Looks like the ones in the 1.2 format, which I am skipping, are the Noctua ones. I think you mentioned this earlier.
I am skipping them because I changed my code to process version 2.0, not 1.2.

example: Shh GO:0000122

ukemi commented 1 year ago

@sierra-moxon and @leemdi I am looking at the list of errors in which Uberon IDs were not converted to EMAPA IDs. For many, I don't see a mapping. But for some I do see an EMAPA xref, but can't find that ID. Here is the list of errors: https://docs.google.com/spreadsheets/d/1knEybI3QBkiaKHfBhIasKjOJpAqueDN5055_YKdcUFU/edit#gid=0

Looks like the mapping needs updating on the EMAPA/UBERON end. I have emailed Terry about this and sent her the list.

Terry says she will look at the list and open tickets for new mappings at UBERON. Since she has a much better background in all things anatomy than I do, this is a good plan.

leemdi commented 1 year ago

@sierra-moxon

the lines with multiple entries in field 7 contain '"'

this line is OK: MGI:MGI:1919439 RO:0002327 GO:0005515 PMID:15102471 ECO:0000353 UniProtKB:O35305 2023-09-09 GO_Central

this line has '"' in field 7: MGI:MGI:1333854 RO:0002327 GO:0005515 PMID:23478294 ECO:0000353 "UniProtKB:O35305,UniProtKB:P24604,UniProtKB:P35991,UniProtKB:Q78T81,UniProtKB:Q8CIH5"

is the '"' surrounding the multiple UniProtKB terms expected? or is this something you want to fix?

ukemi commented 1 year ago

Hi @leemdi, I believe comma and pipe separated values in the 'with' field are allowed. https://github.com/geneontology/go-annotation/blob/master/specs/gpad-gpi-2-0.md

leemdi commented 1 year ago

@ukemi @sierra-moxon ok, that seems odd to me. if I'm processing field 7, then I would expect the delimiter, but not a leading/ending ". but I can get rid of these on our end.

ukemi commented 1 year ago

Sorry! I missed that point. I don't think there should be a ".
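Until the quotes are removed upstream, a consumer can strip them before splitting the field. The sketch below assumes tab-separated GPAD 2.0 columns with field 7 as With/From and an empty negation column in the example row; the function name is hypothetical.

```python
def clean_with_from(line):
    """Return field 7 as a list of alternatives, each a list of CURIEs."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 7:
        raise ValueError("expected at least 7 tab-separated fields")
    raw = fields[6].strip()
    # Drop one pair of surrounding double quotes, if present.
    if len(raw) >= 2 and raw[0] == '"' and raw[-1] == '"':
        raw = raw[1:-1]
    if not raw:
        return []
    # "|" separates alternatives; "," joins identifiers within one alternative.
    return [alternative.split(",") for alternative in raw.split("|")]


# The quoted row from this thread, rebuilt with assumed tab positions.
line = "\t".join([
    "MGI:MGI:1333854", "", "RO:0002327", "GO:0005515", "PMID:23478294", "ECO:0000353",
    '"UniProtKB:O35305,UniProtKB:P24604,UniProtKB:P35991,UniProtKB:Q78T81,UniProtKB:Q8CIH5"',
])
with_from = clean_with_from(line)
print(with_from)
```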

sierra-moxon commented 1 year ago

> Hi @sierra-moxon. I have a couple of questions to reassure myself that I didn't just think things without actually putting them into the requirements.
>
>   1. We are not appending annotations to the mega-file when the UniProt identifier didn't map to an MGI gene identifier, but we would like a report of those that didn't.

yes, I can do that, but I don't have it done yet.

>   2. We are processing the IEA annotations for mouse that would be in the non-isoform file consumed in pipeline #329. We filter out the IEAs in the ISO loads.

Yes, I include IEA annotations from GOA, but not from the orthology transformation loads (ISO loads).

sierra-moxon commented 1 year ago

> @sierra-moxon
>
> the lines with multiple entries in field 7 contain '"'
>
> this line is OK: MGI:MGI:1919439 RO:0002327 GO:0005515 PMID:15102471 ECO:0000353 UniProtKB:O35305 2023-09-09 GO_Central
>
> this line has '"' in field 7: MGI:MGI:1333854 RO:0002327 GO:0005515 PMID:23478294 ECO:0000353 "UniProtKB:O35305,UniProtKB:P24604,UniProtKB:P35991,UniProtKB:Q78T81,UniProtKB:Q8CIH5"
>
> is the '"' surrounding the multiple UniProtKB terms expected? or is this something you want to fix?

definitely I want to fix this! :) thank you for finding it.

sierra-moxon commented 1 year ago

> @sierra-moxon
>
> Looks like the ones in the 1.2 format, which I am skipping, are the Noctua ones. I think you mentioned this earlier. I am skipping them because I changed my code to process version 2.0, not 1.2.
>
> example: Shh GO:0000122

gotcha - I am updating.

leemdi commented 1 year ago

@sierra-moxon I have found some obsolete GO terms in the gpad. Is there any way to use a new MGI/Lori file?

133 annotations

examples:

GO:0000083 MGI:101934 J:164563 ISO UniProtKB:Q14186 GO_MGI 2023-03-22 MGI go_qualifier_id&=&RO:0002331&==&go_qualifier_term&=&involved_in&==&evidence&=&ECO:0000266

GO:0090305 MGI:102779 J:164563 ISO UniProtKB:P39748 GO_MGI 2016-09-14 MGI go_qualifier_id&=&RO:0002331&==&go_qualifier_term&=&involved_in&==&evidence&=&ECO:0000266

sierra-moxon commented 1 year ago

absolutely! I will rerun with new files.

leemdi commented 1 year ago

@sierra-moxon I am also finding quotes in gpad/field 11 (property).

leemdi commented 1 year ago

@sierra-moxon @ukemi

in field 11/properties

I am used to seeing things like this: occurs_in(CL:0000622),occurs_in(EMAPA:18537),has_input(PR:Q01341)

now I am seeing this: 2018-07-23 GO_Central RO:0002233(RNAcentral:URS000075DA6B_10090),RO:0002233(RNAcentral:URS000075A5B2_10090),RO:0002233(RNAcentral:URS000075E1B6_10090),BFO:0000066(UBERON:0002107)

David, so, I assume that I need to map the RO or BFO to terms? But not sure what to do with the rest of the info in this new Property.

David: I have found a couple of duplicates in MGI/GO Property vocabulary.

occurs_at | BFO:0000066
occurs_in | BFO:0000066

happens_during | RO:0002092
during | RO:0002092

has_input | RO:0002233
results_in_division_of | RO:0002233
has_regulation_target | RO:0002233
has_direct_input | RO:0002233

has_target_end_location | RO:0002339
has_end_location | RO:0002339

exists_during | RO:0002491
existence_starts_and_ends_during | RO:0002491
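Duplicates like the ones listed above can be found mechanically by grouping the vocabulary by identifier and keeping the identifiers that carry more than one label. This is a small sketch using the pairs from this comment; the function name is hypothetical, and it does not show how the MGI vocabulary is actually stored.

```python
from collections import defaultdict


def labels_by_identifier(pairs):
    """Group (label, identifier) pairs and keep identifiers with 2+ labels."""
    grouped = defaultdict(list)
    for label, identifier in pairs:
        grouped[identifier].append(label)
    return {ident: labels for ident, labels in grouped.items() if len(labels) > 1}


vocabulary = [
    ("occurs_at", "BFO:0000066"),
    ("occurs_in", "BFO:0000066"),
    ("happens_during", "RO:0002092"),
    ("during", "RO:0002092"),
    ("has_input", "RO:0002233"),
    ("results_in_division_of", "RO:0002233"),
    ("has_regulation_target", "RO:0002233"),
    ("has_direct_input", "RO:0002233"),
    ("has_target_end_location", "RO:0002339"),
    ("has_end_location", "RO:0002339"),
    ("exists_during", "RO:0002491"),
    ("existence_starts_and_ends_during", "RO:0002491"),
]

duplicates = labels_by_identifier(vocabulary)
for identifier, labels in duplicates.items():
    print(identifier, "->", ", ".join(labels))
```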

leemdi commented 1 year ago

@sierra-moxon in the mgi.gpad (lori file), I see that in field 11, some of the delimiters are "," and some are "|". Is there a way to use the same delimiter? Either "," or "|", but not a mix? This would make it easier for me on our end.

MGI:MGI:108212 RO:0002331 GO:0042327 PMID:21052097 ECO:0000315 2014-07-25 GO_Central RO:0002092(GO:0060546)|RO:0002233(UniProtKB:Q63844)

MGI:MGI:108212 RO:0002331 GO:0070301 PMID:27258785 ECO:0000315 2018-11-27 GO_Central "BFO:0000066(CL:0000746),BFO:0000050(GO:0070301)"

kltm commented 1 year ago

> in the mgi.gpad (lori file), I see that in field 11, some of the delimiters are "," and some are "|". Is there a way to use the same delimiter? Either "," or "|", but not a mix?

Only | should be legal? https://github.com/geneontology/go-annotation/blob/master/specs/gpad-gpi-2-0.md

balhoff commented 1 year ago

"|" separates alternatives; "," joins conjunctive expressions within each alternative (deeper in the grammar). So it's fine to have just commas if they all combine to describe a single context.
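To make the two-level structure concrete, here is a minimal parsing sketch for field 11. It assumes each unit has the form RELATION(TARGET), as in the rows quoted in this thread; the function name and the regular expression are illustrative, not taken from the spec or from any GO library.

```python
import re

# One extension unit: a relation CURIE followed by a target in parentheses.
EXTENSION = re.compile(r"^([^()]+)\(([^()]+)\)$")


def parse_extensions(field):
    """Parse field 11 into alternatives, each a list of (relation, target)."""
    if not field:
        return []
    alternatives = []
    for alternative in field.split("|"):      # "|" separates alternatives
        conjunction = []
        for unit in alternative.split(","):   # "," joins units in one context
            match = EXTENSION.match(unit.strip())
            if match is None:
                raise ValueError(f"unparseable extension: {unit!r}")
            conjunction.append((match.group(1), match.group(2)))
        alternatives.append(conjunction)
    return alternatives


piped = parse_extensions("RO:0002092(GO:0060546)|RO:0002233(UniProtKB:Q63844)")
joined = parse_extensions("BFO:0000066(CL:0000746),BFO:0000050(GO:0070301)")
print(len(piped), "alternatives;", len(joined), "alternative")
```

Keeping the nesting (a list of lists) preserves the difference between the two delimiters, which would be lost if both were rewritten to a single character.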

leemdi commented 1 year ago

@sierra-moxon I can deal with "|" or ",". For MGI purposes, I'll just convert them to one or the other for consistency in MGI logic.

sierra-moxon commented 1 year ago

I think what @balhoff said is important though - each delimiter means something different, so if you convert them to one or the other, you may lose information.

kltm commented 1 year ago

@sierra-moxon @leemdi Assuming that they are correctly being used according to spec, "|" separates statements and "," can be used within them. Both are used and required according to the spec, and both may be emitted.

leemdi commented 1 year ago

@sierra-moxon MGI processing is different. We take field 3, field 11 and field 12 and merge all of that info into one MGI-property (for our schema). So, GO may differentiate these fields. But in MGI-world, we do not.

leemdi commented 1 year ago

@sierra-moxon @kltm as long as our MGI/GO annotation looks correct and we can pass things on to our public interface & display, we are happy.

kltm commented 1 year ago

@leemdi We're bound to follow the spec in this case. We can look at iterating on the spec if needed, but MGI can also do post-processing to help a correct file conform to your schema needs.

leemdi commented 1 year ago

@kltm I don't think GO needs to change the spec. We can handle what we get with the mgi.gpad. Again, our schema is merging 3 fields -> 1 field. And this works for us. No issue.

ukemi commented 1 year ago

@leemdi

The duplicate properties are in MGI because when we were doing the MGI load into Noctua, we cleaned up relations by mapping some textual relations that we were not going to use any more to a single RO term. Since the new inputs will be in GPAD 2.0, they will contain the relation identifiers. We should use the official RO relation labels for those identifiers. Here are the ones for which I will remove the mapping in MGI once we switch over to the new file.

occurs_at | BFO:0000066
during | RO:0002092
results_in_division_of | RO:0002233
has_regulation_target | RO:0002233
has_direct_input | RO:0002233
has_end_location | RO:0002339
exists_during | RO:0002491

leemdi commented 1 year ago

@sierra-moxon

Please add Li Ni to this thread.

David H. & Li are checking the results of your new merged file (merged_gpad/11-16-2023) on my MGI-Scrum database. Things went smoothly, except for the PR:PR: that we already talked about yesterday.

One interesting thing is that I only found 1 RefSeq in the gpad. Should there be more?

RefSeq:NR_028355 RO:0002327 GO:0030374 PMID:15180993 ECO:0000314 2020-11-17 MGI noctua-model-id=gomodel:MGI_MGI_1344414|contributor=https://orcid.org/0000-0003-2689-5511|model-state=production

leemdi commented 1 year ago

And will this be the home of the new gpad that I will be picking up?

https://snapshot.geneontology.org/annotations/mgi.gpad.gz

balhoff commented 1 year ago

@sierra-moxon by the way for the line in the previous comment:

RefSeq:NR_028355 RO:0002327 GO:0030374 PMID:15180993 ECO:0000314 2020-11-17 MGI noctua-model-id=gomodel:MGI_MGI_1344414|contributor=https://orcid.org/0000-0003-2689-5511|model-state=production

All ID values must be CURIEs using a prefix from db-xrefs.yaml, so the ORCID should be orcid:0000-0003-2689-5511.
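The rewrite from ORCID URL to CURIE is a simple prefix swap. A minimal sketch follows, assuming the only URL forms to handle are the http and https variants of orcid.org; the function name is hypothetical.

```python
def orcid_to_curie(value):
    """Rewrite an ORCID URL as an 'orcid:' CURIE; leave other values as-is."""
    for prefix in ("https://orcid.org/", "http://orcid.org/"):
        if value.startswith(prefix):
            return "orcid:" + value[len(prefix):]
    return value


curie = orcid_to_curie("https://orcid.org/0000-0003-2689-5511")
print(curie)
```

Values that are already CURIEs pass through unchanged, so the function can be applied to every contributor value without checking first.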

ukemi commented 1 year ago

@leemdi @LiNiMGI The annotation was made by an MGI curator and was part of the set that was migrated to Noctua. Annotations to Refseq objects would not be coming from one of the loads (ISO or GOA-Mouse). The easiest way to check this would be to look at input for the current Noctua load and see if this is the only one. IIRC it is. We cleaned up a lot of these when we migrated from MGI to Noctua.

ukemi commented 1 year ago

I'm trying to track down this annotation now. In the future, it would be really nice if a report were generated that would make it easy for a curator to see where an annotation in the mega-file comes from as well as to see what annotations were filtered out for various reasons.

ukemi commented 1 year ago

Here is the model: http://noctua.geneontology.org/editor/graph/gomodel:MGI_MGI_1344414?

Here is the annotation from the noctua_mgi.gpad: RefSeq NR_028355 enables GO:0030374 PMID:15180993 ECO:0000314 20201117 MGI noctua-model-id=gomodel:MGI_MGI_1344414|contributor=https://orcid.org/0000-0003-2689-5511|model-state=production

There is only one RefSeq annotation in the file, so all is working as expected.

ukemi commented 1 year ago

Hi @sierra-moxon It looks like the PAINT annotations are not in your file. I can't find PMID:21873635 in the file @leemdi loaded. @LiNiMGI

dustine32 commented 1 year ago

@ukemi It's likely due to this recent change we made https://github.com/pantherdb/fullgo_paint_update/issues/58 as requested by @pgaudet. So, PAINT IBAs now have GO_REF:0000033 in the reference column instead of PMID:21873635.

ukemi commented 1 year ago

Thanks @dustine32 That means they will be filtered on our end because we will need to add that reference to MGI. Problem solved!

ukemi commented 1 year ago

Hi @dustine32 and @sierra-moxon

I just tried to add the ref to MGI and it was already there. Further inspection shows no annotations with ref GO_REF:0000033 in the file @leemdi used either.

sierra-moxon commented 1 year ago

Dumb question: which part of this process should have brought in the PAINT annotations to the resulting GPAD?

Here's what I did:

  1. generate MGI annotations via orthology with Rat into GAF
  2. generate MGI annotations via orthology with Human into GAF
  3. generate MGI GOA annotations into GAF
  4. generate MGI GOA isoform annotations into GAF
  5. combine all four files into one GAF
  6. translate the resulting merged GAF into GPAD 2.0 format
  7. translate the mgi-noctua gpad 1.2 produced from the pipeline into GPAD 2.0
  8. concatenate the merged GPAD 2.0 from the "remainders" with the converted GPAD 2.0 from noctua and give the resulting file to Lori :)

So, it might be that I missed including a PAINT GAF conversion into GPAD 2.0 to merge with the other GPAD 2.0s above?
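The final concatenation step described above, with a PAINT source added, might look roughly like the sketch below. It assumes every input has already been converted to GPAD 2.0 and is held in memory as a list of lines. The function name, the source names, and the placeholder rows are hypothetical. It also records which source each row came from, which is not part of the process described here but would support the kind of provenance report requested earlier in the thread.

```python
def merge_gpad(sources):
    """Concatenate GPAD 2.0 rows from named sources under a single header.

    sources maps a source name to a list of lines; per-file header lines
    (starting with "!") and blank lines are dropped.
    """
    merged = ["!gpa-version: 2.0"]
    provenance = []
    for name, lines in sources.items():
        for line in lines:
            if line.startswith("!") or not line.strip():
                continue
            merged.append(line)
            provenance.append((name, line))
    return merged, provenance


# Placeholder rows; real inputs would be full GPAD 2.0 annotation lines.
sources = {
    "remainders": ["!gpa-version: 2.0", "row-1", "row-2"],
    "noctua": ["!gpa-version: 2.0", "row-3"],
    "paint": ["row-4"],
}
merged, provenance = merge_gpad(sources)
print(len(merged) - 1, "rows from", len(sources), "sources")
```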

ukemi commented 1 year ago

Actually not a dumb question at all. I had to go back and look! It looks like it is a step we forgot! @leemdi should confirm, but it looks like they are filtered from the mgi.gaf that the GOC produces. We call it the gorefgen load! AHHHHHH! There has to be a better way than that though. @kltm can help. In the mgi.gaf I see:

Header copied from paint_mgi_valid.gaf
!=================================
!Created on Mon Jun 5 16:29:03 2023.
!generated-by: PANTHER
!date-generated: 2023-06-05
!PANTHER version: v.17.0.
!GO version: 2023-05-10.

Maybe add this one: http://snapshot.geneontology.org/products/upstream_and_raw_data/paint_mgi.gpad.gz?

We also have the 'inference file' that we load: We call it the GO cfp load. Again, @leemdi can point you to the source, but it comes from the GOC. I think it is this one: http://snapshot.geneontology.org/products/upstream_and_raw_data/mgi-prediction.gaf

These should be straightforward (he says) because these annotations already have a GOC source.

kltm commented 1 year ago

I'm not quite following here. If somebody wants to clarify the data flow, please grab me as needed so we can draw it out.