geneontology / pipeline

Declarative pipeline for the Gene Ontology.
https://build.geneontology.org/job/geneontology/job/pipeline/
BSD 3-Clause "New" or "Revised" License

Add remainder mouse annotation file from uniprot upstreams to MGI data stream #329

Open kltm opened 1 year ago

kltm commented 1 year ago

Add remainder mouse annotation file from uniprot upstreams to MGI data stream.

Listed as:
ftp://ftp.ebi.ac.uk/pub/databases/GO/goa/MOUSE/goa_mouse.gaf.gz
ftp://ftp.ebi.ac.uk/pub/databases/GO/goa/MOUSE/goa_mouse_isoform.gaf.gz

kltm commented 1 year ago

Tagging @sierra-moxon @ukemi

ukemi commented 1 year ago

@leemdi noticed today that the current mouse GPAD file, http://snapshot.geneontology.org/annotations/mgi.gpad.gz, is missing all of the column 12 data. When it exists, these data should include information about Noctua models.

The noctua_mgi.gpad: MGI:MGI:1918911 RO:0002327 GO:0003674 MGI:MGI:2156816|GO_REF:0000015 ECO:0000307 2010-02-09 MGI BFO:0000050(GO:0008150) creation-date=2010-02-09|modification-date=2010-02-09|model-state=production|noctua-model-id=gomodel:MGI_MGI_1918911|contributor-id=https://orcid.org/0000-0003-3394-9805

The same annotation from the above referenced file: MGI MGI:2685011 involved_in GO:0007389 PMID:26258302 ECO:0000315 MGI:MGI:4867020 20160729 MGI part_of(GO:0003183),part_of(GO:0009653),occurs_in(UBERON:0007151)

Looks like the file format for the mgi.gpad file is out of date.

leemdi commented 1 year ago

@ukemi @kltm @sierra-moxon

I don’t see noctua-model-id, etc. in mgi.gaf either.

I only see them in noctua_mgi.gpad

ukemi commented 1 year ago

The gaf won't have the Noctua info. I'm not even sure it is possible to have it in there (I think in GPAD these days). But it looks like the mgi.gpad file that is on snapshot (above) needs quite a bit of work and therefore failed @leemdi 's tests. As @sierra-moxon works through this project, hopefully it will come up to snuff like the noctua_mgi.gpad file.

kltm commented 1 year ago

@ukemi Yes, the GPAD coming from that direction would generally have received less attention as that has not been an official exchange product. That said, the format utilization should be harmonized, insofar as the information exists in the first place for the GAF-derived GPAD lines vs the GO-CAM-derived GPAD lines.

@sierra-moxon @dustine32 @ukemi , I think it's probably best to have another ticket added to the project that's GPAD output harmonization. I'm not sure who best to tackle that; it has likely not been of much concern since Eric last worked on it some time ago. @sierra-moxon, if you wanted to take a look at it first, that might work; otherwise, we can bring on @dustine32. Does that sound right to the both of you?

sierra-moxon commented 1 year ago

sounds good.

ukemi commented 1 year ago

Note that the format should harmonize with the noctua_mgi.gpad file since this is the one we currently load.

kltm commented 1 year ago

Fiddly note on that note: the format should be "correct" with respect to the spec, and that should be reflected in our outputs. That may not be a completely trivial task. I believe our current outputs are in a slightly indeterminate state: we have GPAD 2.0 coming in for GO-CAM, but probably have some divergence between ontobio output and minerva output, which are both likely not GPAD 2.0-based. Ideally, all things are in a known and documented standard format. First, of course, is to determine what the current state of play even is...

sierra-moxon commented 8 months ago

Per feedback from @kltm: these two files should be passed through "as is", with a reconstructed header that reflects these headers and the headers added from the other files in this bigger "MGI-remainders" process (e.g. the annotation files containing mouse annotations converted from human and rat annotations via orthology should be combined with these two to produce the final MGI automated annotation GAF).

sierra-moxon commented 8 months ago

So two req'ts from David:

  1. remove the annotations where provided_by = MGI ? (lots are provided_by GO_Central - keep these?)
  2. convert UniProt ids to MGI ids

sierra-moxon commented 8 months ago

https://docs.google.com/document/d/1lJp_PAQ4517ADcJnU96YRLtUCRSKz7MXT2anov2Udvc/edit

ukemi commented 8 months ago

Summary from yesterday's call with @ukemi @sierra-moxon @kltm @leemdi

  1. The GOC does not have to recapitulate all of the steps that are used in the MGI pipeline. Many of those steps are in place so that the data can be loaded and coordinated with the rest of MGI. The annotations that are not loaded are added back to the gaf that MGI creates for the community. It doesn't make sense for the GOC to do this filtering and re-addition step.
  2. @leemdi will need to modify the loads at MGI so that the 'MGI-specific filters' can be implemented as we do the loads.
  3. IEA annotations will also be retained from the GOA-mouse file.

The GOC (@sierra-moxon ) will need to do the following processing:

-QC/Awareness

If possible, keep track of uniprot mapping “misses”, so that MGI can “repair” their GPI (TBD)

ukemi commented 8 months ago

@sierra-moxon and @kltm

I have examined the GOA isoform file a bit today and most of the identifiers in it are not in our GPI file, although they are associated with MGI genes, for example UniProtKB:A0A075B5J2 is associated with MGI:98596, but it is not in the GPI file.

The header of the isoform file says: !The set of protein accessions included in this file is based on UniProt complete proteomes, which may provide more than one protein per gene. !They include all Swiss-Prot entries for the species plus any TrEMBL entries that have an Ensembl DR line. The TrEMBL entries are likely to overlap with the Swiss-Prot entries or their isoforms. !If a particular protein accession is not annotated with GO, then it will not appear in this file.

Looking at the annotations in this file, it looks like many of them are from TrEMBL and are not loaded into MGI but in many cases the gene already has a similar annotation in the standard gaf. For example this annotation in the isoform gaf: UniProtKB A0A024QYR9 Pten enables GO:0016314 GO_REF:0000003 IEA EC:3.1.3.67 F Phosphatase and tensin homolog Pten|PTEN protein taxon:10090 20230911 UniProt UniProtKB:A0A024QYR9

is from an un-reviewed TrEMBL record and is not loaded into MGI, even though there is a sequence->gene association. We need to ask @leemdi, but I think we filter annotations to TrEMBL records when we load. @leemdi this might be in the GOA-mouse load or in the UniProt load since they both generate GO annotations. Does the GOC want to load information from un-reviewed TrEMBL records? If so, there are at least two strategies we could take:

  1. If MGI doesn't put a sequence to gene association in our GPI for TrEMBL records, then this will not be translated to an MGI marker. It will be retained as an annotation at the GOC and be associated with the TrEMBL identifier.
  2. If the GOC does not want annotations associated with TrEMBL records, we will need to catch this in one of @sierra-moxon 's filtering steps.

ukemi commented 8 months ago

@sierra-moxon it would be really helpful for me if at first you could process the goa_mouse_isoform.gaf and the goa_mouse.gaf separately. I'd like to be able to tell how many annotations from each file are filtered.

leemdi commented 8 months ago

@ukemi @sierra-moxon

A0A024QYR9 is not being loaded into MGI as a GO/Mouse annotation because it fell into the "nopubmed.error" .

field 2,6

A0A024QYR9 GO_REF:0000003 A0A024QYR9 GO_REF:0000043 A0A024QYR9 GO_REF:0000002 A0A024QYR9 GO_REF:0000043 A0A024QYR9 GO_REF:0000043

leemdi commented 8 months ago

@ukemi @sierra-moxon

here is the uberon -> emapa obo file we use:

https://purl.obolibrary.org/obo/uberon.obo

there is an "xref:" for EMAPA terms
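As a rough illustration of how those "xref:" lines could be turned into a UBERON to EMAPA lookup, here is a minimal sketch. It assumes the standard OBO stanza layout (`[Term]`, `id:`, `xref:`); the function name and the sample IDs are made up for the example and are not taken from the MGI load.

```python
from collections import defaultdict

def uberon_to_emapa(obo_text):
    """Collect the EMAPA xrefs listed under each UBERON term in OBO text."""
    mapping = defaultdict(list)
    current_id = None
    for raw in obo_text.splitlines():
        line = raw.strip()
        if line.startswith("["):
            # A new stanza ([Term], [Typedef], ...) resets the current term.
            current_id = None
        elif line.startswith("id: UBERON:"):
            current_id = line[len("id: "):]
        elif current_id and line.startswith("xref: EMAPA:"):
            # Keep only the identifier; drop any trailing qualifier text.
            mapping[current_id].append(line[len("xref: "):].split()[0])
    return dict(mapping)

# Made-up stanzas, only to show the expected input shape.
SAMPLE = """[Term]
id: UBERON:0000001
name: made-up term
xref: EMAPA:00001
xref: FMA:12345

[Term]
id: UBERON:0000002
name: another made-up term
"""

emapa_for = uberon_to_emapa(SAMPLE)
```

Terms without an EMAPA xref simply do not appear in the result, so a caller can treat a missing key as "no EMAPA equivalent".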

ukemi commented 8 months ago

@sierra-moxon @leemdi

I just downloaded our most recent mouse gaf from the MGI reports page: http://www.informatics.jax.org/downloads/reports/gene_association.mgi.gz

I wanted to get a count of how many annotations were in the file that used UniProt identifiers as the primary annotation object (column 2), which means we couldn't map them to mouse genes. To my surprise, there were none. @leemdi, can you confirm that we still add the 'unloadable' annotations to the end of the MGI gaf?

@sierra-moxon , if this is the case, it would add another filtering step for annotations where the UniProt identifier doesn't map to an MGI gene.

ukemi commented 8 months ago

The filtering step was confirmed on this morning's managers' call at the GOC level. If there are annotations that use a protein identifier and it cannot be mapped to a mouse gene, then we shouldn't include those in the annotation corpus.
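A minimal sketch of that filtering step, which also keeps a tally of the mapping "misses" so they can be reported back. It assumes a lookup from UniProtKB IDs to MGI gene IDs has already been built from the GPI; `filter_unmapped` and `gene_for` are hypothetical names, and `UniProtKB:EXAMPLE0` is a made-up accession.

```python
from collections import Counter

def filter_unmapped(gaf_rows, gene_for):
    """Split GAF rows into mappable rows and a tally of unmapped IDs.

    gaf_rows: iterable of column lists (GAF columns 1 and 2 are DB and ID).
    gene_for: dict from 'UniProtKB:<accession>' to an MGI gene ID,
              assumed to have been built from the GPI beforehand.
    """
    kept, misses = [], Counter()
    for cols in gaf_rows:
        curie = f"{cols[0]}:{cols[1]}"
        if curie in gene_for:
            kept.append(cols)
        else:
            # Count every skipped annotation per identifier for the report.
            misses[curie] += 1
    return kept, misses

rows = [
    ["UniProtKB", "O54824", "Il16"],
    ["UniProtKB", "EXAMPLE0", "made-up"],
    ["UniProtKB", "EXAMPLE0", "made-up"],
]
gene_for = {"UniProtKB:O54824": "MGI:MGI:1270855"}
kept, misses = filter_unmapped(rows, gene_for)
```

Running this once per input file (goa_mouse.gaf and goa_mouse_isoform.gaf) would give separate counts of filtered annotations for each.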

ukemi commented 8 months ago

I have examined the annotations that have GO-REFs from the main file that we import from uniprot (goa_mouse.gaf). The results are in this spreadsheet: https://docs.google.com/spreadsheets/d/1LwwN3RgyGsDQfdggczJ34Qu78XV-WtkB1JtdBOHPZw4/edit#gid=0

Summary:

  1. At this point, MGI only runs a 'pipeline' for three of the methods used by UniProt to predict annotations, InterPro2GO, EC2GO and SPKW2GO. We will need to accommodate the other methods at MGI by creating references to allow their import.
  2. The GOC seems to combine the rat and human ISO pipelines at MGI into one reference. We may need to split this since the orthology comes from different sources?????
  3. NB I did not look at the isoform file.

sierra-moxon commented 8 months ago

GOA GAF conversion results (two files b/c we process isoform file separately): https://drive.google.com/drive/folders/1pcMltYV_mKbIzPF19W4exKHd6TEsGLa-

ukemi commented 8 months ago

Thanks @sierra-moxon After discussions with other curators and looking at the isoform file, I wonder if we can just skip it. But I will analyze your output too. I made a spreadsheet for the MGI files, but I was a bit confused as I couldn't find the annotations that were in the load file in the original input file. Maybe @leemdi can help.

ukemi commented 8 months ago

I take this back. It looks like these are legitimate annotations to isoforms. However, many are redundant with the non-isoform annotations. Redundancy rears its ugly head again.

We would convert the IDs to PRO ids since that is what we use for protein representation in MGI and therefore is what is in our GPI file. For example UniProtKB:P54830-1=PR:P54830-1. Here is the line in our GPI file: PR:P54830-1 mPTPN5/iso:mSTEP61 tyrosine-protein phosphatase non-receptor type 5 isoform mSTEP61 (mouse) mPTPN5/iso:mSTEP61 PR:000000001 NCBITaxon:10090 MGI:MGI:97807 UniProtKB:P54830-1

I notice that you made the provider GO_Central. This is not the case for these annotations because they are manual annotations made by individual groups. They should retain the original source. This one: Ptpn5 | enables | GO:0005515 | PMID:23932588 | IPI | UniProtKB:Q16539-1 | F | Tyrosine-protein phosphatase non-receptor type 5 |   | protein | taxon:10090 | 20230909 | GO_Central |   | UniProtKB:P54830-1 |   should be UniProt if it says UniProt, Intact if it says Intact etc and it is from this load: UniProtKB P54830 Ptpn5 enables GO:0005515 PMID:23932588 IPI UniProtKB:Q16539-1 F Tyrosine-protein phosphatase non-receptor type 5 Ptpn5 protein taxon:10090 20230909 IntAct UniProtKB:P54830-1

I think we will have to skip ones like this because they won't resolve to PRO identifiers, but I need to investigate this more: MGI:95674 | Gcg | involved_in | GO:0032092 | PMID:19915011 | IDA |   | P | Pro-glucagon |   | protein | taxon:10090 | 20130626 | GO_Central |   | UniProtKB:P55095-PRO_0000011280

ukemi commented 8 months ago

Hi @sierra-moxon

Thinking about the step where we replace UniProtKB:$$$$$$$ with PR:$$$$$$$ in the with field and the isoform field. It's not just a simple switch.

A second way to do this would be to use the GPI file. Since it represents 'truth', any UniProtKB identifier in the 'proteoform' or 'with' field that can be switched would be in the GPI file in lines that look like this: PR:Q8K1L6 m1190005I06Rik uncharacterized protein C16orf74 homolog (mouse) m1190005I06Rik PR:000000001 NCBITaxon:10090 MGI:MGI:1916168 UniProtKB:Q8K1L6
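The GPI-driven switch could be sketched as below. This is an illustration only: `swap_to_pro` is a hypothetical helper, and the lookup is written by hand from the GPI line quoted above rather than parsed from the real file.

```python
import re

# Matches a whole UniProtKB CURIE, including isoform or chain suffixes.
UNIPROT_CURIE = re.compile(r"UniProtKB:[\w-]+")

def swap_to_pro(field, pro_for):
    """Replace each UniProtKB ID that has a GPI-backed PR equivalent.

    field: the raw 'with' or proteoform column, possibly holding several IDs.
    pro_for: dict from UniProtKB CURIE to PR CURIE (assumed built from GPI).
    IDs that are not in the lookup are left untouched.
    """
    return UNIPROT_CURIE.sub(lambda m: pro_for.get(m.group(0), m.group(0)), field)

# Hand-written from the GPI line quoted in the comment above.
pro_for = {"UniProtKB:Q8K1L6": "PR:Q8K1L6"}
swapped = swap_to_pro("UniProtKB:Q8K1L6|UniProtKB:Q16539-1", pro_for)
```

Because only identifiers present in the lookup are rewritten, anything the GPI does not know about stays as a UniProtKB ID and can be reviewed separately.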

ukemi commented 7 months ago

Notes from yesterday's call:

ukemi commented 7 months ago

@sierra-moxon

@LiNiMGI and I just manually spot-checked the isoform annotations file above and determined that we should include these annotations. We realize that some will not map to MGI annotatable objects such as UniProtKB:P55095-PRO_0000011280, but most will.

So an annotation like this: Cldn18 | involved_in | GO:0120192 | PMID:22079592 | IMP |   | P | Claudin-18 |   | protein | taxon:10090 | 20221108 | GO_Central |   | UniProtKB:P56857-2 |  

Will look something like this in the final GPAD: PR:P56857-2 |   | RO:0002331 | GO:0120192 | PMID:22079592 | ECO:0000315 | |   | | UniProt

Note I changed the provider to match what is in the original annotation: UniProtKB P56857 Cldn18 involved_in GO:0120192 PMID:22079592 IMP P Claudin-18 Cldn18 protein taxon:10090 20221108 UniProt UniProtKB:P56857-2
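A minimal sketch of that GAF to GPAD row conversion, keeping the original provider. The two lookup tables contain only the pairs that can be read off the examples in this thread; the function name, the 12-column output layout and the pass-through handling of extensions are assumptions of the sketch, not the pipeline's actual code.

```python
# Only the pairs visible in this thread; a real conversion needs full tables.
QUALIFIER_TO_RO = {
    "involved_in": "RO:0002331",
    "acts_upstream_of_or_within": "RO:0002264",
}
EVIDENCE_TO_ECO = {"IMP": "ECO:0000315", "IDA": "ECO:0000314"}

def gaf_to_gpad(cols, subject_id):
    """Build a 12-column GPAD-style row from a 17-column GAF row.

    subject_id is the already-resolved annotation object (e.g. a PR ID).
    Unknown qualifiers or evidence codes raise KeyError in this sketch.
    """
    qualifiers = cols[3].split("|")
    negation = "NOT" if "NOT" in qualifiers else ""
    relation = next(q for q in qualifiers if q != "NOT")
    date = cols[13]  # GAF dates are YYYYMMDD
    return [
        subject_id,
        negation,
        QUALIFIER_TO_RO[relation],
        cols[4],                      # GO term
        cols[5],                      # reference
        EVIDENCE_TO_ECO[cols[6]],
        cols[7],                      # with/from
        "",                           # interacting taxon
        f"{date[:4]}-{date[4:6]}-{date[6:]}",
        cols[14],                     # provider kept from the source row
        cols[15],                     # extensions passed through unconverted
        "",                           # annotation properties
    ]

gaf_row = ["UniProtKB", "P56857", "Cldn18", "involved_in", "GO:0120192",
           "PMID:22079592", "IMP", "", "P", "Claudin-18", "Cldn18", "protein",
           "taxon:10090", "20221108", "UniProt", "", "UniProtKB:P56857-2"]
gpad_row = gaf_to_gpad(gaf_row, "PR:P56857-2")
```

Reading the provider straight from column 15 of the incoming row is what keeps UniProt, IntAct and so on from being overwritten with GO_Central.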

sierra-moxon commented 7 months ago

@ukemi @LiNiMGI - should the ones that came in, in this last round "with no dashes" also be included? Or should I just weed those out based on the absence of a dash?

LiNiMGI commented 7 months ago

@sierra-moxon -the file we checked this morning is the current "goa mouse converted - isoform" file in google drive, which does not include the "with no dashes" ones?

ukemi commented 7 months ago

I think you should weed them out because I don't think they always represent an isoform and as we saw yesterday, we don't really know if they represent the gene or the protein because both are in our GPI. Hopefully @LiNiMGI agrees, but I think we should follow the strict rule about the GPI representing the valid annotation objects in MGI. However, I do wonder if those that don't have the dashes are represented by what is essentially a duplicate annotation in the non-isoform file.

ukemi commented 7 months ago

@sierra-moxon, if you want us to be doubly sure, we need some examples of non-dash ones to trace. Can you provide a couple examples of ones that are in the GPI but don't have dashes?

ukemi commented 7 months ago

We just found the file from yesterday. @LiNiMGI and I just looked at some of the non-dash ones and they are bona-fide isoforms. https://docs.google.com/spreadsheets/d/1tDRaQijLGD0e81eZSP14OQmuL7r-PURk_R74VGuQg2A/edit#gid=0 The first one, UniProtKB:A0A087WRD7, is an isoform in MGI and we should take it. So the bottom line is that if the UniProt ID is associated only with a PRO id in the MGI GPI, then we should create the annotation:

Line in GPI: PR:A0A087WRD7 mStpg4/iso:short protein STPG4 isoform short (mouse) mStpg4/iso:short PR:000000001 NCBITaxon:10090 MGI:MGI:1922717 UniProtKB:A0A087WRD7

sierra-moxon commented 7 months ago

So the new rule is not that we should only take "dashed" UniProt IDs, but that we need to check whether the UniProt ID is associated with only a PRO ID (those associated with an MGI ID as well should be weeded out).

Sorry to be pedantic 🥴 - does this new rule apply to the "dashed" UniProt ids as well?

ukemi commented 7 months ago

No problem! I apologize that I've managed to make this really confusing.

Hopefully you can do this:

  1. Check to see if the UniProt identifier in column 17 of the incoming isoform gaf is in column 10 of the GPI file
  2. If it isn't, skip the annotation (and write it to a report?)
  3. If it is in the GPI file, does it map to only a PRO identifier in column 1 of the GPI, or does it map to both a PRO identifier in column 1 and, on another line, an MGI identifier in column 1 (2 occurrences in the file versus 4 for the ID)?
  4. If it maps to only a PRO identifier in column 1 of the GPI then create the annotation with the PRO identifier in column 1 of the GPAD
  5. If it maps to both a PRO identifier and an MGI identifier, skip it (and write it to the report?). It's weird that these exist and for now, we don't know what to make of them. Maybe some day we can work with Uniprot to define what they mean and modify this strategy.

Here is one that maps to both in the GPI UniProtKB:Q8VHW3 (skip): MGI:MGI:1859168 Cacng6 calcium channel, voltage-dependent, gamma subunit 6 2310033H20Rik SO:0001217 NCBITaxon:10090 UniProtKB:Q8VHW3

PR:Q8VHW3 mCACNG6 voltage-dependent calcium channel gamma-6 subunit (mouse) mCACNG6|neuronal voltage-gated calcium channel gamma-6 subunit (mouse) PR:000000001 NCBITaxon:10090 MGI:MGI:1859168 UniProtKB:Q8VHW3

Here is one that maps to only one in the GPI UniProtKB:A0A087WRD7 (keep): PR:A0A087WRD7 mStpg4/iso:short protein STPG4 isoform short (mouse) mStpg4/iso:short PR:000000001 NCBITaxon:10090 MGI:MGI:1922717 UniProtKB:A0A087WRD7
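The five steps above can be sketched roughly as follows. The sketch assumes a tab-separated GPI with the annotation object in column 1 and pipe-separated xrefs in column 10; `index_gpi`, `resolve_isoform` and the `gpi_line` padding helper are hypothetical names used only for this illustration.

```python
from collections import defaultdict

def index_gpi(gpi_lines):
    """Map each UniProtKB xref (GPI column 10) to the set of column 1 IDs."""
    index = defaultdict(set)
    for line in gpi_lines:
        if line.startswith("!"):
            continue  # header or comment line
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 10:
            continue
        for xref in cols[9].split("|"):
            if xref.startswith("UniProtKB:"):
                index[xref].add(cols[0])
    return index

def resolve_isoform(uniprot_id, index):
    """Return (pro_id_or_None, reason) for a GAF column 17 identifier."""
    subjects = index.get(uniprot_id, set())
    if not subjects:
        return None, "not in GPI"
    pro_ids = sorted(s for s in subjects if s.startswith("PR:"))
    has_gene = any(s.startswith("MGI:") for s in subjects)
    if len(pro_ids) == 1 and not has_gene:
        return pro_ids[0], "keep"
    if pro_ids and has_gene:
        return None, "maps to both PRO and MGI"
    return None, "no single PRO identifier"

def gpi_line(subject, xref):
    """Pad a minimal GPI line; the empty middle columns are placeholders."""
    return "\t".join([subject] + [""] * 8 + [xref])

index = index_gpi([
    gpi_line("MGI:MGI:1859168", "UniProtKB:Q8VHW3"),
    gpi_line("PR:Q8VHW3", "UniProtKB:Q8VHW3"),
    gpi_line("PR:A0A087WRD7", "UniProtKB:A0A087WRD7"),
])
```

The reason string returned alongside each skipped identifier is there so that skipped annotations can be written to a report, as suggested in steps 2 and 5.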

ukemi commented 7 months ago

Ticket opened for annotations that are missing in the new load, discovered by @leemdi. This is not on our end, but should be noted as part of the project. https://github.com/geneontology/go-annotation/issues/4831

ukemi commented 7 months ago

Hi @sierra-moxon. I can't remember now whether we left the 'validation' of annotations in the isoform file at only taking ones that had a hyphenated suffix. Today while looking at one of our QC reports, I found some isoforms that don't have a suffix. For example, UniProtKB:D3YX90 is in our GPI cross-referenced to what appears to be a legitimate isoform in PRO:

PR:D3YX90 mADAMTS17 a disintegrin and metalloproteinase with thrombospondin motifs 17 (mouse) mADAMTS17 PR:000000001 NCBITaxon:10090 MGI:MGI:3588195 UniProtKB:D3YX90

It is not cross-referenced to an MGI marker/gene directly. At the end of the day, the best way to determine whether an entity is valid for annotation in MGI is still whether you can find it in the GPI file. The two files from the UniProt upstream would be processed differently: the non-isoform file would be checked for an x-ref to a gene (MGI:MGI:), and the isoform file would be checked against PR: identifiers as above.

pgaudet commented 3 months ago

@LiNiMGI Can you please check whether these are being injected?

LiNiMGI commented 3 months ago

@sierra-moxon @ukemi @leemdi For the GOA_mouse isoform file load, from David's comments on Nov. 17, 2023 above:

  1. Check to see if the UniProt identifier in column 17 of the incoming isoform gaf is in column 10 of the GPI file
  2. If it isn't, skip the annotation (and write it to a report?)

A. Just wondering, does the GOC happen to have a report for any skipped annotations?

B. Also in the 3/20/2024 GPAD, I see annotations like: UniProtKB:O54824-PRO_0000015418 RO:0002264 GO:1902565 PMID:30089723 ECO:0000314 2020-03-05 ARUK-UCL BFO:0000066(UBERON:0002048)

UniProtKB:O54824-PRO_0000015418 RO:0002331 GO:0019221 PMID:30089723 ECO:0000314 2021-04-30 ARUK-UCL BFO:0000066(UBERON:0002048)

At the moment, MGI will filter those annotations out since UniProtKB:O54824-PRO_0000015418 is not in our GPI. Sierra, did you do an extra step check for UniProtKB:O54824 to bring those in instead of skipping them?

C. We would like to have those annotations. We can either add those to the MGI GPI (not sure how), or map the annotations to UniProtKB:O54824 (we will lose the isoform specificity of the annotation).
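For the second option, falling back to the parent accession amounts to stripping the suffix. A minimal sketch, assuming the suffix is always whatever follows the first hyphen in the accession; `parent_accession` is a hypothetical helper.

```python
def parent_accession(curie):
    """Split a UniProtKB CURIE into its parent accession and suffix.

    The suffix is the isoform number ('2') or chain ID ('PRO_...') after the
    first hyphen, or None when the accession has no suffix.
    """
    prefix, _, local = curie.partition(":")
    base, sep, suffix = local.partition("-")
    return f"{prefix}:{base}", (suffix if sep else None)

parent, suffix = parent_accession("UniProtKB:O54824-PRO_0000015418")
```

Keeping the stripped suffix alongside the parent makes it possible to record what specificity was lost when an annotation is remapped this way.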

sierra-moxon commented 3 months ago

A. - No, no report for skipped annotations in the preprocess pipeline (we do have the reports from the GORules).

B. Those records come from the Protein2GO isoform file like this:

UniProtKB   O54824  Il16    acts_upstream_of_or_within  GO:0010628  PMID:30089723   IDA     P   Pro-interleukin-16  Il16    protein taxon:10090 20200305    ARUK-UCL    has_input(UniProtKB:P17515),part_of(GO:0032722),occurs_in(UBERON:0002048)   UniProtKB:O54824-PRO_0000015418
UniProtKB   O54824  Il16    acts_upstream_of_or_within  GO:0032722  PMID:30089723   IDA     P   Pro-interleukin-16  Il16    protein taxon:10090 20200526    ARUK-UCL    has_input(UniProtKB:P17515),occurs_in(UBERON:0002048)   UniProtKB:O54824-PRO_0000015418
UniProtKB   O54824  Il16    acts_upstream_of_or_within  GO:1902565  PMID:30089723   IDA     P   Pro-interleukin-16  Il16    protein taxon:10090 20200305    ARUK-UCL    occurs_in(UBERON:0002048)   UniProtKB:O54824-PRO_0000015418

in preprocess pipeline, I use the mouse GPI to map UniProtKB:O54824 to MGI:1270855 so they appear like this:

MGI MGI:1270855 Il16    acts_upstream_of_or_within  GO:0010628  PMID:30089723   IDA     P   Pro-interleukin-16      protein taxon:10090 2024031ARUK-UCL has_input(UniProtKB:P17515),part_of(GO:0032722),occurs_in(UBERON:0002048)   UniProtKB:O54824-PRO_0000015418
MGI MGI:1270855 Il16    acts_upstream_of_or_within  GO:0032722  PMID:30089723   IDA     P   Pro-interleukin-16      protein taxon:10090 2024031ARUK-UCL has_input(UniProtKB:P17515),occurs_in(UBERON:0002048)   UniProtKB:O54824-PRO_0000015418
MGI MGI:1270855 Il16    acts_upstream_of_or_within  GO:1902565  PMID:30089723   IDA     P   Pro-interleukin-16      protein taxon:10090 2024031ARUK-UCL occurs_in(UBERON:0002048)   UniProtKB:O54824-PRO_0000015418

then in the GPAD emission, discussed here, I replace the "subject.id" with the value of the isoform identifier:

UniProtKB:O54824-PRO_0000015418 RO:0002264 GO:1902565 PMID:30089723 ECO:0000314 2020-03-05 ARUK-UCL BFO:0000066(UBERON:0002048)
UniProtKB:O54824-PRO_0000015418 RO:0002331 GO:0019221 PMID:30089723 ECO:0000314 2021-04-30 ARUK-UCL BFO:0000066(UBERON:0002048)

LiNiMGI commented 3 months ago

Thanks @sierra-moxon, Li will find out whether we can get a PR:ID for them. Li

pgaudet commented 2 months ago

Discussing with @LiNiMGI

The task is to convert UniProt chains to PRO ID.

Once PRO (Protein Ontology) IDs are available for these UniProt chains (UniProt:O####), then Li will convert the UniProt-chain ID space to the PRO (Protein Ontology) ID space. How? Can the UniProt chain to PRO ID mapping be obtained from the GPI, or does this need to be done manually?