Add RGD orthology file to MGI datastream

kltm commented 1 year ago

Replace the current MGI upstream pipeline with a local GO pipeline for the RGD orthology file.

Would be using the GO RGD file, the MGI GPI file, and a TBD (AGR) orthology file.

kltm commented 1 year ago

Tagging @sierra-moxon @ukemi

kltm commented 1 year ago

From Lori:

We use this Alliance file: fms.alliancegenome.org/download/ORTHOLOGY-ALLIANCE_COMBINED.tsv.gz

This file is Orthology Filter: Stringent

sierra-moxon commented 1 year ago

@kltm - confirming the "GO RGD file" is the RGD GPI file here? RGD GPI

- nevermind, I see Lori's comment in the human ticket and I'll list it here as well. - instead of using Entrezgene, I guess my question remains - should I use the RGD GPI file?

sierra-moxon commented 1 year ago

@ukemi @kltm @sierra-moxon

• We use this Alliance file: fms.alliancegenome.org/download/ORTHOLOGY-ALLIANCE_COMBINED.tsv.gz
  o This file is Orthology Filter: Stringent
• We use the Rat markers from Entrezgene to load our Rat genes/RGD: into MGI
  o http://ftp.ncbi.nih.gov/gene/DATA
  o gene2accession.gz gene2pubmed.gz gene2refseq.gz gene_history.gz gene_info.gz
• We use columns 1 = MGI, column 5 = RGD, going in one direction only
• As long as both MGI: and RGD:* exist in our database, we load the MGI/Rat association.

Let me know if there is any other info you need about this.

Thanks. Lori
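
For reference, the loading rule Lori describes (column 1 = MGI, column 5 = RGD, one direction only, both ids must already be known) could be sketched roughly as below. This is only an illustration, not MGI's loader: the function name, the header names, and the `known_mgi`/`known_rgd` sets standing in for "exists in our database" are assumptions, and every id in the sample except the St8sia5 pair is made up.

```python
import csv
import io

# Hypothetical sample in the layout Lori describes: column 1 is the mouse
# gene id and column 5 is the ortholog id. Only the St8sia5 pair comes from
# this thread; the other ids and the header names are assumptions.
SAMPLE_TSV = (
    "# header comment\n"
    "Gene1ID\tGene1Symbol\tGene1SpeciesTaxonID\tGene1SpeciesName\tGene2ID\n"
    "MGI:109243\tSt8sia5\tNCBITaxon:10090\tMus musculus\tRGD:1302934\n"
    "MGI:0000001\tFake1\tNCBITaxon:10090\tMus musculus\tRGD:0000001\n"
    "MGI:109243\tSt8sia5\tNCBITaxon:10090\tMus musculus\tHGNC:0000001\n"
)


def build_mgi_to_rgd(tsv_text, known_mgi, known_rgd):
    """Map MGI ids to RGD ids from orthology rows, in one direction only."""
    pairs = {}
    data_lines = (
        line for line in io.StringIO(tsv_text) if not line.startswith("#")
    )
    for row in csv.reader(data_lines, delimiter="\t", quoting=csv.QUOTE_NONE):
        if len(row) < 5:
            continue
        mgi_id, rgd_id = row[0], row[4]  # column 1 and column 5
        # Skips the header row and any non mouse-to-rat pair.
        if not (mgi_id.startswith("MGI:") and rgd_id.startswith("RGD:")):
            continue
        # Load the association only when both ids are already known.
        if mgi_id in known_mgi and rgd_id in known_rgd:
            pairs.setdefault(mgi_id, set()).add(rgd_id)
    return pairs


mapping = build_mgi_to_rgd(
    SAMPLE_TSV,
    known_mgi={"MGI:109243", "MGI:0000001"},
    known_rgd={"RGD:1302934"},
)
```

Keeping a set of RGD ids per MGI id leaves room for one-to-many orthology; whether the real load allows that is not stated in this ticket.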

kltm commented 1 year ago

@sierra-moxon That one, except I'd generally opt for the snapshot version. Technically, when it comes time, you will be referencing the version from inside the pipeline, but that can wait a little longer.

sierra-moxon commented 1 year ago

I started with the rat annotations and have a GAF file based on your requirements (I shared it via email). The status here is that I am currently comparing my output to the file I can download from MGI, the nonoctua file.

ukemi commented 1 year ago

@sierra-moxon I will sanity check this while @leemdi is away. Is this file an intermediate step to eventually building the GPAD we will consume? In other words, should I just be sanity checking the integrity of the data, mappings etc?

ukemi commented 1 year ago

Column H should always be populated with the RGD identifier from the original annotation.

sierra-moxon commented 1 year ago

This file is the one that will feed the GO pipeline, replacing the file we pick up from MGI. I'd like to make sure this file recapitulates the file you produce.

Then, the next step is to make sure that the "annotations" files we produce in the GO Pipeline itself are of a flavor you can use. From another ticket I see that this file, for example (the MGI GPAD that comes out of the pipeline), is of a slightly different GPAD flavor (version) than this one that you use to consume noctua annotations only? Both are version 1.2 according to the headers.

kltm commented 1 year ago

@sierra-moxon I think the second part is technically another discussion: changing our source for the upstream has no effect on the products. Switching to GPAD 2.0 across the board at some point likely would.

ukemi commented 1 year ago

Ah, I see. There is still the issue with the 'with' field being populated incorrectly. It should always have the RGD identifier from the original annotation. The one you posted has: nothing, RGD, Uniprot, ChEBI....

sierra-moxon commented 1 year ago

thanks for taking a look @ukemi! :) The current MGI nonoctua GAF is missing RGD ids in column H (some rows have an RGD id, some have UniProt ids, and some have no ids in column H). To clarify, for the new file I am generating: should I convert all UniProt ids that come in with the original annotation (from RGD) to RGD ids? Or should I add the RGD ids to the UniProt ids in that column? For those that are currently null, I will add RGD ids from the original annotation.

ukemi commented 1 year ago

Sorry @sierra-moxon I was confused about what this file represented. The current file that you point to above has everything in it, the rat orthology annotations, the human orthology annotations and all of the IEA annotations. I thought you were only trying to convert the rat orthology annotations at this point. Are you in fact trying to recreate the entire non-noctua file? In that case the 'with' field will indeed be a mixed bag of things.

ukemi commented 1 year ago

@sierra-moxon I just reread your post. If the annotation comes from the rat annotation file, RGD, then column H should have the RGD identifier from the original annotation, column 2. The RGD identifier in column 2 from the original annotation should be replaced with the orthologous MGI identifier and all the evidence codes should be ISO. Same for human.
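
A rough sketch of the row rewrite described above, assuming a 17-column GAF row held as a Python list (column 2 is index 1, column 7 is index 6, column H is index 7). The function name is hypothetical, the `PMID:0000000` and `UniProtKB:P00000` values are placeholders, and setting column 1 to `MGI` is an assumption that this comment does not spell out. References, dates, and the provider are left untouched here because they are discussed separately in this ticket.

```python
def rat_to_mouse(cols, mgi_id):
    """Return a copy of a 17-column GAF row rewritten as a mouse ISO row."""
    out = list(cols)
    db, obj_id = cols[0], cols[1]
    # Assumption: the source file may carry a bare accession in column 2,
    # so add the DB prefix when it is missing.
    original_id = obj_id if ":" in obj_id else f"{db}:{obj_id}"
    out[0] = "MGI"        # assumption: DB column follows the new object id
    out[1] = mgi_id       # column 2: orthologous MGI identifier
    out[6] = "ISO"        # column 7: evidence code
    out[7] = original_id  # column H: only the original RGD identifier
    return out


# Placeholder rat row; only the gene, GO term and ids for St8sia5 come from
# this thread.
rat_row = [
    "RGD", "1302934", "St8sia5", "located_in", "GO:0000139",
    "PMID:0000000", "IDA", "UniProtKB:P00000", "C", "", "",
    "protein_coding_gene", "taxon:10116", "20200114", "RGD", "", "",
]
mouse_row = rat_to_mouse(rat_row, "MGI:109243")
```

Note that whatever was in column H of the source row is replaced, not appended to.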

sierra-moxon commented 1 year ago

thanks @ukemi I had already replaced the RGD identifier with the MGI identifier, and now I've also added the RGD identifier to column H - I appended the RGD to the existing value from RGD (the spreadsheet has been updated to reflect this new change and the new file if you want to confirm). :)

ukemi commented 1 year ago

Because I foresee myself getting confused about what files are where, I have created a new folder on the shared drive under Gene Ontology/Working Groups/ MGI imports/Integrate remainder of MGI pipeline into the GO pipeline. I have made a copy of the annotation file from @sierra-moxon in the new folder. I will add documentation of the kind of checks I do to that folder. I hope everyone is ok with this.

ukemi commented 1 year ago

First sanity check of the file: Sanity check the file for overall structure (June 19).

Sorting on column H, I still see identifiers that shouldn’t be in the column. I see ChEBI, complex portal, UniProtKB, Panther etc. The only value in this column should be the RGD identifier that came from the original annotation.
It looks like some of the filtering that is described here: https://docs.google.com/document/d/123o6GJ0lBwE7xUPM_LJXDJ-DoZeCN7Zh/edit didn’t work. I see annotations to GO:0005515 but none to GO:0005488.
Looking at the annotations that have Panther identifiers, I suspect this is due to the inclusion of IBA annotations in the translation. The first step of the conversion should be to filter the rat annotations so that only ones that have experimental evidence are used to generate mouse annotations: 'IDA', 'IPI', 'IGI', 'IMP', 'EXP'.
If I look at column O, I see lots of providers. In order to not eat our own tail, we should filter out any annotation that is provided by MGI. I suspect many of the other providers will disappear once the evidence code filter is in place (e.g. GO_central).
We do not import annotation extensions for ISO annotations. Column P should be blank.
There are 116,870 annotations in the file.
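
The filters listed in this check could look roughly like the sketch below, again assuming 17-column GAF rows as Python lists (evidence code at index 6, column O at index 14, column P at index 15). The function names are hypothetical; the excluded terms are the two GO ids mentioned in this thread, and the real filter document may list more rules.

```python
EXPERIMENTAL_CODES = {"IDA", "IPI", "IGI", "IMP", "EXP"}
EXCLUDED_TERMS = {"GO:0005515", "GO:0005488"}


def keep_source_row(cols):
    """True if a rat GAF row may be used to generate a mouse ISO annotation."""
    if cols[6] not in EXPERIMENTAL_CODES:  # keep experimental evidence only
        return False
    if cols[14] == "MGI":                  # do not eat our own tail
        return False
    if cols[4] in EXCLUDED_TERMS:          # terms on the exclude list
        return False
    return True


def blank_extension(cols):
    """Return a copy of the row with column P (annotation extension) emptied."""
    out = list(cols)
    out[15] = ""
    return out
```

The evidence test is written as "keep if in the list", which is the reading of the filter rule that this check asks for.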

ukemi commented 1 year ago

@sierra-moxon I'm still a bit confused about this file. If I look at some of the first few annotations that have panther identifiers in column H, they seem to not be coming from the rat orthology file. Instead they seem to be from the MGI load that loads the GOC annotations. If this is correct, then they should retain the original annotation data and get the original IBA evidence code and reference to the Gaudet paper. Could we meet so you can go over this file with me wrt how it was generated?

sierra-moxon commented 1 year ago

Hi @ukemi - thanks a bunch - a meeting would be terrific, I'll set one up for us. :)

"Sorting on column H, I still see identifiers that shouldn’t be in the column. I see CheBI, complex portal, UniprotKB, Panther etc. The only value in this column should be the RGD identifier that came from the original annotation."

I added the RGD identifier to the identifiers already in that column in the source RGD annotation file. I will remove the identifiers that RGD has assigned there and replace them with only the RGD id. Am I understanding that correctly?

"It looks like some of the filtering that is described here: https://docs.google.com/document/d/123o6GJ0lBwE7xUPM_LJXDJ-DoZeCN7Zh/edit Didn’t work. I see annotations to GO:0005515 but none to GO:0005488."

I will check, they are both listed in my code as ids in an exclude list, so I will have to see what is going on.

"Looking at the annotations that have Panther identifiers, I suspect this is due to the inclusion of IBA annotations in the translation. The first step of the conversion should be to filter the rat annotations so that only ones that have experimental evidence are used to generate mouse annotations. 'IDA', 'IPI', 'IGI', 'IMP', 'EXP'"

I definitely misinterpreted this line from the filter doc: "skip: if Evidence code is in list (field 7) (['IDA', 'IPI', 'IGI', 'IMP', 'EXP']) keep" to mean "skip if the original annotation uses one of these evidence codes" -- so I will fix. :)

"If I look at column O, I see lots of providers. In order to not eat our own tail, we should filter out any annotation that is provided by MGI. I suspect many of the other providers will disappear once the evidence code filter is in place, eg the GO_central)"

The original RGD annotation file does specify providers other than RGD. Per this requirement in the filter file: "skip: if assigned by value = 'MGI' (field 15) (no eat tail)" -- I interpreted "assigned_by" to mean "provided_by". Is this right? I filter any with "provided_by" = MGI. Then, I replace "RGD" in the "provided_by" column with "MGI" when I create the MGI annotation. I leave the other providers there as there was no other requirement to handle them - this was on my list of questions to ask in a meeting :)

"We do not import annotation extensions for ISO annotations. Column P should be blank."

ok, I will add this requirement to the filter doc now.

"There are 116,870 annotations in the file."

Thank you! :) -- this is 116,870 sourced from RGD/rat right?

ukemi commented 1 year ago

That's where I think I am confused. It would be good for you to go over the file with me so I can see exactly how you generated it. I wasn't sure if everything was sourced from rat or if you injected rat annotations into some other file you generated. Not to be a pain, but I think it would be best to look at stuff together. I may have given you unclear feedback because I still wasn't sure what I was looking at. Sorry!

sierra-moxon commented 1 year ago

That sounds good. For the ticket (and honestly I need to document the preprocessing pipeline and this can be a start :)), here's what I am doing in the code for this (the locations of these files can change):

Grab and parse the Alliance orthology file from here: https://fms.alliancegenome.org/download/ORTHOLOGY-ALLIANCE-JSON_COMBINED.json.gz (uses a standard JSON parser).
Grab and parse the MGI GPI file from here: http://snapshot.geneontology.org/annotations/mgi.gpi.gz (uses ontobio for parsing).
Grab and parse the RGD GO annotation file from here: http://snapshot.geneontology.org/annotations/rgd.gaf.gz (uses ontobio for parsing, and the GoAnnotation object in particular).

The code then applies the rules from this document, and the data structures parsed from the MGI GPI file and the Alliance orthology file to isolate and transform the valid RGD annotations into MGI annotations based on orthology (ISO).

Does this help?
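
For anyone following along without ontobio installed, reading a gzipped GAF can be approximated with the standard library alone, as in the sketch below. This is a stand-in and not the ontobio parser used in the actual code: it only skips `!` header lines and splits on tabs, with no validation, and the function name is hypothetical.

```python
import gzip
import io


def read_gaf(binary_stream):
    """Yield each annotation row of a gzipped GAF as a list of columns."""
    with gzip.open(binary_stream, mode="rt", encoding="utf-8") as handle:
        for line in handle:
            line = line.rstrip("\n")
            # Skip blank lines and '!' header/comment lines.
            if not line or line.startswith("!"):
                continue
            yield line.split("\t")


# In-memory example; a real run would pass an open file or HTTP response body.
example = gzip.compress(b"!gaf-version: 2.2\nRGD\t1302934\tSt8sia5\n\n")
rows = list(read_gaf(io.BytesIO(example)))
```

Rows produced this way are plain lists, so column H is `row[7]` and column O is `row[14]` on a full 17-column line.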

ukemi commented 1 year ago

Yes. This helps, but it's still unclear to me whether the file you posted contains only those annotations derived from the rat or if it is those plus something else. It seems like it is those plus something else when I look at individual rows. I'll invite you to a meeting.

sierra-moxon commented 1 year ago

I have a feeling/hypothesis that because I have the "evidence code" requirement flipped in the current iteration, you're seeing a lot more than you should. That or the RGD annotation file has more in it than just rat -- I can check the taxon.

ukemi commented 1 year ago

The new file created on June 21 looks much better. I have put a copy of it here: https://docs.google.com/spreadsheets/d/1R4JYh5wfio9oipuv99tqhUcBIqKgxqZfJwiRvaVd8b0/edit#gid=0

The ball is now in my court to have a more specific look at the annotations. Things we noticed on the June 21 call.

[x] The evidence code is missing
[x] Column P is populated
[x] Annotations are 'duplicated' (Not necessarily a big deal at this point, but we can check with @leemdi)
[x] Checking the provider, it looks like we assign everything to MGI regardless of where it first came from. This one first came from UniProt:
(gene_association.mgi) MGI MGI:109243 St8sia5 located_in GO:0000139 MGI:MGI:4417868|GO_REF:0000096 ISO RGD:1302934 C ST8 alpha-N-acetyl-neuraminide alpha-2,8-sialyltransferase 5 Siat8e|ST8SiaV protein_coding_gene taxon:10090 20200114 MGI
(gene_association_nonoctua.mgi) MGI MGI:109243 St8sia5 located_in GO:0000139 MGI:MGI:4417868|GO_REF:0000096 ISO RGD:1302934 C ST8 alpha-N-acetyl-neuraminide alpha-2,8-sialyltransferase 5 Siat8e|ST8SiaV protein_coding_gene taxon:10090 20200114 MGI

ukemi commented 1 year ago

Agenda for June 27 meeting: Meeting summarized in the human ticket #328

Report from @ukemi
-Status of QC on rat file so far
-Wish list
Report from @sierra-moxon
-Status on bug fixes
-Human file underway?
Next steps

ukemi commented 1 year ago

@sierra-moxon I tested our hypothesis that the missing annotations in your file versus the MGI file were from the UniProt identifiers. When I look at the RGD GAF, it looks like all the UniProt identifiers are associated with the IBA evidence code, so in a practical sense, they should all be filtered out. The ~400 missing annotations must be going missing somewhere else. Now I'm curious!
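
One way to repeat this check is to tally evidence codes over the rows whose with/from column mentions a UniProt id, as in the sketch below. It assumes 17-column GAF rows as Python lists, a `UniProtKB:` prefix, and `|` or `,` as separators in column H; the function name is hypothetical and the ids in the example are placeholders.

```python
import re
from collections import Counter


def evidence_for_with_prefix(rows, prefix="UniProtKB:"):
    """Count evidence codes among rows whose column H has an id with prefix."""
    counts = Counter()
    for cols in rows:
        with_values = [v for v in re.split(r"[|,]", cols[7]) if v]
        if any(v.startswith(prefix) for v in with_values):
            counts[cols[6]] += 1
    return counts


def _example_row(evidence, with_from):
    cols = [""] * 17
    cols[6], cols[7] = evidence, with_from
    return cols


# Placeholder rows, not real annotations.
example_rows = [
    _example_row("IBA", "UniProtKB:P00000|PANTHER:PTN000000000"),
    _example_row("IDA", ""),
    _example_row("IBA", "UniProtKB:Q00000"),
]
tally = evidence_for_with_prefix(example_rows)
```

If the hypothesis holds for the real file, the tally should contain only `IBA`.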

ukemi commented 1 year ago

@sierra-moxon @kltm I brought up the issue of the PMIDs from the original annotation being available to curators at lab meeting this morning. The curators thought it was critical to have this information. We will have to have it in the final GPAD that MGI picks up.

kltm commented 1 year ago

@ukemi @sierra-moxon Unfortunately, that's possibly bad news. The way we are set up, we are going to pick up the GAF version of the data for processing, where no comment is possible. IIRC, the pipeline does not currently natively process GPAD/GPI, so that data cannot be passed forward and cannot be part of the final file. The choice would be to defer this until we are using GPAD/GPI throughout the pipeline (not in the current roadmap) or to figure out a workaround. While it may feel "critical", how much is this actually used and can we figure another workflow for debugging?

ukemi commented 1 year ago

But metadata that is not accommodated in the gaf is essential for the future in any case. For the Noctua annotations, it is essential that we are able to capture the curator OrcID and the model identifier. See the current GPAD output of the Noctua annotations as examples. So even if we don't grant this wish, we will face this issue that we need the extra data in the GPAD.

kltm commented 1 year ago

@ukemi We may have to sit down and talk about what data is available where, but I think in the case of noctua annotation data, we have access to more fields because we are treating GPAD more natively at that point in the pipeline. In the first parts of the pipeline, I believe that GAF is the only method of transferring data. I'll make a note of this for the software call today.

ukemi commented 1 year ago

I see what you are saying. The 'native' annotations from Noctua are already GPAD and have the extra data. The strategy for the other upstreams Rat and Human ISOs, for example, is to create a 'native' gaf file. So this won't include extra metadata. Am I understanding correctly?

kltm commented 1 year ago

@ukemi I believe that is more-or-less correct, but we'll need to sit down and work it out. The pipeline has some very primitive parts in it and basically operates on a file-by-file basis. The first stages of the pipeline make a lot of GAF-specific assumptions, but some of those are looser later on. I'm actually not 100% sure how the final merge with the noctua file into the main data stream works, so we'll have to have a discussion about that internally to make sure that we can get annotation metadata all the way to the end.

kltm commented 1 year ago

I think it is definitely still worthwhile to mock things all the way through with a test pipeline, which would also give us a better idea of what may need to be changed to support what we want/need to do.

ukemi commented 1 year ago

I think currently the GOC pipeline doesn't incorporate the mouse noctua annotations directly. MGI picks them up and injects them into the MGI GAF and GPAD files that we release and the GOC uses for AmiGO2 etc. MGI picks up the noctua annotations as a 'GPAD' file with the extra metadata about models etc. You are correct. We need to mock things all the way through. Let's touch base on this at next Wednesday's meeting.

ukemi commented 1 year ago

@kltm and @ukemi touched base on this on the 7/26 managers' call. @ukemi will point out the difficulties of loading the extra information at the MGI lab meeting and will demonstrate to MGI curators how to easily find the original PMID using links that can be put in the PWI search tool at MGI. Hopefully this will be acceptable. If this is the case, we can continue to move forward. The metadata from the noctua models will be injected intact to the final GPAD files created from the pipeline.

ukemi commented 1 year ago

Due to a heads-up from @leemdi this morning I checked into the providers in the various annotation streams.

When we load annotations into MGI, we keep track of the source of the annotation provider and each load is labelled. For example the rat load has RGD as the source in MGI.
When the annotations are exported from MGI they are attributed to MGI, presumably because we run the ISO generation software. This is what is in Sierra's file because we are trying to match the MGI output.
@leemdi uses the original source for our drop and reload procedure as well as providing some statistical reports for MGI use.
We will need to find ways to identify the annotations coming from different upstreams if the reports are still needed (I suspect they will be). I don't think this will be hard on our end.
Is it fair to still attribute the ISO annotations to MGI when the software generating them is now being run at the GOC?
None of these are show stoppers, but we need to keep them in mind as we move forward.

sierra-moxon commented 1 year ago

@ukemi - I had the same question about provider…mostly because I was worried that you might want “issues” (if there are any) to be filed from users in geneontology git?

pgaudet commented 8 months ago

@LiNiMGI Can you please check whether these are now being injected in the products? with correct date and assigned_by GOC

LiNiMGI commented 8 months ago

notes for me: waiting for the new GPAD file to check "the date and assigned_by"

LiNiMGI commented 7 months ago

new GPAD file has the right "date and assigned_by" .

geneontology / pipeline

Add RGD orthology file to MGI datastream #327