geneontology / amigo

AmiGO is the public interface for the Gene Ontology.
http://amigo.geneontology.org
BSD 3-Clause "New" or "Revised" License

Duplicate annotations #531

Open pgaudet opened 6 years ago

pgaudet commented 6 years ago

Hello,

I don't know if similar cases have been reported (I thought previous reports had more to do with redundant but distinct annotations). These appear completely identical (one with a UniProt ID, one with an MGI ID).

Thanks, Pascale

ValWood commented 6 years ago

I have been complaining about both (forever).

BUT all of these annotations are present twice !

vanaukenk commented 6 years ago

These look like different Mus species, no? Or am I missing something else?

ValWood commented 6 years ago

Doh!

pgaudet commented 6 years ago

@vanaukenk Well, that's surprising. The same paper tested two mouse species?

kltm commented 6 years ago

If there is to be a discussion of what is going on with upstream data, I would suggest opening a "cause" ticket in go-annotation--I'm pretty much the only active person around here.

pgaudet commented 6 years ago

@hdrabkin Can you please have a look and let us know whether this is right ?

hdrabkin commented 6 years ago

First glance: the gene names are supposed to be the same for the same symbol (B2m, beta-2 microglobulin), so something is weird. In our Ei (which is looking at our db), we only have 4 annotations using this PMID, and we have nothing with a PANTHER ID in it. I just grepped our current annotation file: it is there 4 times, not 8.

pgaudet commented 6 years ago

@hdrabkin Do you export annotations to Mus spretus? This is the first time I see them.

hdrabkin commented 6 years ago

I guess they COULD be in the gpad and not the gaf, but I don't see how we would store them in our DB (that is, I can't look at them). I just grepped the gpa and gaf and cannot find an instance of taxon:10096.

hdrabkin commented 6 years ago

We do append things that we can't load to the gpad/gafs that we get from GOA, but then I would see the annotations when I grepped for the PMID, and there are only 4 instances of the PMID. I just love Mondays.

pgaudet commented 6 years ago

There are 8 in P2GO; @tonysawfordebi any idea what is happening ?

image

hdrabkin commented 6 years ago

We don't have them in MGI. Can you tell me if an MGI curator made them? I don't have access to p2go myself.

If not, they would be in our GOA load and stripped because we would consider them duplicate annotations anyway, as reflected in the fact that they are not in the GAF and GPAD we export (I can only find 8); I don't see how AmiGO would have more than we have in our gaf and gpad.

hdrabkin commented 6 years ago

Wondering if the GOA_mouse gaf we load only has taxon 10090? Yes, the file we get does NOT have 10096. Confirmed: it is not in goa_mouse.gaf.gz.

pgaudet commented 6 years ago

It shows up the same as the other MGI annotations - at any rate not like they were done in P2GO.

hdrabkin commented 6 years ago

I have no idea whatsoever; we don't have them here in our db; I don't know what is going on! I'm gonna go git (haha) a beer.

pgaudet commented 6 years ago

Well, let's see what @tonysawfordebi says tomorrow.

kltm commented 6 years ago

Looking at the upstream data for AmiGO:

sjcarbon@moiraine:/tmp$:( grep 1927183 mgi.gaf 
MGI MGI:88127   B2m     GO:0006826  MGI:MGI:1203460|PMID:9531620    IMP MGI:MGI:1927183 P   beta-2 microglobulin    beta2-m|beta 2 microglobulin|Ly-m11 protein taxon:10090 20160920    MGI     
MGI MGI:88127   B2m     GO:0033216  MGI:MGI:1203460|PMID:9531620    IMP MGI:MGI:1927183 P   beta-2 microglobulin    beta2-m|beta 2 microglobulin|Ly-m11 protein taxon:10090 20160920    MGI     
MGI MGI:88127   B2m     GO:0045646  MGI:MGI:1203460|PMID:9531620    IMP MGI:MGI:1927183 P   beta-2 microglobulin    beta2-m|beta 2 microglobulin|Ly-m11 protein taxon:10090 20160920    MGI     
MGI MGI:88127   B2m     GO:0071283  MGI:MGI:1203460|PMID:9531620    IMP MGI:MGI:1927183 P   beta-2 microglobulin    beta2-m|beta 2 microglobulin|Ly-m11 protein taxon:10090 20160920    MGI     
sjcarbon@moiraine:/tmp$:) grep 1927183 goa_uniprot_all_noiea.gaf 
UniProtKB   Q04714  B2m     GO:0006826  PMID:9531620    IMP MGI:MGI:1927183 P   Beta-2-microglobulin    B2m protein taxon:10096 20160920    MGI     
UniProtKB   Q04714  B2m     GO:0033216  PMID:9531620    IMP MGI:MGI:1927183 P   Beta-2-microglobulin    B2m protein taxon:10096 20160920    MGI     
UniProtKB   Q04714  B2m     GO:0045646  PMID:9531620    IMP MGI:MGI:1927183 P   Beta-2-microglobulin    B2m protein taxon:10096 20160920    MGI     
UniProtKB   Q04714  B2m     GO:0071283  PMID:9531620    IMP MGI:MGI:1927183 P   Beta-2-microglobulin    B2m protein taxon:10096 20160920    MGI     
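
The comparison above can be sketched in code. This is a minimal sketch, assuming tab-separated GAF 2.x input (the pasted output has its tabs collapsed to spaces); the function names `pmids` and `cross_taxon_groups` are hypothetical and not part of AmiGO or any GO pipeline. It groups annotations that share GO term, PMID, evidence code, With value and assigned-by source, and reports groups that span more than one taxon.

```python
from collections import defaultdict

# 0-based positions of the GAF 2.x columns used below.
COL_DB, COL_ID, COL_GO, COL_REF = 0, 1, 4, 5
COL_EVIDENCE, COL_WITH, COL_TAXON, COL_ASSIGNED_BY = 6, 7, 12, 14


def pmids(reference_field):
    """Return only the PMID entries of a pipe-separated reference column."""
    return frozenset(r for r in reference_field.split("|") if r.startswith("PMID:"))


def cross_taxon_groups(gaf_lines):
    """Return annotation groups that are identical except for subject and taxon."""
    groups = defaultdict(set)
    for line in gaf_lines:
        if not line.strip() or line.startswith("!"):
            continue  # skip blank lines and GAF header/comment lines
        cols = line.rstrip("\n").split("\t")
        key = (
            cols[COL_GO],
            pmids(cols[COL_REF]),
            cols[COL_EVIDENCE],
            cols[COL_WITH],
            cols[COL_ASSIGNED_BY],
        )
        groups[key].add((cols[COL_DB] + ":" + cols[COL_ID], cols[COL_TAXON]))
    # Keep only groups whose members come from more than one taxon.
    return {k: v for k, v in groups.items() if len({t for _, t in v}) > 1}
```

Run over the concatenation of mgi.gaf and goa_uniprot_all_noiea.gaf, each reported group would pair an MGI:88127 row (taxon:10090) with a Q04714 row (taxon:10096). Only the PMID is compared because the MGI rows carry an extra MGI reference in the same column.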
cmungall commented 6 years ago

Sounds like this is an upstream issue from amigo. Open a ticket in go-annotation as Seth suggests, assign to someone who can go into p2go and find the provenance of these.

hdrabkin commented 6 years ago

I just want to know that I understand it. So there are 4 annotations supplied by GOA in the all_noIEA file that are for taxon 10096? So that is why this is not in our gaf: it's not from MGI, even though it is being mapped to the MGI gene but for a different species? BTW, looking at PMID:9531620, in the M&M they state that they are using C57BL/6 (B6). This is Mus musculus. The alleles used in the With field are from Mus musculus. I suggest these be deleted unless I'm missing something here.

krchristie commented 6 years ago

Looking in the MGI curation interface (view is sorted by paper to show all annotations from J:47457, aka PMID:9531620), MGI (specifically Dmitry) made four annotations from this paper, dated 2016-09-20.

20180816-mgi-ei-sourceannots

Then, in P2GO, there are now 8 annotations all credited to MGI with the same date (requires mouseover of calendar icon in P2GO to view date)

20180816-p2goannots

So, it looks to me like the issue is in how the original MGI annotations were propagated to UniProt. It seems incorrect to propagate these musculus annotations to spretus with the same experimental evidence code, so I agree that these should be removed.

MGI cannot remove the spretus annotations because we did not make them in our interface, and since P2GO says they are from MGI, they cannot be edited in P2GO.

They are going to need to be removed by whatever pipeline duplicated MGI's annotations to spretus.

hdrabkin commented 6 years ago

Yes I originally commented the same on the ticket; really weird.


tonysawfordebi commented 6 years ago

@pgaudet I was away last week, hence the delayed reply - sorry!

I've looked at the case you mentioned, and have an explanation.

In the original MGI GPAD, there are indeed just four annotations, all to MGI:MGI:88127.

However, in the mapping file that we use to translate MGI identifiers to UniProtKB identifiers, and whose content is derived from UniProt cross-references, this particular MGI ID is mapped to two UniProtKB IDs:

MGI:MGI:88127 UniProtKB:P01887;UniProtKB:Q04714

Hence when we import the MGI annotations, those four annotations to MGI:MGI:88127 are expanded to eight, four to UniProtKB:P01887, and four to UniProtKB:Q04714.
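
The expansion described above can be sketched as follows. This is a minimal illustration only, not the actual GOA import code; `parse_mapping_line` and `expand_annotations` are hypothetical names, and annotations are reduced to (subject, GO term) pairs for brevity.

```python
def parse_mapping_line(line):
    """Split 'MGI:MGI:88127 UniProtKB:P01887;UniProtKB:Q04714' into
    the source ID and the list of target IDs."""
    source_id, targets = line.split(None, 1)
    return source_id, [t.strip() for t in targets.split(";") if t.strip()]


def expand_annotations(annotations, mapping):
    """Yield one copy of each (subject, GO term) annotation per mapped target."""
    for subject, go_id in annotations:
        for target in mapping.get(subject, []):
            yield target, go_id


source_id, targets = parse_mapping_line(
    "MGI:MGI:88127 UniProtKB:P01887;UniProtKB:Q04714"
)
mapping = {source_id: targets}

# The four GO terms annotated to MGI:MGI:88127 from PMID:9531620.
mgi_annotations = [
    ("MGI:MGI:88127", go_id)
    for go_id in ("GO:0006826", "GO:0033216", "GO:0045646", "GO:0071283")
]
expanded = list(expand_annotations(mgi_annotations, mapping))
```

With two targets in the mapping, the four source annotations become eight, which matches the count seen in P2GO.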

pgaudet commented 6 years ago

Hi @tonysawfordebi

Welcome back !

How are these mappings generated ? It's surprising that the same ID goes to two proteins from different species.

Thanks, Pascale

tonysawfordebi commented 6 years ago

@pgaudet Thanks :) I don't know - that would be a question for the UniProt folk. If the relevant people are around, I'll ask them.

BTW, in P2GO if you click on the little chain link icon at the right hand end of an imported annotation you can see the details of the original (pre-ID-translation) version of the annotation:

snap1 snap2

tonysawfordebi commented 6 years ago

I've had a chat with the relevant UniProt authority, and it seems that in a previous version of the mapping file that was supplied by MGI to UniProt, MGI:88127 was indeed mapped to P01887 and Q04714.

However, the mapping to Q04714 has subsequently been deleted from the MGI-supplied file. This change will eventually filter through into the mapping files that we use in the import process, but unfortunately we can't specify any timescale for this, as it requires action on the part of Swiss-Prot curators to integrate this change into UniProt.

pgaudet commented 6 years ago

Great, thanks. So I guess this is fixed. I'd like to see the change in AmiGO before closing.

@hdrabkin just want to make sure you've seen this.

Thanks, Pascale

pgaudet commented 6 years ago

@kltm Is the 'blocked/upstream' label OK for this?

hdrabkin commented 6 years ago

Thanks. I'm still scratching my head over "the mapping to Q04714 has subsequently been deleted from the MGI-supplied file." Q04714 is not in our system. If you put that ID into MGI you get nothing back. It is unclear to me what is meant by "MGI-supplied file". I ask because I'd like to find others if they exist.

hdrabkin commented 6 years ago

OK, I just found more by searching in AmiGO for other Mus taxa. The annotations are being attributed to MGI, but MGI cannot make these annotations to these specific taxa, so they must also be covered by Tony's explanation. There are 17 annotations to M. spretus for Hgprt and a lot more (333 total), and 96 for another one. @tonysawfordebi can you tell me what IDs you have for Hgprt from MGI? I assume one of them is an incorrect assignment.

pgaudet commented 6 years ago

@hdrabkin Does MGI have a UniProt mapping file to other mouse species ? It looks like it's the mapping file that caused the problems, not the actual annotations.

Pascale

hdrabkin commented 6 years ago

I see this: http://www.informatics.jax.org/downloads/reports/MRK_SwissProt_TrEMBL.rpt. I'm looking for non-musculus IDs for Hgprt now.

hdrabkin commented 6 years ago

There are also annotations for M. spretus for Mid1. The report file has these associated with this gene, and none of them appear to be spretus: O70583, Q6PD02, Q3UXC7, Q3TVH5, B1AV01, B1AUZ9. So the spretus annotations for Mid1 should be removed also.

pgaudet commented 4 years ago

@hdrabkin @alexsign This is not yet fixed: MGI:88127 is still mapped to both P01887 (Mus musculus) and Q04714 (Mus spretus).

From @hdrabkin's comment, this is not what we want. I am not sure whether or not it impacts annotations, but see for example PMID:16299293: are there strains from both species being tested, or do we have additional annotations due to the spurious mapping?

Thanks, Pascale

magrane commented 4 years ago

Alex passed on this issue for me to have a look at. As far as I understand from reading through the ticket, the source of the problem is an incorrect MGI cross-reference in Q04714 (Mus spretus). I've removed this from the UniProt record, which should address the problem, as the GO terms will no longer be associated with the spretus entry. This change will take a while to filter through. The updated UniProt record will be public as part of UniProt release 2020_02 on 22nd April. The problematic GO terms will probably be removed before then, probably some time in March.

pgaudet commented 4 years ago

Hi @magrane

Thanks for dealing with this one! Do you have a way to look for IDs that have been mapped twice like this one ? Because I think many annotations (if not all?) manually assigned to Mus spretus have the same problem. For example: P20765 has a link to MGI:96529.

This may also be the case for other mouse spp.: Mus musculus molossinus, Mus spicilegus, Mus caroli.

As far as I know only Mus musculus should have cross-links to MGI entries. @hdrabkin please confirm
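
One possible way to look for such cases, as a minimal sketch: it assumes the mapping file uses the one-line-per-source layout Tony quoted earlier (`MGI:MGI:88127 UniProtKB:P01887;UniProtKB:Q04714`) and that an accession-to-taxon lookup is available from elsewhere. The function name `multi_taxon_mappings` and the `taxon_of` dictionary are hypothetical; the two taxa shown are the ones reported in this thread.

```python
def multi_taxon_mappings(mapping_lines, taxon_of):
    """Return source IDs whose mapped accessions belong to more than one taxon.

    mapping_lines: lines such as 'MGI:MGI:88127 UniProtKB:P01887;UniProtKB:Q04714'
    taxon_of: dict from accession to NCBI taxon ID (hypothetical lookup)
    """
    flagged = {}
    for line in mapping_lines:
        if not line.strip():
            continue
        source_id, targets = line.split(None, 1)
        accessions = [t.strip() for t in targets.split(";") if t.strip()]
        # Accessions with no known taxon are ignored rather than guessed.
        taxa = {taxon_of[acc] for acc in accessions if acc in taxon_of}
        if len(taxa) > 1:
            flagged[source_id] = sorted(accessions)
    return flagged


# Taxa as reported in this thread: P01887 is Mus musculus, Q04714 is Mus spretus.
taxon_of = {"UniProtKB:P01887": 10090, "UniProtKB:Q04714": 10096}
lines = ["MGI:MGI:88127 UniProtKB:P01887;UniProtKB:Q04714"]
flagged = multi_taxon_mappings(lines, taxon_of)
```

Here `flagged` would contain MGI:MGI:88127 with both accessions, since they resolve to two different taxa.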

Thanks, Pascale

hdrabkin commented 4 years ago

It is the only taxon ID for mouse (10090) that I see in the gaf or gpi files

pgaudet commented 4 years ago

OK, so these would come from incorrect xrefs in the UniProt entries then

@alexsign @magrane Can you please remove them ? Only M. musculus entries should link out to MGI

Thanks, Pascale

magrane commented 4 years ago

@hdrabkin

Hi Harold, Can you confirm that MGI records are only applicable to Mus musculus? I'm not talking about what's in the gaf/gpi files but this is a more general question of what we should link to from UniProt. We have about 100 UniProt records for mouse species other than M.musculus which have a link to MGI. We can remove these if this is not appropriate but can you confirm that these MGI links should be removed from UniProt? Thanks!

hdrabkin commented 4 years ago

Hi Michelle; can you provide a file with these 100 IDs? I ask because I have just grepped our load file (uniprotmus.dat) and cannot find spretus (taxon 10096), although I can find 10090 (musculus). (The code for the load predates my mgd contact!)

magrane commented 4 years ago

Hi Harold, I've added a file here with a list of Swiss-Prot ACs for mouse entries which have an MGI xref but are not from taxon 10090. Let me know if the MGI xrefs should be removed from these. non_10090.txt
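
A list like this could be produced from a UniProtKB flat file roughly as follows. This is a minimal sketch, not the query actually used: it relies only on the standard flat-file line codes (`AC`, `OX`, `DR`, and `//` as the entry terminator), and the function name `non_musculus_mgi_entries` is hypothetical.

```python
import re

TAXID_RE = re.compile(r"NCBI_TaxID=(\d+)")


def non_musculus_mgi_entries(lines, expected_taxon="10090"):
    """Return (primary accession, taxon, MGI IDs) for entries that carry an
    MGI cross-reference but whose OX line names a different taxon."""
    hits = []
    accession, taxon, mgi_ids = None, None, []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith("AC   ") and accession is None:
            # The first accession on the first AC line is the primary one.
            accession = line[5:].split(";")[0].strip()
        elif line.startswith("OX   "):
            match = TAXID_RE.search(line)
            if match:
                taxon = match.group(1)
        elif line.startswith("DR   MGI;"):
            mgi_ids.append(line.split(";")[1].strip())
        elif line.startswith("//"):
            # End of entry: report it, then reset state for the next entry.
            if mgi_ids and taxon != expected_taxon:
                hits.append((accession, taxon, mgi_ids))
            accession, taxon, mgi_ids = None, None, []
    return hits
```

The same scan with a different `expected_taxon` could be pointed at uniprotmus.dat to check which non-10090 records that file contains.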

pgaudet commented 4 years ago

@magrane I cannot get the file - can you paste the list or put it in a Google doc ? (am I the only one having this issue?)

magrane commented 4 years ago

Here's the list: O35524 P48057 Q5TM83 Q04714 Q8R4S5 O35521 Q64531 O35522 Q8R4S7 Q8R4S2 Q8R4S4 Q08867 Q8R4S6 Q6H1L8 Q62563 P82457 Q62565 P27119 Q9QZ71 P82456 P63240 A7XZ53 Q9R032 O35893 P49431 Q9R031 Q921C6 Q7TNN8 P20765 Q9QX22 P26595 O08615 Q2L4X1 Q63969 D3KU67 P82185

hdrabkin commented 4 years ago

We do not load most of these into MGI because they are not in the uniprotmus.dat file except for these 6

Q5TM83 loads to Nanog; NCBI_TaxID=57486; Mus musculus molossinus
Q8R4S5 and Q8R4S6 loads to Ahr; NCBI_TaxID=57486
O35522 loads to Psmb9; NCBI_TaxID=35531; Mus musculus bactrianus Blyth, 1846
Q9R031 loads to Xpr1; NCBI_TaxID=10091; Mus musculus castaneus
Q2L4X1 loads to Bzw2; NCBI_TaxID=57486; Mus musculus molossinus
D3KU67 loads to Asmt; NCBI_TaxID=57486; Mus musculus molossinus

Example: D3KU67, in the uniprotmus.dat record, is clearly marked OX NCBI_TaxID=57486; its UniProt ID is ASMT_MUSMM (musculus)? But there is a mapping to the MGI gene Asmt in the file. How was this done?

However, most are not in the load file. I'm not sure why they are not in the load file and the 6 above are. And there are no 10096.

Some, like P82185, appear at UniProt to be associated with Mus musculus and no other mouse strain. But I'm wondering whether, going by the name DCAM2_MUSSP, it is supposed to be spretus and not musculus. There are no 10096 (Mus spretus) in the file.

Bottom line: This is NOT a GO issue. This should be moved to another discussion. The GO annotation issue was

  1. protein@GO entries made for non-musculus proteins by transferring the musculus annotations. This was very wrong because the paper used B6 mice, which are musculus. I assume this was some automated transfer via UniProt, as it mapped anything annotated to the gene in our gaf to any UniProt entry that is associated with that gene. I would feel better if the transfer were only done to the reference proteome protein.

Michelle, I'll have to have Judy weigh in as to what she would prefer (ie, removing the MGI links)

Harold

hdrabkin commented 4 years ago

OK, @magrane, Judy says to leave the xrefs to those 100. It's not a huge issue at this time. At some point we will need, as a db, to explicitly track these things as more and more strains get added, etc.

pgaudet commented 4 years ago

OK. But these entries are getting IDA annotations from MGI: http://amigo.geneontology.org/amigo/gene_product/UniProtKB:Q08867

Which I assumed were not assigned.

@hdrabkin that doesn't seem OK ?

hdrabkin commented 4 years ago

The annotations to Q08867 should NOT be made because they reference a paper that used Black 6 (10090). Apparently at UniProt, they have a record that maps to both 10090 and 10096. If UniProt removes the 10096 mapping, that should solve it. We only load the P01887; the Q08867 ID is not in MGI at all. Only the P01887 is getting an MGI IDA annotation; actually, the annotation was made at MGI at the gene level. The automatic stuff UniProt does really should NOT attribute the annotation to MGI for the Q08867.

krchristie commented 4 years ago

> The annotations to Q08867 should NOT be made because they reference a paper used Black 6 (10090); Apparently at UniProt, they have a record that maps to both 10090 and 10096. If UniProt removes the 10096 mapping that should solve it. We only load the P01887; the Q08867 id is not in MGI at all. Only the P01887 is getting an MGI IDA annotation, actually, the annotation was made at MGI at the gene level. The automatic stuff UniProt does really should NOT attribute the annotation to MGI for the Q08867.

Seems that we might want to consider rules for transferring annotations from one closely related species to another at this kind of level. Not only did MGI not make this annotation, it seems false to say that there is experimental information for Mus spretus when the experiments were done in the laboratory mouse (musculus) and not in spretus. Transferring this annotation would come across differently if it was labelled as sequence similarity, rather than experimental.

hdrabkin commented 4 years ago

@krchristie yes indeed; and the ISS should then be attributed to UniProt/GOA, and not the MGI

pgaudet commented 4 years ago

Thanks @krchristie and @hdrabkin

I have only spot-checked, but as far as I can tell all the proteins listed by @magrane have the same issue. The UniProt-MGI mapping somehow seems to bring in annotations in both records.

Thanks, Pascale

magrane commented 4 years ago

As I understand from one of Tony's comments earlier in this thread, the transfer of GO annotations from Mus musculus to non-musculus species is happening because of something on our side which transfers GO terms assigned by MGI for a particular MGI identifier to all entries in UniProt which contain an xref to the same MGI identifier.

There are 2 ways we could resolve this:
1) remove MGI links from non-musculus species, but Judy doesn't seem to want us to do this
2) change our transfer process so that it only transfers annotations within taxon 10090.

@hdrabkin Can you confirm that transferring only within taxon 10090 is the correct thing to do? Does MGI ever assign GO terms to mouse taxons other than 10090?
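
Option 2 could look roughly like this. It is a minimal sketch under stated assumptions, not the UniProt transfer process itself: `transfer_within_taxon` and the `taxon_of` lookup are hypothetical, annotations are reduced to (subject, GO term) pairs, and whether 10090 alone is the right filter is exactly the open question above.

```python
MUS_MUSCULUS = 10090


def transfer_within_taxon(annotations, mapping, taxon_of, allowed_taxon=MUS_MUSCULUS):
    """Copy each (subject, GO term) annotation only to mapped targets
    whose taxon equals allowed_taxon; other targets are skipped."""
    for subject, go_id in annotations:
        for target in mapping.get(subject, []):
            if taxon_of.get(target) == allowed_taxon:
                yield target, go_id


mapping = {"MGI:MGI:88127": ["UniProtKB:P01887", "UniProtKB:Q04714"]}
# Taxa as reported in this thread.
taxon_of = {"UniProtKB:P01887": 10090, "UniProtKB:Q04714": 10096}
annotations = [("MGI:MGI:88127", "GO:0006826"), ("MGI:MGI:88127", "GO:0033216")]

transferred = list(transfer_within_taxon(annotations, mapping, taxon_of))
```

With this filter the Mus spretus entry Q04714 receives nothing, even though the xref that maps it to MGI:88127 is left in place.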