Open pgaudet opened 6 years ago
I have been complaining about both (forever).
BUT all of these annotations are present twice !
These look like different Mus species, no? Or am I missing something else?
Doh!
@vanaukenk Well that's surprising. The same paper tested two mouse species?
If there is to be a discussion of what is going on with upstream data, I would suggest opening a "cause" ticket in go-annotation
--I'm pretty much the only active person around here.
@hdrabkin Can you please have a look and let us know whether this is right ?
First glance: the gene names are supposed to be the same for the same symbol (B2m, beta-2 microglobulin). Something weird. In our Ei (which is looking at our db), we only have 4 annotations using this PMID, and we have nothing with a PANTHER ID in it. I just grepped our current annotation file. It is there 4 times, not 8.
@hdrabkin Do you export annotations to Mus spretus? This is the first time I see them.
I guess they COULD be in the gpad and not the gaf, but I don't see how we would store them in our DB (that is, I can't look at them). I just grepped gpa and gaf and cannot find an instance of taxon:10096.
We do append things to the gpad/gafs that we get from GOA that we can't load but then I would see the annotations when I grepped for the PMID; but there are only 4 instances of the PMID. I just love Mondays.
There are 8 in P2GO; @tonysawfordebi any idea what is happening ?
We don't have them in MGI. Can you tell me if an MGI curator made them? I don't have access to p2go myself.
If not they would be in our GOA load and stripped because we would consider them duplicate annotations anyways, as reflected in the fact that they are not in the GAF and GPAD we export (I can only find 8); I don't see how AMIGO would have more than we have in our gaf and gpad.
wondering if the GOA_mouse gaf we load only has taxon 10090? Yes, the file we get does NOT have 10096; confirmed; not in goa_mouse.gaf.gz
It shows up the same as the other MGI annotations - at any rate not like they were done in P2GO.
I have no idea whatsoever; we don't have them here in our db; I don't know what is going on! I'm gonna go git (haha) a beer.
Well, let's see what @tonysawfordebi says tomorrow.
Looking at the upstream data for AmiGO:
sjcarbon@moiraine:/tmp$:( grep 1927183 mgi.gaf
MGI MGI:88127 B2m GO:0006826 MGI:MGI:1203460|PMID:9531620 IMP MGI:MGI:1927183 P beta-2 microglobulin beta2-m|beta 2 microglobulin|Ly-m11 protein taxon:10090 20160920 MGI
MGI MGI:88127 B2m GO:0033216 MGI:MGI:1203460|PMID:9531620 IMP MGI:MGI:1927183 P beta-2 microglobulin beta2-m|beta 2 microglobulin|Ly-m11 protein taxon:10090 20160920 MGI
MGI MGI:88127 B2m GO:0045646 MGI:MGI:1203460|PMID:9531620 IMP MGI:MGI:1927183 P beta-2 microglobulin beta2-m|beta 2 microglobulin|Ly-m11 protein taxon:10090 20160920 MGI
MGI MGI:88127 B2m GO:0071283 MGI:MGI:1203460|PMID:9531620 IMP MGI:MGI:1927183 P beta-2 microglobulin beta2-m|beta 2 microglobulin|Ly-m11 protein taxon:10090 20160920 MGI
sjcarbon@moiraine:/tmp$:) grep 1927183 goa_uniprot_all_noiea.gaf
UniProtKB Q04714 B2m GO:0006826 PMID:9531620 IMP MGI:MGI:1927183 P Beta-2-microglobulin B2m protein taxon:10096 20160920 MGI
UniProtKB Q04714 B2m GO:0033216 PMID:9531620 IMP MGI:MGI:1927183 P Beta-2-microglobulin B2m protein taxon:10096 20160920 MGI
UniProtKB Q04714 B2m GO:0045646 PMID:9531620 IMP MGI:MGI:1927183 P Beta-2-microglobulin B2m protein taxon:10096 20160920 MGI
UniProtKB Q04714 B2m GO:0071283 PMID:9531620 IMP MGI:MGI:1927183 P Beta-2-microglobulin B2m protein taxon:10096 20160920 MGI
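For anyone who wants to repeat this check without eyeballing grep output, here is a minimal sketch that groups GAF rows for one PMID by taxon. It assumes the standard tab-separated GAF layout (reference in column 6, taxon in column 13); the function name and the two sample rows (rebuilt by hand from the output above, with tabs and the empty qualifier column restored) are illustrative only.

```python
from collections import defaultdict

# Zero-based indexes for the tab-separated GAF columns used here.
OBJECT_DB_COL = 0
OBJECT_ID_COL = 1
GO_ID_COL = 4
REFERENCE_COL = 5   # pipe-separated list of references
TAXON_COL = 12


def taxa_by_reference(gaf_lines, pmid):
    """Map each taxon to the (object ID, GO ID) pairs annotated from `pmid`."""
    hits = defaultdict(list)
    for line in gaf_lines:
        if line.startswith("!"):  # GAF header and comment lines
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) <= TAXON_COL:
            continue  # skip malformed or truncated rows
        if pmid in cols[REFERENCE_COL].split("|"):
            obj = f"{cols[OBJECT_DB_COL]}:{cols[OBJECT_ID_COL]}"
            hits[cols[TAXON_COL]].append((obj, cols[GO_ID_COL]))
    return dict(hits)


# Hand-rebuilt sample rows (assumption: original files are tab-separated).
sample = [
    "MGI\tMGI:88127\tB2m\t\tGO:0006826\tMGI:MGI:1203460|PMID:9531620\tIMP\t"
    "MGI:MGI:1927183\tP\tbeta-2 microglobulin\t"
    "beta2-m|beta 2 microglobulin|Ly-m11\tprotein\ttaxon:10090\t20160920\tMGI",
    "UniProtKB\tQ04714\tB2m\t\tGO:0006826\tPMID:9531620\tIMP\t"
    "MGI:MGI:1927183\tP\tBeta-2-microglobulin\tB2m\tprotein\t"
    "taxon:10096\t20160920\tMGI",
]
result = taxa_by_reference(sample, "PMID:9531620")
for taxon, pairs in sorted(result.items()):
    print(taxon, pairs)
```

Run over both upstream files, a PMID that shows up under more than one taxon is a candidate for the kind of duplication discussed here.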
Sounds like this is an upstream issue from amigo. Open a ticket in go-annotation as Seth suggests, assign to someone who can go into p2go and find the provenance of these.
I just want to know that I understand it. So there are 4 annotations supplied by GOA in the all_noIEA file that are for taxon 10096? So that is why this is not in our gaf; it's not from MGI, even though it is being mapped to the MGI gene but for a different species? BTW, looking at PMID:9531620, in the M&M, they state that they are using C57BL/6 (B6). This is Mus musculus. The alleles used in the with are from Mus musculus. I suggest these be deleted unless I'm missing something here.
Looking in the MGI curation interface (view is sorted by paper to show all annotations from J:47457, aka PMID:9531620), MGI (specifically Dmitry) made four annotations from this paper, dated 2016-09-20.
Then, in P2GO, there are now 8 annotations all credited to MGI with the same date (requires mouseover of calendar icon in P2GO to view date)
So, it looks to me like the issue is in how the original MGI annotations were propagated to UniProt. It seems incorrect to propagate these musculus annotations to spretus with the same experimental evidence code, so I agree that these should be removed.
MGI can not remove the spretus annotations because we did not make them in our interface and since P2GO says they are from MGI, they cannot be edited in P2GO.
They are going to need to be removed by whatever pipeline duplicated MGI's annotations to spretus.
Yes I originally commented the same on the ticket; really weird.
From: Karen R Christie notifications@github.com Reply-To: geneontology/amigo reply@reply.github.com Date: Thursday, August 16, 2018 at 12:05 PM To: geneontology/amigo amigo@noreply.github.com Cc: me Harold.Drabkin@jax.org, Mention mention@noreply.github.com Subject: Re: [geneontology/amigo] Duplicate annotations (#531)
[20180816-mgi-ei-sourceannots]https://user-images.githubusercontent.com/10533218/44217413-43a01e00-a12c-11e8-86e8-f41ebeb11e9a.jpg
[20180816-p2goannots]https://user-images.githubusercontent.com/10533218/44217429-4c90ef80-a12c-11e8-8854-8fc429a9bb99.jpg
@pgaudet I was away last week, hence the delayed reply - sorry!
I've looked at the case you mentioned, and have an explanation.
In the original MGI GPAD, there are indeed just four annotations, all to MGI:MGI:88127.
However, in the mapping file that we use to translate MGI identifiers to UniProtKB identifiers, and whose content is derived from UniProt cross-references, this particular MGI ID is mapped to two UniProtKB IDs:
MGI:MGI:88127 UniProtKB:P01887;UniProtKB:Q04714
Hence when we import the MGI annotations, those four annotations to MGI:MGI:88127 are expanded to eight, four to UniProtKB:P01887, and four to UniProtKB:Q04714.
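To make the expansion concrete, here is a minimal sketch of a one-to-many ID translation. It is not the actual GOA import code; the function names are hypothetical, and the only assumption about the mapping file is the line shape shown above (source ID, whitespace, then a semicolon-separated list of target IDs).

```python
def parse_mapping(lines):
    """Parse 'SOURCE TARGET;TARGET' lines into {source: [targets]}."""
    mapping = {}
    for line in lines:
        parts = line.split()
        if len(parts) != 2:
            continue  # ignore blank or unexpected lines
        source, targets = parts
        mapping[source] = [t for t in targets.split(";") if t]
    return mapping


def expand(annotations, mapping):
    """Emit one annotation per mapped target; keep unmapped IDs unchanged."""
    expanded = []
    for object_id, go_id in annotations:
        for target in mapping.get(object_id, [object_id]):
            expanded.append((target, go_id))
    return expanded


mapping = parse_mapping(["MGI:MGI:88127 UniProtKB:P01887;UniProtKB:Q04714"])
mgi_annotations = [
    ("MGI:MGI:88127", go_id)
    for go_id in ("GO:0006826", "GO:0033216", "GO:0045646", "GO:0071283")
]
expanded = expand(mgi_annotations, mapping)
print(len(mgi_annotations), "->", len(expanded))  # 4 -> 8
```

Nothing in the expansion step looks at taxon, so a single stale cross-reference is enough to double the annotation count.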
Hi @tonysawfordebi
Welcome back !
How are these mappings generated ? It's surprising that the same ID goes to two proteins from different species.
Thanks, Pascale
@pgaudet Thanks :) I don't know - that would be a question for the UniProt folk. If the relevant people are around, I'll ask them.
BTW, in P2G if you click on the little chain link icon at the right hand end of an imported annotation you can see the details of the original - pre-ID-translation - version of the annotation:
I've had a chat with the relevant UniProt authority, and it seems that in a previous version of the mapping file that was supplied by MGI to UniProt MGI:88127 was indeed mapped to P01887 and Q04714.
However, the mapping to Q04714 has subsequently been deleted from the MGI-supplied file. This change will eventually filter through into the mapping files that we use in the import process, but unfortunately we can't specify any timescale for this, as it requires action on the part of Swiss-Prot curators to integrate this change into UniProt.
Great, thanks. So I guess this is fixed. I'd like to see the change in AmiGO before closing.
@hdrabkin just want to make sure you've seen this.
Thanks, Pascale
@kltm Is the 'blocked/upstream' label OK for this?
Thanks. I'm still scratching my head: "the mapping to Q04714 has subsequently been deleted from the MGI-supplied file." Q04714 is not in our system. If you put that ID into MGI you get nothing back. It is unclear to me what is meant by "MGI-supplied file". I ask because I'd like to find others if they exist.
OK, I just found more by searching in AmiGO for other Mus taxa. The annotations are being attributed to MGI, but MGI cannot make these annotations to these specific taxa, so they must also be covered by Tony's explanation. 17 annotations to M. spretus for Hgprt and a lot more (333 total). 96 for another one. @tonysawfordebi can you tell me what IDs you have for Hgprt from MGI? I assume one of them is an incorrect assignment.
@hdrabkin Does MGI have a UniProt mapping file to other mouse species ? It looks like it's the mapping file that caused the problems, not the actual annotations.
Pascale
I see this http://www.informatics.jax.org/downloads/reports/MRK_SwissProt_TrEMBL.rpt I'm looking for non-musculus ids for Hgprt now.
There are annotations to M. spretus for Mid1. The report file has these associated with this gene; none of these appear to be spretus: O70583 Q6PD02 Q3UXC7 Q3TVH5 B1AV01 B1AUZ9. So the spretus annotations for Mid1 should be removed also.
@hdrabkin @alexsign This is not yet fixed- MGI:88127 is still mapped to both P01887 (Mus musculus ) and Q04714 (Mus spretus).
From @hdrabkin 's comment this is not what we want. I am not sure whether or not it impacts annotations, but see for eg PMID:16299293 - are there strains from both species being tested, or do we have additional annotations due to the spurious mapping ?
Thanks, Pascale
Alex passed on this issue for me to have a look at. As far as I understand from reading through the ticket, the source of the problem is an incorrect MGI cross-reference in Q04714 (Mus spretus). I've removed this from the UniProt record, which should address the problem as the GO terms will no longer be associated with the spretus entry. This change will take a while to filter through. The updated UniProt record will be public as part of UniProt release 2020_02 on 22nd April. The problematic GO terms will probably be removed before then, probably some time in March.
Hi @magrane
Thanks for dealing with this one! Do you have a way to look for IDs that have been mapped twice like this one ? Because I think many annotations (if not all?) manually assigned to Mus spretus have the same problem. For example: P20765 has a link to MGI:96529.
This may also be the case for other mouse spp: Mus musculus molossinus Mus spicilegus Mus caroli
As far as I know only Mus musculus should have cross-links to MGI entries. @hdrabkin please confirm
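One possible way to look for other doubly mapped IDs, sketched under the assumption that the translation file uses the same line shape Tony quoted earlier (`MGI:MGI:88127 UniProtKB:P01887;UniProtKB:Q04714`). The function name and the second sample line are hypothetical placeholders, not real records.

```python
def multi_mapped(lines):
    """Return {source_id: [targets]} for sources mapped to more than one target."""
    suspects = {}
    for line in lines:
        parts = line.split()
        if len(parts) != 2:
            continue  # ignore blank or unexpected lines
        source, targets = parts[0], [t for t in parts[1].split(";") if t]
        if len(targets) > 1:
            suspects[source] = targets
    return suspects


sample_lines = [
    "MGI:MGI:88127 UniProtKB:P01887;UniProtKB:Q04714",
    "MGI:MGI:0000000 UniProtKB:X00000",  # hypothetical single-target line
]
suspects = multi_mapped(sample_lines)
for source, targets in suspects.items():
    print(source, "->", ", ".join(targets))
```

A multi-target mapping is not wrong in itself, so the output is only a list to review; the taxon of each target still has to be checked.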
Thanks, Pascale
It is the only taxon ID for mouse (10090) that I see in the gaf or gpi files
OK, so these would come from incorrect xrefs in the UniProt entries then
@alexsign @magrane Can you please remove them ? Only M. musculus entries should link out to MGI
Thanks, Pascale
@hdrabkin
Hi Harold, Can you confirm that MGI records are only applicable to Mus musculus? I'm not talking about what's in the gaf/gpi files but this is a more general question of what we should link to from UniProt. We have about 100 UniProt records for mouse species other than M.musculus which have a link to MGI. We can remove these if this is not appropriate but can you confirm that these MGI links should be removed from UniProt? Thanks!
Hi Michelle; can you provide a file with these 100 IDs? I ask because I have just grepped our load file (uniprotmus.dat) and cannot find spretus (taxon 10096), although I can find 10090 (musculus). (The code for the load predates my mgd contact!)
Hi Harold, I've added a file here with a list of Swiss-Prot ACs for mouse entries which have an MGI xref but are not from taxon 10090. Let me know if the MGI xrefs should be removed from these. non_10090.txt
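A list like this could in principle be regenerated from a UniProt flat-file dump. The sketch below assumes the usual flat-file line codes (`AC` for accessions, `OX NCBI_TaxID=` for taxon, `DR MGI;` for the MGI cross-reference, `//` as the record terminator); the function name is hypothetical, and the sample records are hand-written stubs containing only those three line types, not real full entries.

```python
import re


def non_musculus_with_mgi_xref(flat_text, allowed_taxon="10090"):
    """List (accession, taxon, [MGI IDs]) for records outside `allowed_taxon`."""
    hits = []
    for entry in flat_text.split("\n//"):
        ac = re.search(r"^AC\s+(\w+);", entry, re.M)           # primary accession
        ox = re.search(r"^OX\s+NCBI_TaxID=(\d+)", entry, re.M)  # taxon ID
        mgi = re.findall(r"^DR\s+MGI;\s*(MGI:\d+);", entry, re.M)
        if ac and ox and mgi and ox.group(1) != allowed_taxon:
            hits.append((ac.group(1), ox.group(1), mgi))
    return hits


# Stub records for illustration only (assumption: real entries have many more lines).
sample = """\
AC   P01887;
OX   NCBI_TaxID=10090;
DR   MGI; MGI:88127; B2m.
//
AC   Q04714;
OX   NCBI_TaxID=10096;
DR   MGI; MGI:88127; B2m.
//
"""
hits = non_musculus_with_mgi_xref(sample)
for accession, taxon, mgi_ids in hits:
    print(accession, taxon, mgi_ids)
```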
@magrane I cannot get the file - can you paste the list or put it in a Google doc ? (am I the only one having this issue?)
Here's the list: O35524 P48057 Q5TM83 Q04714 Q8R4S5 O35521 Q64531 O35522 Q8R4S7 Q8R4S2 Q8R4S4 Q08867 Q8R4S6 Q6H1L8 Q62563 P82457 Q62565 P27119 Q9QZ71 P82456 P63240 A7XZ53 Q9R032 O35893 P49431 Q9R031 Q921C6 Q7TNN8 P20765 Q9QX22 P26595 O08615 Q2L4X1 Q63969 D3KU67 P82185
We do not load most of these into MGI because they are not in the uniprotmus.dat file except for these 6
Q5TM83 loads to Nanog; NCBI_TaxID=57486; Mus musculus molossinus
Q8R4S5 and Q8R4S6 load to Ahr; NCBI_TaxID=57486
O35522 loads to Psmb9; NCBI_TaxID=35531; Mus musculus bactrianus Blyth, 1846
Q9R031 loads to Xpr1; NCBI_TaxID=10091; Mus musculus castaneus
Q2L4X1 loads to Bzw2; NCBI_TaxID=57486; Mus musculus molossinus
D3KU67 loads to Asmt; NCBI_TaxID=57486; Mus musculus molossinus
Example: D3KU67, in the uniprotmus.dat record, is clearly marked OX NCBI_TaxID=57486; its UniProt ID is ASMT_MUSMM (musculus)? But there is a mapping to the MGI gene Asmt in the file. How was this done?
However, most are not in the load file. Not sure why they are not in the load file and the 6 above are. And there are no 10096.
Some, like P82185, appear at UniProt to be associated with Mus musculus and no other mouse strain. But I'm wondering if, by the name DCAM2_MUSSP, it is supposed to be spretus and not musculus. There are no 10096 (Mus spretus) in the file.
Bottom line: This is NOT a GO issue. This should be moved to another discussion. The GO annotation issue was
Michelle, I'll have to have Judy weigh in as to what she would prefer (ie, removing the MGI links)
Harold
OK, @magrane, Judy says to leave the xrefs to those 100. It's not a huge issue at this time. At some point we will need, as a db, to explicitly track these things as more and more strains get added, etc.
OK. But these entries are getting IDA annotations from MGI: http://amigo.geneontology.org/amigo/gene_product/UniProtKB:Q08867
Which I assumed were not assigned.
@hdrabkin that doesn't seem OK ?
The annotations to Q08867 should NOT be made because they reference a paper that used Black 6 (10090). Apparently at UniProt, they have a record that maps to both 10090 and 10096. If UniProt removes the 10096 mapping, that should solve it. We only load the P01887; the Q08867 ID is not in MGI at all. Only the P01887 is getting an MGI IDA annotation; actually, the annotation was made at MGI at the gene level. The automatic stuff UniProt does really should NOT attribute the annotation to MGI for the Q08867.
Seems that we might want to consider rules for transferring annotations from one closely related species to another at this kind of level. Not only did MGI not make this annotation, it seems false to say that there is experimental information for Mus spretus when the experiments were done in the laboratory mouse (musculus) and not in spretus. Transferring this annotation would come across differently if it was labelled as sequence similarity, rather than experimental.
@krchristie yes indeed; and the ISS should then be attributed to UniProt/GOA, and not the MGI
Thanks @krchristie and @hdrabkin
I have only spot-checked, but as far as I can tell all the proteins listed by @magrane have the same issue. The UniProt-MGI mapping somehow seems to bring in annotations in both records.
Thanks, Pascale
As I understand from one of Tony's comments earlier in this thread, the transfer of GO annotations from Mus musculus to non-musculus species is happening because of something on our side which transfers GO terms assigned by MGI for a particular MGI identifier to all entries in UniProt which contain an xref to the same MGI identifier.
There are 2 ways we could resolve this: 1) remove MGI links from non-musculus species but Judy doesn't seem to want us to do this 2) change our transfer process so that it only transfers annotations within taxon 10090.
@hdrabkin Can you confirm that transferring only within taxon 10090 is the correct thing to do? Does MGI ever assign GO terms to mouse taxons other than 10090?
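For option 2, a rough sketch of what a taxon-restricted transfer might look like. This is not our actual transfer code; the function name and the `taxon_of` lookup are hypothetical, with the two taxon assignments taken from earlier in this thread (P01887 is Mus musculus, Q04714 is Mus spretus).

```python
def transfer_within_taxon(annotations, mapping, taxon_of, allowed_taxon="taxon:10090"):
    """Copy each annotation only to mapped targets in `allowed_taxon`."""
    transferred = []
    for object_id, go_id in annotations:
        for target in mapping.get(object_id, []):
            if taxon_of.get(target) == allowed_taxon:
                transferred.append((target, go_id))
    return transferred


mapping = {"MGI:MGI:88127": ["UniProtKB:P01887", "UniProtKB:Q04714"]}
# Hypothetical lookup table; in practice this would come from the UniProt records.
taxon_of = {
    "UniProtKB:P01887": "taxon:10090",
    "UniProtKB:Q04714": "taxon:10096",
}
mgi_annotations = [
    ("MGI:MGI:88127", go_id)
    for go_id in ("GO:0006826", "GO:0033216", "GO:0045646", "GO:0071283")
]
transferred = transfer_within_taxon(mgi_annotations, mapping, taxon_of)
print(len(transferred))  # 4
```

Targets with no known taxon are dropped here rather than passed through, which errs on the side of not creating annotations; whether that is the right default depends on the answer to the question above.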
Hello,
I don't know if similar cases have been reported (I thought previous reports had more to do with redundant but distinct annotations). These appear completely identical (one with a UniProt ID, one with an MGI ID).
Thanks, Pascale