geneontology / go-site

A collection of metadata, tools, and files associated with the Gene Ontology public web presence.
http://geneontology.org
BSD 3-Clause "New" or "Revised" License

Investigate drop in RGD annotations #1186

Closed by cmungall 5 years ago

cmungall commented 5 years ago

The GAF dropped from 449680 to 308694 lines.

I don't see any overall pattern in the drop

In some cases it looks like valid QC, e.g. many annotations to 'protein binding' are gone.

I see some PMIDs have dropped out altogether, e.g.

http://amigo.geneontology.org/amigo/reference/PMID:9925767

The pub is from 1999. I have not read it, but I see no reason to suspect it is invalid or that its annotations should be purged.

If additional QC has been done on some of these and there was a conscious decision to deem a PMID not useful for GO curation, it would be awesome to record this somewhere; this is Frederic Bastian's proposal.

Or potentially this is an extreme redundancy trimming...?

pgaudet commented 5 years ago

Some more stats:

| Evidence code | Current release | Next release |
|---|---|---|
| EXP | 683 | 250 |
| HDA | 48 | 41 |
| IBA | 59858 | 58535 |
| IC | 215 | 179 |
| IDA | 30594 | 24341 |
| IEA | 127682 | 56498 |
| IEP | 10507 | 10159 |
| IGI | 313 | 243 |
| IKR | 5 | 5 |
| IMP | 9417 | 8102 |
| IPI | 7521 | 3886 |
| ISO | 168257 | 132890 |
| ISS | 23847 | 3817 |
| NAS | 640 | 563 |
| ND | 6604 | 6589 |
| TAS | 3464 | 2571 |

| Assigned by | Current release | Next release |
|---|---|---|
| AgBase | 435 | 186 |
| Alzheimers_University_of_Toronto | 93 | 22 |
| ARUK-UCL | 640 | 511 |
| BHF-UCL | 2011 | 1029 |
| CACAO | 70 | 39 |
| CAFA | 439 | 300 |
| DFLAT | 4 | 1 |
| dictyBase | 1 | 1 |
| Ensembl | 76418 | 13929 |
| FlyBase | 7 | 5 |
| GOC | 6188 | 5145 |
| GO_Central | 59919 | 58550 |
| HGNC | 542 | 175 |
| IntAct | 1508 | 137 |
| InterPro | 14424 | 11603 |
| MGI | 816 | 534 |
| NTNU_SB | 88 | 68 |
| ParkinsonsUK-UCL | 823 | 360 |
| PINC | 10 | 10 |
| Reactome | 898 | 252 |
| RGD | 227163 | 185959 |
| SynGO | 2646 | 1563 |
| SynGO-UCL | 56 | 39 |
| UniProt | 54288 | 28095 |
| WB | 14 | 9 |
| YuBioLab | 154 | 147 |

Pascale
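
For anyone wanting to reproduce per-evidence-code and per-source counts like the ones above, here is a minimal sketch. It assumes the standard GAF 2.x column layout (column 7 is the evidence code, column 15 is "Assigned By") and that both files have been read into lists of lines; the function names are hypothetical and this is not necessarily how the numbers in this thread were produced.

```python
from collections import Counter

# 0-based indexes into a tab-split GAF 2.x row (assumed layout).
EVIDENCE_COL = 6      # GAF column 7: evidence code
ASSIGNED_BY_COL = 14  # GAF column 15: assigned by


def count_column(gaf_lines, col):
    """Count the values found in one column, skipping '!' header lines."""
    counts = Counter()
    for line in gaf_lines:
        if line.startswith("!") or not line.strip():
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) > col:
            counts[fields[col]] += 1
    return counts


def compare_counts(old_lines, new_lines, col):
    """Return {value: (count_in_old, count_in_new)} for every value seen."""
    old = count_column(old_lines, col)
    new = count_column(new_lines, col)
    return {key: (old.get(key, 0), new.get(key, 0))
            for key in sorted(set(old) | set(new))}
```

Calling `compare_counts(old_lines, new_lines, EVIDENCE_COL)` gives the evidence-code breakdown, and passing `ASSIGNED_BY_COL` gives the per-source breakdown.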

cmungall commented 5 years ago

1775 PMIDs have been dropped in the latest RGD GAF
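
A rough sketch of how a dropped-PMID list could be computed, assuming the GAF 2.x layout in which column 6 holds pipe-separated references. The helper names are hypothetical, and this is not necessarily the method used to get the figure above.

```python
REFERENCE_COL = 5  # GAF column 6 (0-based index 5): DB:Reference, pipe-separated


def pmids(gaf_lines):
    """Collect every PMID:... reference that appears in the given GAF lines."""
    found = set()
    for line in gaf_lines:
        if line.startswith("!"):
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) <= REFERENCE_COL:
            continue
        for ref in fields[REFERENCE_COL].split("|"):
            if ref.startswith("PMID:"):
                found.add(ref)
    return found


def dropped_pmids(old_lines, new_lines):
    """PMIDs present in the old file but absent from the new one, sorted."""
    return sorted(pmids(old_lines) - pmids(new_lines))
```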

pgaudet commented 5 years ago

Spot checking a specific protein (the one with the largest changes), RGD:70487: we went from 550 annotations to 260. Some redundant IEA/ISO annotations were removed, which is nice, but many annotations from external sources (such as BHF) are also no longer in the new dataset. For example, RGD:70487 GO:0010629 BHF-UCL is not in the new file, but it's still in Protein2GO.

Pascale
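
The spot check above could be scripted roughly as follows. This is a sketch under assumptions: GAF 2.x columns (1 = DB, 2 = object ID, 5 = GO ID, 7 = evidence code, 15 = assigned by), both files read into lists of lines, and hypothetical function names.

```python
from collections import Counter


def _rows(gaf_lines):
    """Yield tab-split fields for each non-header GAF line."""
    for line in gaf_lines:
        if line.startswith("!") or not line.strip():
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) >= 15:
            yield fields


def largest_drops(old_lines, new_lines, top=5):
    """Gene products that lost the most annotations between two files."""
    old = Counter(f"{f[0]}:{f[1]}" for f in _rows(old_lines))
    new = Counter(f"{f[0]}:{f[1]}" for f in _rows(new_lines))
    drops = [(gene, old[gene] - new[gene]) for gene in old]
    drops.sort(key=lambda item: item[1], reverse=True)
    return drops[:top]


def missing_annotations(old_lines, new_lines, gene):
    """(GO ID, evidence code, assigned by) tuples lost for one gene product."""
    def keys(lines):
        return {(f[4], f[6], f[14]) for f in _rows(lines)
                if f"{f[0]}:{f[1]}" == gene}
    return sorted(keys(old_lines) - keys(new_lines))
```

For instance, `missing_annotations(old_lines, new_lines, "RGD:70487")` would list the annotations that disappeared for that gene product, including those from external sources.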

pgaudet commented 5 years ago

@slaulederkind @gthayman

pgaudet commented 5 years ago

@jrsjrs @tutajm Thanks for the quick response.

tutajm commented 5 years ago

In RGD, we have implemented additional QC to prevent submission of a GAF file whose size differs substantially from that of previously submitted GAF files.

I apologize to everyone for the problem.
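
As an illustration only, a size check of this kind might look like the sketch below. The 10% threshold and the function name are assumptions for the example; the thread does not say what RGD's actual QC uses.

```python
def size_change_ok(previous_count, new_count, max_relative_change=0.10):
    """Return True if the new line count is within the allowed relative change.

    max_relative_change is a hypothetical threshold (10% by default).
    """
    if previous_count <= 0:
        raise ValueError("previous_count must be positive")
    change = abs(new_count - previous_count) / previous_count
    return change <= max_relative_change
```

With the line counts reported at the top of this issue, `size_change_ok(449680, 308694)` returns `False`, since the file shrank by roughly 31%.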