geneontology / go-releases

Tasks and notes for monthly GO releases

QC Dicty files for new GOA-GOC data exchange pipeline #97

Open pgaudet opened 2 weeks ago

pgaudet commented 2 weeks ago

Differences between GOC and GOA files

GOC file: snapshot from 2024-11-04T20:23

GOA file: GOA FTP, file generated 2024-10-08 11:09

Files are on the GO Google Drive

Annotations by evidence:

Ev | GOA | GOC | diff (GOC - GOA)
-- | -- | -- | --
IEA | 37066 | 42618 | 5552
IBA | 17890 | 18298 | 408
IDA | 4210 | 4324 | 114
IMP | 3183 | 3227 | 44
IPI | 1200 | 1230 | 30
TAS | 436 | 447 | 11
HMP | 81 | 82 | 1
IGI | 569 | 570 | 1
HDA | 667 | 667 | 0
HEP | 127 | 127 | 0
IC | 145 | 145 | 0
IEP | 230 | 230 | 0
IGC | 79 | 79 | 0
IKR | 1 | 1 | 0
NAS | 6 | 6 | 0
ISS | 3451 | 3449 | -2
ND | 6309 | 6246 | -63
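A per-evidence tally like the one above could be reproduced with a short script. This is only a sketch: it assumes both files are tab-separated GAF 2.x, where the evidence code is column 7, and the function names `count_by_column` and `diff_counts` are made up for illustration. It is not the script used to produce the numbers in this comment.

```python
from collections import Counter
from typing import Dict, Iterable

# Assumption: GAF 2.x layout, where column 7 (0-based index 6) is the evidence code.
EVIDENCE_COL = 6


def count_by_column(lines: Iterable[str], col: int = EVIDENCE_COL) -> Counter:
    """Count annotation rows per value of one tab-separated column."""
    counts: Counter = Counter()
    for line in lines:
        # Skip header/comment lines (start with "!") and blank lines.
        if line.startswith("!") or not line.strip():
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) > col:
            counts[fields[col]] += 1
    return counts


def diff_counts(goa: Counter, goc: Counter) -> Dict[str, int]:
    """Return GOC minus GOA for every key seen in either file."""
    keys = sorted(set(goa) | set(goc))
    return {key: goc.get(key, 0) - goa.get(key, 0) for key in keys}
```

With real files, each `Counter` would come from something like `count_by_column(open(path, encoding="utf-8"))`, where `path` is wherever the GOA or GOC file was saved. Passing `col=11` instead would tally GAF column 12 (DB Object Type), which is one way to get per-entity-type counts.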

OK:

pgaudet commented 2 weeks ago

Annotations by entity types:


Type | GOA | GOC | diff (GOC - GOA)
-- | -- | -- | --
misc_RNA | 13 |   | -13
gene_product |   | 23 | 23
protein | 75339 | 80840 | 5501
RNase_MRP_RNA | 3 | 3 | 0
RNase_P_RNA | 6 | 7 | 1
rRNA | 102 |   | -102
scRNA |   | 2 | 2
snoRNA | 38 | 41 | 3
snRNA | 39 | 64 | 25
sRNA | 26 |   | -26
tRNA | 84 | 766 | 682

Example missing tRNA: URS0000606E5E_352472

In GOC file, but not in GOA file. In QuickGO: https://www.ebi.ac.uk/QuickGO/annotations?geneProductId=URS0000606E5E_352472

The issue may be that RNAcentral has annotations to both the species and the strain for Dicty; see the QuickGO RNAcentral annotation stats:

image
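
One way to test the species-versus-strain idea would be to group RNAcentral IDs by sequence and look for sequences that appear under more than one taxon. The sketch below assumes IDs have the form `URS..._<taxon>` as in the example above. The function name `multi_taxon_sequences` is hypothetical, and the second ID in the usage note is invented for illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set


def multi_taxon_sequences(ids: Iterable[str]) -> Dict[str, Set[str]]:
    """Map each URS accession to its taxon suffixes, keeping only those with 2+ taxa."""
    taxa_by_urs: Dict[str, Set[str]] = defaultdict(set)
    for gp_id in ids:
        # Split "URS0000606E5E_352472" into ("URS0000606E5E", "352472").
        urs, sep, taxon = gp_id.partition("_")
        if sep:  # ignore IDs without a taxon suffix
            taxa_by_urs[urs].add(taxon)
    return {urs: taxa for urs, taxa in taxa_by_urs.items() if len(taxa) > 1}
```

For instance, `multi_taxon_sequences(["URS0000606E5E_352472", "URS0000606E5E_44689"])` would report `URS0000606E5E` with both taxon suffixes. Running it over the union of IDs from the two files would show how many RNAs are affected.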
pgaudet commented 2 weeks ago

There are also 56 UniProt IDs that cannot be mapped back to dictyBase IDs:

Actually, these are only IBAs (checked by comparing the data in Excel).

3 are reviewed, 53 are TrEMBL

P0DPA1 >> reviewed >> not in GOA GPI file but present in GOC GPI file
Q54VQ0 >> reviewed
Q9Y0C9 >> reviewed
Q54B01
Q54BT4
Q54DQ0
Q54DQ5
Q54E93
Q54EA4
Q54EC3
Q54EE1
Q54F27
Q54FB3
Q54GN0
Q54HF3
Q54HK4
Q54I11
Q54JP0
Q54LN8
Q54LU1
Q54M43
Q54M55
Q54M65
Q54NR9
Q54S21
Q54S28
Q54S41
Q54SC3
Q54TX0
Q54U93
Q54UH4
Q54VH4
Q54VT0
Q54WK8
Q54XX0
Q54Y46
Q552I0
Q553P9
Q555U6
Q556V2
Q55BV3
Q55DL6
Q55E24
Q55EB2
Q55EB5
Q55EQ5
Q55EY6
Q55EZ8
Q55F07
Q55FL2
Q869S2
Q86AG5
Q86H70
Q86HR8
Q86HT8
Q86L50
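
A list like this could be rebuilt by checking annotated UniProt accessions against a UniProt-to-dictyBase mapping. The sketch below assumes such a mapping is already available as a dictionary (for example, built from GPI cross-references). The function name and the dictyBase ID in the usage note are placeholders, not real data.

```python
from typing import Iterable, List, Mapping


def unmapped_accessions(
    annotated: Iterable[str],
    uniprot_to_dictybase: Mapping[str, str],
) -> List[str]:
    """Return the sorted, de-duplicated accessions that have no dictyBase ID."""
    return sorted(
        acc for acc in set(annotated) if acc not in uniprot_to_dictybase
    )
```

For example, `unmapped_accessions(["P0DPA1", "Q54VQ0"], {"Q54VQ0": "DDB_G0000000"})` would return `["P0DPA1"]`. Restricting `annotated` to rows with evidence code IBA would confirm whether only IBAs are affected.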

pgaudet commented 1 week ago