PROconsortium / PRoteinOntology

GPI terms that don't map to a gene #188

Open nataled opened 4 years ago

nataled commented 4 years ago

The following terms don't map to any gene, but should. Need to look into.

more PRO_GPI.dat | grep taxon:10090 | grep -v protein_complex | perl ~dnatale/get_col_by_char.pl '\t' 2 3 8 9 | grep -v MGI:MGI | more

P0C092-1 mKcnip3/iso:4
Q9Z223-1 mMOCS2L/iso:m2
Q9Z223-2 mMOCS2L/iso:m1
Q9Z223-3 mMOCS2L/iso:m3
000026286 mCox4i[1/2]
000027030 m(PFK[LMP]/iso:1)*1
000027101 fam:mTAB2-mTAB3
000027109 m(CBP-p300)
000029032 mouse protein
000034771 mH2-Aa
000036039 mK63polyUbiq
000049738 mCALM
P0DPD9 mEEF1AKMT4-ECE2
Q3ZRW6 mCLUL1
Q3ZRW6-1 mCLUL1/iso:1 PR:Q3ZRW6
P0DN34 mNdufb1 Ensembl:ENSMUSG00000113902
Q6PIU9 mLOC102637099 NCBIGene:102637099
A2KF29 mSmoktcr PR:000025201
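For anyone without access to the helper scripts under ~dnatale, the mouse pipeline above can be approximated in Python. This is only a sketch, not the script actually used. It assumes get_col_by_char.pl simply prints the requested 1-based columns of each delimited line, and the function name unmapped_mouse_terms is made up for illustration.

```python
def unmapped_mouse_terms(lines):
    """Yield (id, label, parent, xref) for mouse rows with no MGI gene xref.

    Approximates:
        grep taxon:10090 | grep -v protein_complex
        | <columns 2 3 8 9> | grep -v MGI:MGI
    """
    for line in lines:
        line = line.rstrip("\n")
        # The first two greps match anywhere on the raw line.
        if "taxon:10090" not in line or "protein_complex" in line:
            continue
        cols = line.split("\t")
        # Pad short rows so the 1-based columns 2, 3, 8 and 9 always exist.
        cols += [""] * (9 - len(cols))
        picked = (cols[1], cols[2], cols[7], cols[8])
        # The final grep -v only sees the extracted columns.
        if any("MGI:MGI" in field for field in picked):
            continue
        yield picked


# Example use (PRO_GPI.dat is the file named in the thread):
#     with open("PRO_GPI.dat") as handle:
#         for row in unmapped_mouse_terms(handle):
#             print(" ".join(field for field in row if field))
```

Like the original pipeline, this reports any mouse, non-complex row whose parent and xref columns carry no MGI identifier, whatever the reason for the gap.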

Generalized to other organisms, covering organism-families and unions:

perl fsw.pl wv organism-family | grep ^id: | perl ~dnatale/cut.pl 'id: PR:' | perl ~dnatale/grep_in_order.pl PRO_GPI.dat | perl ~dnatale/get_col_by_char.pl '\t' 2 9 | grep -v :

perl fsw.pl wv =union | grep ^id: | perl ~dnatale/cut.pl 'id: PR:' | perl ~dnatale/grep_in_order.pl PRO_GPI.dat | perl ~dnatale/get_col_by_char.pl '\t' 2 9 | grep -v :
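The two generalized pipelines can be sketched the same way. The behaviour of fsw.pl, cut.pl and grep_in_order.pl is not shown in this thread, so the code below rests on assumptions: the stanza input contains lines of the form "id: PR:...", and each id is looked up by exact match on column 2 of PRO_GPI.dat (the real grep_in_order.pl may match more loosely). Both function names are hypothetical.

```python
def ids_from_stanzas(stanza_lines):
    """Collect bare PRO ids from 'id: PR:...' lines, keeping input order."""
    prefix = "id: PR:"
    return [line[len(prefix):].strip()
            for line in stanza_lines if line.startswith(prefix)]


def terms_without_xref(ids, gpi_lines):
    """Return (id, xref) pairs for the given ids whose column 9 has no ':'."""
    xref_by_id = {}
    for line in gpi_lines:
        cols = line.rstrip("\n").split("\t")
        cols += [""] * (9 - len(cols))
        # Keep the first row seen for each id (column 2).
        xref_by_id.setdefault(cols[1], cols[8])
    # A populated xref looks like DB:ID, so "no colon" stands in for
    # "no gene mapping", as the trailing `grep -v :` does in the pipeline.
    return [(pro_id, xref_by_id[pro_id]) for pro_id in ids
            if pro_id in xref_by_id and ":" not in xref_by_id[pro_id]]
```

Feeding it the organism-family ids and then the union ids reproduces the two runs; the colon test relies on the ids in column 2 being bare, as they are in the listing above.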

hdrabkin commented 4 years ago

Hi @nataled, I have looked at all of these. The majority are grouping terms/organism-family terms. We don't have an equivalent, and in our GPI for Noctua there needs to be a 1:1 correspondence.

P0C092-1 mKcnip3/iso:4 P0C092 does not map to a mouse gene in MGI; will investigate.

Q9Z223-1 mMOCS2L/iso:m2 Q9Z223 DOES map to Mocs2; it does not map to Mocs2l. In fact, there IS no Mocs2l in mouse.
Q9Z223-2 mMOCS2L/iso:m1 ditto
Q9Z223-3 mMOCS2L/iso:m3 ditto

000026286 mCox4i[1/2] Maps to 2 genes (protein that is a translation product of either the mouse Cox4i1 gene or the mouse Cox4i2 gene); MGI has no grouping terms.

000027030 m(PFK[LMP]/iso:1)*1 Multiple genes: Pfk of liver, muscle, or platelet (Pfkl, Pfkm, Pfkp). MGI has no grouping terms.

000027101 fam:mTAB2-mTAB3 "mitogen-activated protein kinase kinase kinase 7-interacting protein 2/3 (mouse)" family; this would map to two genes, Tab2 and Tab3.

000027109 m(CBP-p300) Term meaning a translation product of the mouse CREBBP or EP300 genes.

000029032 mouse protein This term brings back ALL mouse proteins in PRO; we won’t map this

000034771 mH2-Aa This PRO term is an organism-FAMILY term. We don’t do these. But H2-Aa IS a protein-coding gene. The UniProt id P14434 is not in PRO; it would be a child of this term if it existed.

000036039 mK63polyUbiq Another organism-family term; there are MANY genes in MGI annotated to protein K63-linked deubiquitination.

000049738 mCALM Organism family; includes Calm1, 2, and 3. The individual entries are associated (there are 3 different UniProts, each mapping to a different gene).

P0DPD9 mEEF1AKMT4-ECE2 P0DPD9 maps to Gm49333 in MGI

Q3ZRW6 mCLUL1 Q3ZRW6 does not map to an MGI gene. Will investigate.
Q3ZRW6-1 mCLUL1/iso:1 PR:Q3ZRW6 ditto

P0DN34 mNdufb1 Ensembl:ENSMUSG00000113902 Maps to Ndufb1 in MGI; the Ensembl id is NOT one of the ones we have (we have ENSMUSTs but no ENSMUSGs).

Q6PIU9 mLOC102637099 NCBIGene:102637099 Maps to Aak1 in MGI, not the LOC.

A2KF29 mSmoktcr PR:000025201 A2KF29 does not map to an MGI gene; will investigate.
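The 1:1 requirement mentioned at the top of this comment could be checked mechanically. The sketch below is a rough illustration, not something either group is known to run. It assumes the gene xrefs sit in column 9 of the GPI file, that multiple xrefs are pipe-separated, and that MGI xrefs start with "MGI:" (consistent with the MGI:MGI filter used earlier in the thread). The function name is hypothetical.

```python
from collections import defaultdict


def classify_by_mgi_count(gpi_lines):
    """Bucket column-2 ids by how many MGI xrefs their row carries."""
    buckets = defaultdict(list)
    for line in gpi_lines:
        # Skip blank lines and '!' header/comment lines.
        if not line.strip() or line.startswith("!"):
            continue
        cols = line.rstrip("\n").split("\t")
        cols += [""] * (9 - len(cols))
        mgi_xrefs = [x for x in cols[8].split("|") if x.startswith("MGI:")]
        if not mgi_xrefs:
            key = "none"
        elif len(mgi_xrefs) == 1:
            key = "one"
        else:
            key = "many"
        buckets[key].append(cols[1])
    return dict(buckets)
```

Rows landing in "many" would correspond to grouping or organism-family terms, and rows in "none" to the cases still needing investigation; only "one" satisfies a strict 1:1 mapping.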

hdrabkin commented 4 years ago

Special case: P0DPD9 mEEF1AKMT4-ECE2. There is NO EEF1AKMT4-ECE2 in mouse; there are two separate genes:

Ece2, endothelin converting enzyme 2, Chr 16: 20629851-20645915
Eef1akmt4, EEF1A lysine methyltransferase 4, Chr 16: 20611601-20618869 (+)

nataled commented 4 years ago

@hdrabkin you are attempting to discern genes from the short label. Do not do this, as the short label is based on the human ortholog. For example, the Mocs2 isoforms (Q9Z223 set) don't map to any gene due to some issue I need to track down, but Q9Z223 itself maps to the very gene you indicate.

nataled commented 4 years ago

Some headway made on these. Here's the current status.

I tweaked a few entries, including a few unconnected ones (where the indicated term was not connected to its potential children), and tweaked my code to handle some relatively rare cases, producing the following:

P0C092-1 mKcnip3/iso:4                     == fixed; tweaked code
Q9Z223-1 mMOCS2L/iso:m2                    == fixed; tweaked code
Q9Z223-2 mMOCS2L/iso:m1                    == fixed; tweaked code
Q9Z223-3 mMOCS2L/iso:m3                    == fixed; tweaked code
000026286 mCox4i[1/2]
000027030 m(PFK[LMP]/iso:1)*1              == fixed; tweaked entry
000027101 fam:mTAB2-mTAB3                  == fixed; tweaked unconnected entry
000027109 m(CBP-p300)                      == fixed; tweaked unconnected entry
000029032 mouse protein                == unclear how to handle
000034771 mH2-Aa                           == fixed; added gene
000036039 mK63polyUbiq                 == unclear how to handle
000049738 mCALM                            == fixed; tweaked code
P0DPD9 mEEF1AKMT4-ECE2                     == fixed; Note that UniProtKB has this mapped to the wrong gene (Ece2)
Q3ZRW6 mCLUL1                              ==   AWAITS HD INVESTIGATION
Q3ZRW6-1 mCLUL1/iso:1 PR:Q3ZRW6            ==   AWAITS HD INVESTIGATION
P0DN34 mNdufb1 Ensembl:ENSMUSG00000113902  == fixed; UniProtKB previously lacked the MGI gene
Q6PIU9 mLOC102637099 NCBIGene:102637099    == fixed; MGI missing from UniProtKB; hand edited
A2KF29 mSmoktcr PR:000025201               ==   AWAITS HD INVESTIGATION

hdrabkin commented 4 years ago

For Q3ZRW6 mCLUL1, here is what I have thus far. Mouse does not have a clusterin-like gene. Rat does, but it has no mouse ortholog. Running NCBI BLAST against the whole mouse reference proteome set, the highest hit is 63%, to "LOW QUALITY PROTEIN: clusterin-like protein 1 [Mus caroli]" (Sequence ID: XP_021038041.1, Length: 30). Any hits to Mus musculus are to XPs belonging to Clu, and those are low (<63%) matches.

However, running an alignment of the two with the UniProt Align tool does give an alignment, but with only 21.47% identity. There is only one reference for this 'reviewed' protein, a GenBank submission: submitted (JUN-2004) to the EMBL/GenBank/DDBJ databases. Cited for: NUCLEOTIDE SEQUENCE [MRNA].

Looking that up, there IS a reference, PMID:10675623, attached to the EMBL nucleotide entry (AAT81477). MGI has this reference, but it is NOT associated with any gene in MGI; it is just in the DB, not indexed to any gene. The reference is now 20 years old (2000). I'm going to ask our nomenclature expert to trace it if possible.