Why are MGI assoc files a mix of gene and protein?

cmungall commented 9 years ago

E.g. in the MGI GAF

MGI     MGI:1917015     1500004F05Rik           GO:0008150      MGI:MGI:2156816|GO_REF:0000015  ND              P       RIKEN cDNA 1500004F05 gene              gene    taxon:10090     20120430        MGI             
MGI     MGI:1923755     1500009C09Rik           GO:0003674      MGI:MGI:2156816|GO_REF:0000015  ND              F       RIKEN cDNA 1500009C09 gene              protein taxon:10090     20100209        MGI             VEGA:OTTMUSP00000045521

What makes the 2nd one a protein and the 1st a gene?

The page on JAX is kind of odd http://www.informatics.jax.org/marker/MGI:1923755

"Feature Type protein coding gene"

Yet it's an ortholog of a lincRNA

I guess that's what conflict means

Looks like if there is a conflict, the field in the GAF ends up being 'protein' rather than 'gene'. But this is weird, as the conflict apparently arises from the fact that this is an ncRNA...?

Either way: don't trust the type field in the MGI GAF
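For anyone who wants to check which type values a GAF actually carries, here is a minimal sketch. It assumes a tab-separated GAF 2.x file in which column 12 is DB Object Type; the helper names and the `SAMPLE` list are made up for illustration, with the two records rebuilt from the lines quoted above.

```python
from collections import Counter

# Assumption: GAF 2.x layout, where column 12 (index 11) is DB Object Type.
DB_OBJECT_TYPE_COL = 11


def object_types(lines):
    """Yield (object_id, object_type) for each annotation line."""
    for line in lines:
        # Skip blank lines and '!' header/comment lines.
        if not line.strip() or line.startswith("!"):
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) <= DB_OBJECT_TYPE_COL:
            continue  # malformed or truncated line
        yield cols[1], cols[DB_OBJECT_TYPE_COL]


def tally_types(lines):
    """Count how often each DB Object Type value occurs."""
    return Counter(obj_type for _, obj_type in object_types(lines))


# The two records from this ticket, rebuilt as 17 tab-separated fields.
SAMPLE = [
    "!gaf-version: 2.0",
    "\t".join(["MGI", "MGI:1917015", "1500004F05Rik", "", "GO:0008150",
               "MGI:MGI:2156816|GO_REF:0000015", "ND", "", "P",
               "RIKEN cDNA 1500004F05 gene", "", "gene", "taxon:10090",
               "20120430", "MGI", "", ""]),
    "\t".join(["MGI", "MGI:1923755", "1500009C09Rik", "", "GO:0003674",
               "MGI:MGI:2156816|GO_REF:0000015", "ND", "", "F",
               "RIKEN cDNA 1500009C09 gene", "", "protein", "taxon:10090",
               "20100209", "MGI", "", "VEGA:OTTMUSP00000045521"]),
]

if __name__ == "__main__":
    for object_id, object_type in object_types(SAMPLE):
        print(object_id, object_type)
    print(tally_types(SAMPLE))
```

Running the same tally over a whole file (for example `tally_types(open(path))`) gives a per-type count that can be compared between releases.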

hdrabkin commented 9 years ago

The UniProt ids associated with 1500009C09Rik are all "unreviewed". Marker subtype is determined based on load scripts from three sources: Vega, Ensembl, and NCBI. Because long ncRNAs are located and transcribed within the intergenic stretches, the majority are transcribed as complex, interlaced networks of overlapping sense and antisense transcripts that often include protein-coding genes. The complexity of these foci frustrates easy evaluation. Note: two out of three sequence providers say this is a protein coding gene. Two of the sequences are predicted to be about 70-80 aa long and begin with methionine; one has a longer reading frame but no methionine start. The first Riken id you use, 1500004F05Rik, has no UniProt id associated, whereas the second one has the three UniProt sequences. 1500004F05Rik is an unclassified gene (none of the sequence providers could make a call).

Only NCBI lists it as ncRNA; Vega and Ensembl call it a protein coding gene. In MGI it is a protein coding gene, MGI:1923755 (based on Ensembl + Vega).

VEGA Gene Model       PUTATIVE_protein_coding   OTTMUSG00000033544
Ensembl Gene Model    protein_coding            ENSMUSG00000068099
NCBI Gene Model       ncRNA                     76505

Often when we get new build data, some of these weird ones get reclassified. Also, if you look at our gp2rna.mgi file, some of these ids have UniProt ids associated with the gene.
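To make the idea of a biotype conflict concrete, here is a small sketch that flags a marker when the providers' gene model calls disagree. It is not MGI's load script: the function names are hypothetical, and treating VEGA's `PUTATIVE_protein_coding` as equivalent to `protein_coding` is an assumption made only so that the three calls listed above can be compared.

```python
def normalize_biotype(label):
    """Lower-case a biotype label and drop a leading 'putative_'.

    Assumption for this sketch only: PUTATIVE_protein_coding and
    protein_coding count as the same call.
    """
    label = label.strip().lower()
    prefix = "putative_"
    if label.startswith(prefix):
        label = label[len(prefix):]
    return label


def has_biotype_conflict(calls):
    """Return True if the providers do not all make the same call.

    `calls` maps provider name -> biotype label.
    """
    return len({normalize_biotype(v) for v in calls.values()}) > 1


# Gene model calls for MGI:1923755 as listed in this comment.
calls_1923755 = {
    "VEGA": "PUTATIVE_protein_coding",
    "Ensembl": "protein_coding",
    "NCBI": "ncRNA",
}

if __name__ == "__main__":
    print(has_biotype_conflict(calls_1923755))
```

A check like this only says that the sources disagree. It does not decide which type ends up on the marker.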

pgaudet commented 2 years ago

In the current release candidate, we have

protein: 202
gene_product: 288
gene: 386
protein_coding_gene: 21790

From @hdrabkin, the difference between 'protein' and 'protein-coding gene':

'protein' means the annotation is directly to a PRO id, a proteoform.
'protein-coding gene' means the annotation is directly to the protein-coding gene.

@hdrabkin can you explain what 'gene' and 'gene_product' are?

hdrabkin commented 2 years ago

In the amigo staging site the "gene_product" types seem to have only ND annotations and have gene names with 'opposite strand'. Several have 'antisense lncRNA gene' or other 'antisense' biotypes, but they also have biotype conflicts. Many are 'predicted genes'.

hdrabkin commented 2 years ago

When we looked at the staging amigo site last time you asked, everything looked fine. https://amigo-staging.geneontology.io/amigo/search/annotation

hdrabkin commented 2 years ago

This ticket is really old. @ukemi could you clarify what we do now?

ukemi commented 2 years ago

We use the SO identifier for the object that has been annotated. If the object is represented by an MGI identifier, it gets assigned the appropriate SO type for the identifier's classification in MGI. If it is a PRO id, the annotation is specifically to a protein. The types are quite literal and are assigned by the MGI sequence curation team. A biotype conflict indicates that different sequence curation groups have curated the sequences associated with a given marker in different ways, but we take the decision made by the MGI group.
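The rule described here could be sketched roughly as follows. This is an illustration, not the actual MGI export code. Several things are assumptions: that PRO ids carry a `PR:` prefix, the contents of the `FEATURE_TYPE_TO_LABEL` table, the function name, and the placeholder PRO id. The two MGI feature types are the ones mentioned in this ticket.

```python
# Hypothetical mapping from an MGI feature type to the type label in the GAF.
# Only the two feature types discussed in this ticket are included.
FEATURE_TYPE_TO_LABEL = {
    "protein coding gene": "protein_coding_gene",
    "unclassified gene": "gene",
}


def assign_object_type(object_id, mgi_feature_types):
    """Return the type label for an annotated object, or None if unknown.

    `mgi_feature_types` maps an MGI id to the feature type chosen by the
    MGI sequence curation team (a stand-in for a real database lookup).
    """
    # Assumption: PRO identifiers start with 'PR:'.
    if object_id.startswith("PR:"):
        return "protein"
    if object_id.startswith("MGI:"):
        feature_type = mgi_feature_types.get(object_id)
        return FEATURE_TYPE_TO_LABEL.get(feature_type)
    return None


# Feature types as described earlier in this ticket.
mgi_feature_types = {
    "MGI:1923755": "protein coding gene",
    "MGI:1917015": "unclassified gene",
}

if __name__ == "__main__":
    # 'PR:000000000' is a placeholder id, not a real proteoform.
    for oid in ("MGI:1923755", "MGI:1917015", "PR:000000000"):
        print(oid, assign_object_type(oid, mgi_feature_types))
```

The point of the sketch is that the type follows the identifier's classification in MGI, so a biotype conflict among other providers does not change the result.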

pgaudet commented 2 years ago

Thanks, the explanation of protein vs. protein-coding gene was clear. I am still not clear on what 'gene' and 'gene_product' correspond to; what I understand from @ukemi's comment is that these cannot be assigned more precisely.

'Gene' seems odd; presumably any function annotated in GO would at least be to a gene product?

ukemi commented 2 years ago

Have a look at them in MGI. Over on the right hand side, see what kind of feature types they are.

ukemi commented 2 years ago

It looks like many of the 'genes' are predicted genes that are essentially placeholders and get the root annotations for now. They may disappear or get promoted, but sometimes people identify them in their analyses. In MGI they are 'unclassified genes'.

pgaudet commented 2 years ago

ok thanks! I think this can close.

geneontology / neo

Why are MGI assoc files a mix of gene and protein? #3