Closed cmungall closed 2 years ago
The UniProt ids associated with t1500009C09Rik are all "unreviewed". Marker subtype is determined based on load scripts from three sources: Vega, Ensembl, and NCBI. Because long ncRNAs are located and transcribed within the intergenic stretches, the majority are transcribed as complex, interlaced networks of overlapping sense and antisense transcripts that often include protein-coding genes. The complexity of these loci frustrates easy evaluation. Note: Two out of three sequence providers say this is a protein coding gene. Two of them are predicted to be about 70-80 aa long and begin with methionine, while one has a longer reading frame but no methionine start. The first Riken id you use, 1500004F05Rik, has no UniProt associated, whereas the second one has the three UniProt sequences. 1500004F05Rik is an unclassified gene (none of the sequence providers could make a call).
Only NCBI lists it as ncRNA; Vega and Ensembl both call it a protein coding gene. In MGI it is a protein coding gene, MGI:1923755 (based on Ensembl + Vega).
VEGA Gene Model: PUTATIVE_protein_coding, OTTMUSG00000033544
Ensembl Gene Model: protein_coding, ENSMUSG00000068099
NCBI Gene Model: ncRNA, 76505
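To make the "two out of three" point concrete, here is a minimal Python sketch of how a biotype conflict between the three providers could be detected. The provider calls are the ones quoted above; the `normalize` rule and the majority vote are assumptions made for illustration only, not MGI's actual load logic.

```python
from collections import Counter
from typing import Dict

# Biotype calls for MGI:1923755 as quoted in this thread.
provider_biotypes = {
    "VEGA": "PUTATIVE_protein_coding",
    "Ensembl": "protein_coding",
    "NCBI": "ncRNA",
}

def normalize(biotype: str) -> str:
    """Hypothetical normalization: lowercase and drop a 'putative_' prefix."""
    lowered = biotype.lower()
    prefix = "putative_"
    return lowered[len(prefix):] if lowered.startswith(prefix) else lowered

def has_biotype_conflict(calls: Dict[str, str]) -> bool:
    """True when the providers do not all agree after normalization."""
    return len({normalize(b) for b in calls.values()}) > 1

def majority_biotype(calls: Dict[str, str]) -> str:
    """Most common normalized call (illustrative; not MGI's real rule)."""
    counts = Counter(normalize(b) for b in calls.values())
    return counts.most_common(1)[0][0]

print(has_biotype_conflict(provider_biotypes))
print(majority_biotype(provider_biotypes))
```

With these inputs the sketch flags a conflict and the majority call is protein coding, matching the description above. A simple vote is only one possible policy.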
Oftentimes when we get new build data, some of these weird ones get reclassified. Also, if you look at our gp2rna.mgi file, some of these ids have UniProt ids associated with the gene.
In the current release candidate, we have
From @hdrabkin The difference between ‘protein’ and 'protein-coding gene':
@hdrabkin can you explain what are genes and gene products?
In the amigo staging site the "gene_product" types seem to have only ND annotations and have gene names with 'opposite strand'. Several have 'antisense lncRNA gene' or other 'antisense' biotypes, but they also have biotype conflicts. Many are 'predicted genes'.
When we looked at the staging amigo site last time you asked, everything looked fine. https://amigo-staging.geneontology.io/amigo/search/annotation
This ticket is really old. @ukemi could you clarify what we do now.
We use the SO identifier for the object that has been annotated. If the object is represented by an MGI identifier, it gets assigned the appropriate SO type for the identifier's classification in MGI. If it is a PRO id, the annotation is specifically to a protein. The types are quite literal and are assigned by the MGI sequence curation team. A biotype conflict indicates that different sequence curation groups have curated the sequences associated with a given marker in different ways, but we take the decision made by the MGI group.
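The assignment rule described here can be sketched roughly as follows. This is only an illustration: the `FEATURE_TYPE_TO_LABEL` table, the plain-text labels used in place of real SO identifiers, the `PR:` prefix check, and the fallback to `gene` are all assumptions, not the actual MGI export code.

```python
from typing import Optional

# Hypothetical lookup from an MGI feature type to a type label.
# Real exports use SO identifiers; plain labels are used here on purpose.
FEATURE_TYPE_TO_LABEL = {
    "protein coding gene": "protein_coding_gene",
    "lncRNA gene": "lncRNA_gene",
    "unclassified gene": "gene",
}

def object_type(object_id: str, mgi_feature_type: Optional[str] = None) -> str:
    """Return a type label for an annotated object.

    PRO ids are always typed as protein; MGI ids take the type implied
    by the feature type assigned by MGI curators (assumed fallback: gene).
    """
    if object_id.startswith("PR:"):
        return "protein"
    if object_id.startswith("MGI:"):
        return FEATURE_TYPE_TO_LABEL.get(mgi_feature_type or "", "gene")
    raise ValueError(f"unrecognized identifier: {object_id}")

print(object_type("MGI:1923755", "protein coding gene"))
print(object_type("PR:000000001"))  # placeholder PRO id
```

Note that the provider biotypes play no role in this sketch: only MGI's own feature type is consulted, which is the point made about biotype conflicts.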
Thanks, the explanation of the difference between protein and protein-coding was clear. I am still not clear what genes and gene products correspond to; what I understand from @ukemi's comment is that these cannot be assigned more precisely.
'Gene' seems odd; presumably any function annotated in GO would at least be to a gene product?
Have a look at them in MGI. Over on the right hand side, see what kind of feature types they are.
It looks like many of the 'genes' are predicted genes that are essentially placeholders and get the root annotations for now. They may disappear or get promoted, but sometimes people identify them in their analyses. In MGI they are 'unclassified genes'.
ok thanks! I think this can close.
E.g. in the MGI GAF
What makes the 2nd one a protein and the 1st a gene?
The page on JAX is kind of odd http://www.informatics.jax.org/marker/MGI:1923755
"Feature Type protein coding gene"
Yet it's an ortholog of a lincRNA
I guess that's what conflict means
Looks like if there is a conflict, the field in GO ends up as 'protein' rather than 'gene'. But this is weird, as the conflict apparently arises from the fact that this is ncRNA...?
Either way: don't trust the type field in the MGI GAF
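For anyone who wants to check how the type field is populated before relying on it, a small Python sketch like the one below tallies the DB Object Type column (column 12 in GAF 2.x) from tab-separated GAF lines. The sample rows use placeholder ids and symbols and are not taken from the real MGI GAF.

```python
from collections import Counter
from typing import Iterable

def tally_object_types(lines: Iterable[str]) -> Counter:
    """Count values of GAF column 12 (DB Object Type), skipping '!' comments."""
    counts: Counter = Counter()
    for line in lines:
        if not line.strip() or line.startswith("!"):
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 12:
            continue  # malformed row; ignore in this sketch
        counts[cols[11]] += 1
    return counts

def make_row(object_id: str, symbol: str, object_type: str) -> str:
    """Build a 17-column GAF-like row with placeholder values."""
    cols = ["MGI", object_id, symbol, "", "GO:0008150", "REF:placeholder",
            "ND", "", "P", "", "", object_type, "taxon:10090",
            "20200101", "MGI", "", ""]
    return "\t".join(cols)

sample = [
    "!gaf-version: 2.2",
    make_row("MGI:0000001", "PlaceholderA", "gene"),
    make_row("MGI:0000002", "PlaceholderB", "protein"),
    make_row("MGI:0000003", "PlaceholderC", "protein"),
]

print(tally_object_types(sample))
```

Running the same function over the lines of an actual GAF file (for example via `open(path)`) gives a quick overview of which type labels appear and how often.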