geneontology / go-annotation

This repository hosts the tracker for issues pertaining to GO annotations.
BSD 3-Clause "New" or "Revised" License

Can we reduce the number of allowed entity types? #2740

Open ValWood opened 8 years ago

ValWood commented 8 years ago

misleading

If we want types (which are useful), maybe we should all specify protein, ncRNA, miRNA, etc. and ditch "gene_product" (if you don't know what type it is, then you probably wouldn't be annotating it).

cmungall commented 8 years ago

OK, I think we'll have to break this down:

First we need better QC and standards for data coming in. E.g. the weird gene_product category is being used for RNAs in ZFIN. E.g.

We can try to auto-clean this during the load, but it is better to fix it upstream.
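A minimal sketch of the kind of load-time check described above, assuming GAF 2.x input in which the DB Object Type is the 12th tab-separated column. The allow-list and the function name `find_bad_types` are hypothetical and chosen only for illustration; the thread has not agreed on a set of allowed types.

```python
from collections import Counter

# Hypothetical allow-list for illustration only; not an agreed GO standard.
ALLOWED_TYPES = {"protein", "ncRNA", "miRNA", "tRNA", "rRNA", "protein_complex"}

DB_OBJECT_TYPE_COL = 11  # zero-based index of GAF column 12 (DB Object Type)


def find_bad_types(gaf_lines, allowed=ALLOWED_TYPES):
    """Count DB Object Type values that are not in the allowed set."""
    bad = Counter()
    for line in gaf_lines:
        if line.startswith("!") or not line.strip():
            continue  # skip GAF header/comment lines and blank lines
        cols = line.rstrip("\n").split("\t")
        if len(cols) <= DB_OBJECT_TYPE_COL:
            bad["<missing>"] += 1  # row too short to carry a type at all
            continue
        entity_type = cols[DB_OBJECT_TYPE_COL].strip()
        if entity_type not in allowed:
            bad[entity_type] += 1
    return bad
```

Run per contributing group, a report like this could be sent back upstream rather than silently rewriting values during the load.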

Second, we need to get this into an ontology structure so that we can choose the subset we want for the groupings (as we did for taxon, evidence). It's not quite as straightforward as using SO (complex out of scope).

How about this hierarchy:

Note protein-or-protein-coding would lump UniProt GCRPs and MOD genes together. I think this is a good thing; the current protein category is not a meaningful discriminator.

ValWood commented 8 years ago

I like that...

cmungall commented 7 years ago

Another example of inconsistent usage: https://github.com/geneontology/go-annotation/issues/1554

pgaudet commented 4 years ago

from @lpalbou

Looking at the stats:

[Image: Screen Shot 2019-07-17 at 3 17 40 PM]

About 1/5th of our bioentities are described as generic gene_products without detailing whether they are proteins, genes, or RNA. We discussed this with @thomaspd, and it seems to concern mostly AspGD and CGD bioentities, but also TAIR, FlyBase, and GO_Central to a lesser extent:

[Image: Screen Shot 2019-07-17 at 3 23 43 PM]

pgaudet commented 4 years ago

From @thomaspd

I think we should discuss with these annotation groups to encourage them to either label them all as type "gene", or be more precise about the type of product ("protein" or "RNA") like the other groups. @pgaudet what do you think?

pgaudet commented 4 years ago

@pgaudet I don't like 'gene' because I think we do want to make the distinction between a gene and a gene product, for example when we annotate targets of transcription factors.

However, it should be doable for the different groups to provide more precise information - @tberardini @marekskrzypek @hatrill: would it be possible for you to change?

I am curious to know what the GOC annotations to gene_product are. Are we using the same type provided by the original database?

Thanks, Pascale

@marekskrzypek

I suppose it should be possible, once a decision is reached.

@tberardini

We use 'gene_product' for cases where we are annotating genetic loci (uncloned genes). They are most likely protein coding genes but we don't know that for sure and therefore want to hedge our bets and annotate to 'gene_product' instead.

tberardini commented 4 years ago

However, it should be doable for the different groups to provide more precise information - @tberardini @marekskrzypek @hatrill: would it be possible for you to change?

As stated before, we use 'gene_product' for cases where we are annotating genetic loci (uncloned genes). They are most likely protein coding genes but we don't know that for sure and therefore want to hedge our bets and annotate to 'gene_product' instead.

When a gene is cloned and corresponds to a formerly uncloned locus, we are able to update to a more specific entity type. Without more information, we cannot be more precise.

deustp01 commented 4 years ago

Tanya's comment highlights the point that the issue here can be one of missing information: there can be rock-solid experimental evidence that mutating a gene specifically disrupts a normal function or process, yet that evidence provides no clues to a molecular mechanism. In a world with sufficient resources, the terms in the initial comment that started this ticket could be made into an ontology (RNA is_a gene product; tRNA is_a RNA, etc.), and curators could be provided with documentation and examples to encourage consistent use of the leafiest available term consistent with the experimental evidence.
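To illustrate the is_a idea, here is a small sketch that encodes only the two links named in this comment and rolls a specific type up to a chosen grouping term. The `gene_product` label for the root and the names `ancestors` and `roll_up` are assumptions made for the example, not an existing GO tool or an agreed hierarchy.

```python
# Only the two is_a links named in the comment; "gene_product" is the assumed root label.
IS_A = {
    "RNA": "gene_product",
    "tRNA": "RNA",
}


def ancestors(term, is_a=IS_A):
    """Return the chain of parents from term up to the root, nearest first."""
    chain = []
    seen = {term}
    while term in is_a:
        term = is_a[term]
        if term in seen:
            raise ValueError(f"cycle detected at {term!r}")
        seen.add(term)
        chain.append(term)
    return chain


def roll_up(term, grouping_terms, is_a=IS_A):
    """Map a term to the nearest term (itself or an ancestor) in grouping_terms."""
    for candidate in [term] + ancestors(term, is_a):
        if candidate in grouping_terms:
            return candidate
    return None  # term is outside every requested grouping
```

With such a structure, a curator could annotate to `tRNA` when the evidence supports it, fall back to `gene_product` for an uncloned locus, and display code could still group both under whatever subset is chosen.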

suzialeksander commented 4 months ago

From @kltm on Mar 21 https://github.com/geneontology/go-site/issues/2246

These are the counts and currently used values in post-QC GAFs

    570 antisense_lncRNA
      1 antisense_lncRNA_gene
   6262 antisense_RNA
    188 autocatalytically_spliced_intron
 543618 gene
 362496 gene_product
    170 gene_segment
      4 guide_RNA
   2562 hammerhead_ribozyme
      2 lincRNA
 132472 lncRNA
     23 lnc_RNA
     18 lncRNA_gene
  32965 miRNA
      2 miRNA_gene
1719451 misc_RNA
  47234 mRNA
 201054 ncRNA
   8172 other
    464 piRNA
   2923 precursor_RNA
 147790 pre_miRNA
1357853612 protein
 406877 protein_coding_gene
  22567 protein_complex
   1005 pseudogene
     72 pseudogenic_transcript
   3606 ribozyme
    434 RNA
   5547 RNase_MRP_RNA
 306437 RNase_P_RNA
12871329 rRNA
    192 scaRNA
    268 scRNA
      1 scRNA_gene
     15 siRNA
      1 sncRNA
 362852 snoRNA
 871408 snRNA
 170884 sRNA
 195433 SRP_RNA
   1053 telomerase_RNA
 103559 tmRNA
      2 transposable_element_gene
10232300 tRNA
      9 tRNA_gene
      2 uORF
      8 vault_RNA
      6 Y_RNA
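
A list like this can be screened mechanically for values that differ only in spelling, such as `lncRNA` and `lnc_RNA` above. The sketch below assumes the counts are available as plain `count label` lines; the function names and the normalisation rule (ignore case and underscores) are illustrative assumptions, not part of the GO pipeline.

```python
from collections import defaultdict


def parse_counts(text):
    """Parse 'count label' lines (as produced by `uniq -c`) into a dict."""
    counts = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) != 2:
            continue  # ignore blank or malformed lines
        count, label = parts
        counts[label] = int(count)
    return counts


def spelling_variants(counts):
    """Group labels that are identical once case and underscores are ignored."""
    groups = defaultdict(list)
    for label in counts:
        groups[label.lower().replace("_", "")].append(label)
    return [sorted(labels) for labels in groups.values() if len(labels) > 1]


# A few rows copied from the list above.
sample = """
 132472 lncRNA
     23 lnc_RNA
  32965 miRNA
  47234 mRNA
"""
counts = parse_counts(sample)
print(spelling_variants(counts))
```

This only catches trivial variants; pairs such as a type and its `_gene` counterpart would still need a curated mapping.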