Open ValWood opened 8 years ago
OK, I think we'll have to break this down:
First we need better QC and standards for data coming in. E.g. the weird gene_product category is being used for RNAs in ZFIN (e.g. gene_product in GAF). We can try to auto-clean this during the load, but it is better to fix it upstream.
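A minimal sketch of what auto-cleaning during the load could look like, assuming GAF 2.x tab-separated lines where the DB Object Type is column 12. The `TYPE_FIXES` mapping and the function name are hypothetical; the actual replacements would have to be agreed with the contributing groups.

```python
# Hypothetical fix table: which spelling is canonical is an assumption here.
TYPE_FIXES = {
    "lnc_RNA": "lncRNA",
}

DB_OBJECT_TYPE_COL = 11  # column 12 of a GAF 2.x line, zero-based


def clean_gaf_line(line, fixes=TYPE_FIXES):
    """Return (line, changed); the DB Object Type is rewritten if a fix is known."""
    # Header/comment lines start with "!" and are passed through untouched.
    if line.startswith("!") or not line.strip():
        return line, False
    cols = line.rstrip("\n").split("\t")
    if len(cols) <= DB_OBJECT_TYPE_COL:
        return line, False  # malformed line: leave it for other QC checks
    old = cols[DB_OBJECT_TYPE_COL]
    new = fixes.get(old, old)
    if new == old:
        return line, False
    cols[DB_OBJECT_TYPE_COL] = new
    newline = "\n" if line.endswith("\n") else ""
    return "\t".join(cols) + newline, True
```

A generic value such as gene_product cannot be fixed this way, because the correct type is not recoverable from the line itself; that is one reason fixing upstream is preferable.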
Second, we need to get this into an ontology structure so that we can choose the subset we want for the groupings (as we did for taxon, evidence). It's not quite as straightforward as using SO (complex out of scope).
How about this hierarchy:
Note protein-or-protein-coding would lump UniProt GCRPs and MOD genes together. I think this is a good thing; the current protein category is not a meaningful discriminator.
I like that...
Another example of inconsistent usage: https://github.com/geneontology/go-annotation/issues/1554
from @lpalbou
Looking at the stats:
About 1/5th of our bioentities are described as generic gene_products without detailing whether they are proteins, genes or RNA. We discussed this with @thomaspd and it seems it concerns mostly AspGD and CGD bioentities, but also TAIR, FlyBase and GO_Central to a lesser extent:
From @thomaspd
I think we should discuss with these annotation groups to encourage them to either label them all as type "gene", or be more precise about the type of product ("protein" or "RNA") like the other groups. @pgaudet what do you think?
@pgaudet I don't like 'gene' because I think we do want to make the distinction between a gene and a gene product, for example when we annotate targets of transcription factors.
However it should be doable for the different groups to provide more precise information - @tberardini @marekskrzypek @hatrill: would it be possible for you to change?
I am curious to know what the GOC annotations to gene_product are. Are we using the same type provided by the original database?
Thanks, Pascale
@marekskrzypek
I suppose it should be possible, once a decision is reached.
@tberardini
We use 'gene_product' for cases where we are annotating genetic loci (uncloned genes). They are most likely protein coding genes but we don't know that for sure and therefore want to hedge our bets and annotate to 'gene_product' instead.
Quoting @pgaudet: "However it should be doable for the different groups to provide more precise information - @tberardini @marekskrzypek @hatrill: would it be possible for you to change?"
As stated before, we use 'gene_product' for cases where we are annotating genetic loci (uncloned genes). They are most likely protein coding genes but we don't know that for sure and therefore want to hedge our bets and annotate to 'gene_product' instead.
When a gene is cloned and corresponds to a formerly uncloned locus, we are able to update to a more specific entity type. Without more information, we cannot be more precise.
Tanya's comment highlights the point that the issue here can be one of missing information: there's rock-solid experimental evidence that mutating a gene specifically disrupts a normal function or process while providing no clues for a molecular mechanism. In a world with sufficient resources, the terms in the initial comment that started this ticket could be made into an ontology (RNA is_a gene product; tRNA is_a RNA, etc.) and curators could be provided with documentation and examples to encourage consistent use of the leafiest available term that the experimental evidence supports.
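To make the is_a idea concrete, here is a small sketch of how such a hierarchy could be used to roll a specific type up into a chosen grouping subset. Only "RNA is_a gene product" and "tRNA is_a RNA" come from the comment above; the other links, the `IS_A` table and the function names are hypothetical illustrations, not an agreed ontology.

```python
# Child -> parent links. Only RNA -> gene_product and tRNA -> RNA are taken
# from the thread; the remaining entries are illustrative assumptions.
IS_A = {
    "RNA": "gene_product",
    "tRNA": "RNA",
    "protein": "gene_product",
    "ncRNA": "RNA",
    "miRNA": "ncRNA",
}


def ancestors(term, is_a=IS_A):
    """Yield the term itself, then each parent up to the root."""
    seen = set()
    while term is not None and term not in seen:
        seen.add(term)
        yield term
        term = is_a.get(term)


def group_for(term, subset, is_a=IS_A):
    """Return the most specific member of `subset` that `term` rolls up to."""
    for candidate in ancestors(term, is_a):
        if candidate in subset:
            return candidate
    return None  # term is outside the chosen grouping
```

With this layout, a curator can annotate to the leafiest term the evidence supports (or to gene_product when nothing more is known), while displays group by whatever subset is chosen, for example `group_for("miRNA", {"RNA", "protein"})`.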
From @kltm on Mar 21 https://github.com/geneontology/go-site/issues/2246
These are the counts and currently used values in post-QC GAFs:
570 antisense_lncRNA
1 antisense_lncRNA_gene
6262 antisense_RNA
188 autocatalytically_spliced_intron
543618 gene
362496 gene_product
170 gene_segment
4 guide_RNA
2562 hammerhead_ribozyme
2 lincRNA
132472 lncRNA
23 lnc_RNA
18 lncRNA_gene
32965 miRNA
2 miRNA_gene
1719451 misc_RNA
47234 mRNA
201054 ncRNA
8172 other
464 piRNA
2923 precursor_RNA
147790 pre_miRNA
1357853612 protein
406877 protein_coding_gene
22567 protein_complex
1005 pseudogene
72 pseudogenic_transcript
3606 ribozyme
434 RNA
5547 RNase_MRP_RNA
306437 RNase_P_RNA
12871329 rRNA
192 scaRNA
268 scRNA
1 scRNA_gene
15 siRNA
1 sncRNA
362852 snoRNA
871408 snRNA
170884 sRNA
195433 SRP_RNA
1053 telomerase_RNA
103559 tmRNA
2 transposable_element_gene
10232300 tRNA
9 tRNA_gene
2 uORF
8 vault_RNA
6 Y_RNA
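A tally like the one above could be produced, and checked for near-duplicate spellings, with something along these lines. This is a sketch under the assumption that the type is read from column 12 of GAF 2.x lines; the function names and the "ignore case and underscores" rule are hypothetical choices for illustration.

```python
from collections import Counter, defaultdict

DB_OBJECT_TYPE_COL = 11  # column 12 of a GAF 2.x line, zero-based


def count_types(lines):
    """Count DB Object Type values over an iterable of GAF lines."""
    counts = Counter()
    for line in lines:
        if line.startswith("!") or not line.strip():
            continue  # skip header/comment and blank lines
        cols = line.rstrip("\n").split("\t")
        if len(cols) > DB_OBJECT_TYPE_COL:
            counts[cols[DB_OBJECT_TYPE_COL]] += 1
    return counts


def spelling_variants(types):
    """Group type labels that differ only by case or underscores."""
    groups = defaultdict(set)
    for label in types:
        groups[label.replace("_", "").lower()].add(label)
    # Keep only keys that more than one distinct spelling maps to.
    return {key: sorted(labels) for key, labels in groups.items() if len(labels) > 1}
```

Run over the values listed above, `spelling_variants` would put lncRNA and lnc_RNA in the same group, which is the kind of inconsistency a QC report could surface for the contributing group to fix.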
If we want types (which are useful), maybe we should all specify protein, ncRNA, miRNA, etc. and ditch "gene_product" (if you don't know what type it is, then you probably wouldn't be annotating it).