Closed cmungall closed 4 months ago
@pgaudet Could we add this to the GORULEs project, as it may be due to a filter?
You mean there is a GORULE that changes the entity type? Could you please point to which rule that is?
Thanks, Pascale
(I don't have permissions to add this to the GO-rules project; @kltm would you please do it?)
@pgaudet To clarify, I suspect the issue is that there is a "silent rule" that is converting (or dropping and re-adding information) such that field value `lncRNA` is getting outputted as `gene_product`. Technically, this may be an incorrect implementation of GORULE:0000001; let's keep in mind the GAF 2.2 doc statement:
DB Object Type will be one of the following: protein_complex; protein; transcript; ncRNA; rRNA; tRNA; snRNA; snoRNA; any subtype of ncRNA in the [Sequence Ontology](http://www.sequenceontology.org/browser/obob.cgi). If the precise product type is unknown, gene_product should be used.
(https://geneontology.org/docs/go-annotation-file-gaf-format-2.2/#db-object-type-column-12).
With that definition, this would then include http://www.sequenceontology.org/browser/current_release/term/SO:0000655, which is `lncRNA`. Looking at annotations in AmiGO, I'd note that we have `lncRNA_gene`, `antisense_lncRNA`, and `lnc_RNA`. The last one there would be a variant of `lncRNA`, and I believe incorrect--the spec does not specify synonyms, but at the very least we should normalize to the proper term name.
My guess for whatever is going on is that the parser for col 12 is mistakenly bumping `lncRNA` and mistakenly allowing `lnc_RNA` in (or not normalizing). Ideally, we normalize to `lncRNA`; if not, I would at least expect `lncRNA` to pass in and `lnc_RNA` to be "fixed" to generic `gene_product`.
Can we come up with a fixed static list of types? Saying "any subtype of ncRNA" is not good: there are 20 subtypes of tRNA, and no one should be using these. There is also the issue of labels potentially changing. The number of annotatable, distinct, meaningful ncRNA types should be small.
I used @cmungall's example and was able to reproduce. The parser is doing a lookup and defaulting to gene_product (as given in the specs). Currently, there is an entry for 'lnc_RNA' mapped to SO:0001877, but not for 'lncRNA'. I can add an entry for 'lncRNA' and map it to SO:0001877. @pgaudet, please create a lookup for the supported types, I want to ensure all allowed types are mapped.
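For illustration, the lookup-with-default behavior described above could be sketched roughly as follows. This is not the actual parser code: the table name, the function name, and the logger name are hypothetical, and only the two labels mentioned in this comment are shown, plus the `gene_product` CURIE taken from the type list later in this thread.

```python
import logging

logger = logging.getLogger("entity_type_lookup")  # hypothetical logger name

# Hypothetical lookup table; only the labels discussed in this comment are shown.
ENTITY_TYPE_CURIES = {
    "lnc_RNA": "SO:0001877",       # existing entry
    "lncRNA": "SO:0001877",        # proposed new entry
    "gene_product": "CHEBI:33695",
}
DEFAULT_ENTITY_TYPE = "gene_product"


def resolve_entity_type(label):
    """Return (label, curie) for a col 12 value, defaulting to gene_product."""
    curie = ENTITY_TYPE_CURIES.get(label)
    if curie is None:
        # Unknown label: fall back to the generic type and say so.
        logger.warning(
            "Unmapped entity type %r; defaulting to %s", label, DEFAULT_ENTITY_TYPE
        )
        return DEFAULT_ENTITY_TYPE, ENTITY_TYPE_CURIES[DEFAULT_ENTITY_TYPE]
    return label, curie
```

With the extra entry in place, `resolve_entity_type("lncRNA")` keeps the label and resolves to SO:0001877 instead of silently falling through to the default.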
@mugitty @pgaudet According to spec, it's a limited list plus a set of entries from the SO. As a compromise (https://github.com/geneontology/go-site/issues/2246), as we're not actively using SO and likely have never done so, let's pull the "used" subset from the current SO and make our used list static for the moment to prevent drift and issues like we're currently having.
@mugitty Can you give me all the types you find, and which ones are not mapped? It seems lncRNA should simply be a synonym of lnc_RNA.
I can see if I find matches that are more informative than 'gene product'.
Thanks, Pascale
@pgaudet Noting from here (https://github.com/geneontology/go-site/issues/2246), I think it's technically the opposite?
After discussion with @mugitty , I am attaching the allowed entity types and the suggestions for replacement for others. We will first check errors with this list, and we can change the list if needed.
Thanks @pgaudet , I will update to use this list and output a warning, if defaulting to gene_product.
As part of the "gaf tests", it would be good to add something to make sure that the synonyms are mapping back to the proper ID (i.e. `lnc_RNA` -> `lncRNA`).
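A test along those lines might look like the sketch below. It assumes a hypothetical synonym table and helper functions; none of these names come from the real test suite, and only the `lnc_RNA` / `lncRNA` pair and the SO:0001877 mapping mentioned in this thread are used.

```python
# Hypothetical tables for illustration only.
CANONICAL_CURIES = {"lncRNA": "SO:0001877"}
SYNONYMS = {"lnc_RNA": "lncRNA"}


def canonical_label(label):
    """Map a synonym to its canonical label; other labels pass through unchanged."""
    return SYNONYMS.get(label, label)


def curie_for(label):
    """Return the CURIE for a label or synonym, or None if it is unknown."""
    return CANONICAL_CURIES.get(canonical_label(label))


def test_synonyms_map_to_proper_id():
    # Every synonym must resolve to the same ID as its canonical label.
    for synonym, canonical in SYNONYMS.items():
        assert canonical_label(synonym) == canonical
        assert curie_for(synonym) == CANONICAL_CURIES[canonical]
```

Looping over the synonym table rather than hard-coding each pair means a newly added synonym is checked without touching the test.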
@kltm, @pgaudet wants to only use the terms in the attachment. All others will default to gene_product with a gorule-0000001 warning. Based on the number of warnings, the list may be updated.
@pgaudet Clarifying that you're removing `lncRNA` (SO:0001877), only 13 of those, so mapping to...`gene_product` as mentioned in the spec? Currently, in AmiGO filters, we also have:
lncRNA_gene (6848)
What are these expected to map to? Without digging in, I think with the list you have `lncRNA_gene` -> `gene_product`? Or is the intention to use the ontology to map to `biological_region`? Perhaps we should add what is currently used?
@pgaudet , I noticed a test for MGI that was failing with the proposed code update. For example, if there is a GAF line as follows: gaf = ["MGI", "MGI:1923503", "0610006L08Rik", "enables", "GO:0003674", "MGI:MGI:2156816|GO_REF:0000015", "ND", "", "F", "RIKEN cDNA 0610006L08 gene", "", "gene", "taxon:10090", "20120430", "MGI", "", ""]
"gene" will be converted to "gene_product". Is this expected?
Hi @mugitty Can you check in this directory in all the files *-src.gaf.gz: http://snapshot.geneontology.org/products/upstream_and_raw_data/index.html
whether entity types OTHER than the following are present:
protein_coding_gene SO:0001217
protein PR:000000001
gene_product CHEBI:33695
snRNA SO:0000274
ncRNA SO:0000655
rRNA SO:0000252
mRNA SO:0000234
lincRNA SO:0001463
tRNA SO:0000253
snoRNA SO:0000275
miRNA SO:0000276
scRNA SO:0000013
piRNA SO:0001035
tmRNA SO:0000584
SRP_RNA SO:0000590
ribozyme SO:0000374
telomerase_RNA SO:0000390
RNase_P_RNA SO:0000386
antisense_RNA SO:0000644
RNase_MRP_RNA SO:0000385
guide_RNA SO:0000602
hammerhead_ribozyme SO:0000380
pseudogene SO:0000336
protein_complex GO:0032991
antisense_lncRNA SO:0001904
gene_segment SO:3000000
genetic_marker SO:0001645
biological region SO:0001411
transposable_element_gene SO:0000111
and spit out any entity type that doesn't match these, on a file-by-file basis.
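A per-file check of this kind could be sketched as below. It assumes the `*-src.gaf.gz` files have already been downloaded to a local directory and that col 12 is the twelfth tab-separated field of each non-comment line; the function names and the excerpt of allowed types are illustrative, not the full list.

```python
import gzip
from pathlib import Path

# Excerpt only; the full allowed list is the one given above.
ALLOWED_TYPES_EXCERPT = {"protein", "gene_product", "ncRNA", "lincRNA", "protein_complex"}


def unexpected_types(lines, allowed):
    """Return the set of col 12 values in GAF lines that are not in `allowed`."""
    found = set()
    for line in lines:
        if line.startswith("!"):  # GAF header/comment line
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 12:  # blank or truncated line
            continue
        if cols[11] not in allowed:
            found.add(cols[11])
    return found


def scan_directory(directory, allowed):
    """Map each *-src.gaf.gz file name to its set of unexpected entity types."""
    report = {}
    for path in sorted(Path(directory).glob("*-src.gaf.gz")):
        with gzip.open(path, "rt", encoding="utf-8") as handle:
            report[path.name] = unexpected_types(handle, allowed)
    return report
```

Keeping the line-level check separate from the file handling makes it easy to exercise on a handful of in-memory lines before running it over the whole snapshot directory.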
Alternatively - or in addition, could you give me a count of these different types:
protein_coding_gene SO:0001217
protein PR:000000001
gene_product CHEBI:33695
snRNA SO:0000274
ncRNA SO:0000655
rRNA SO:0000252
mRNA SO:0000234
lincRNA SO:0001463
tRNA SO:0000253
snoRNA SO:0000275
miRNA SO:0000276
scRNA SO:0000013
piRNA SO:0001035
tmRNA SO:0000584
SRP_RNA SO:0000590
ribozyme SO:0000374
telomerase_RNA SO:0000390
RNase_P_RNA SO:0000386
antisense_RNA SO:0000644
RNase_MRP_RNA SO:0000385
guide_RNA SO:0000602
hammerhead_ribozyme SO:0000380
pseudogene SO:0000336
protein_complex GO:0032991
antisense_lncRNA SO:0001904
gene_segment SO:3000000
genetic_marker SO:0001645
biological region SO:0001411
transposable_element_gene SO:0000111
gene SO:0000704
lincRNA_gene SO:0001641
lncRNA_gene SO:0002127
miRNA_gene SO:0001265
mRNA SO:0000234
ncRNA_gene SO:0001263
primary_transcript SO:0000185
RNA SO:0000356
RNase_MRP_RNA_gene SO:0001640
RNase_P_RNA_gene SO:0001639
rRNA_gene SO:0001637
scRNA_gene SO:0001266
sense_intronic_ncRNA_gene SO:0002184
sense_overlap_ncRNA_gene SO:0002183
snoRNA_gene SO:0001267
snRNA_gene SO:0001268
SRP_RNA_gene SO:0001269
telomerase_RNA_gene SO:0001643
transcript SO:0000673
tRNA_gene SO:0001272
Thanks, Pascale
@mugitty @pgaudet I'm running job to get numbers on col12.
570 antisense_lncRNA
1 antisense_lncRNA_gene
6262 antisense_RNA
188 autocatalytically_spliced_intron
543618 gene
362496 gene_product
170 gene_segment
4 guide_RNA
2562 hammerhead_ribozyme
2 lincRNA
132472 lncRNA
23 lnc_RNA
18 lncRNA_gene
32965 miRNA
2 miRNA_gene
1719451 misc_RNA
47234 mRNA
201054 ncRNA
8172 other
464 piRNA
2923 precursor_RNA
147790 pre_miRNA
1357853612 protein
406877 protein_coding_gene
22567 protein_complex
1005 pseudogene
72 pseudogenic_transcript
3606 ribozyme
434 RNA
5547 RNase_MRP_RNA
306437 RNase_P_RNA
12871329 rRNA
192 scaRNA
268 scRNA
1 scRNA_gene
15 siRNA
1 sncRNA
362852 snoRNA
871408 snRNA
170884 sRNA
195433 SRP_RNA
1053 telomerase_RNA
103559 tmRNA
2 transposable_element_gene
10232300 tRNA
9 tRNA_gene
2 uORF
8 vault_RNA
6 Y_RNA
(Noting that reactome and zfin need to [obviously] fix their GAF.)
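For anyone wanting to reproduce a tally like the one above, a minimal sketch follows. It is not the job that produced these numbers; it only assumes that col 12 is the twelfth tab-separated field and that lines starting with `!` are headers, and the function name is made up for illustration.

```python
from collections import Counter


def count_entity_types(lines):
    """Tally col 12 (DB Object Type) values across an iterable of GAF lines."""
    counts = Counter()
    for line in lines:
        if line.startswith("!"):  # GAF header/comment line
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) >= 12:  # skip blank or truncated lines
            counts[cols[11]] += 1
    return counts


def format_counts(counts):
    """Render counts as 'count type' lines, sorted by type name."""
    return "\n".join(f"{n} {label}" for label, n in sorted(counts.items()))
```

Passing an open file handle (or a chain of handles over several files) as `lines` gives a combined tally without loading the files into memory.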
@pgaudet, do you still want me to output the types for each file or is @kltm 's output good enough for now?
> (Noting that reactome and zfin need to [obviously] fix their GAF.)

Isn't this gorule-0000001? It seems it should be a hard error.
@mugitty
Yes @kltm 's output is fine to get started.
@pgaudet, just to confirm: the types you added to this ticket on February 26, 2024 are valid? I have already updated the code; can I open a pull request?
For reference - these are the types that GOA loads from RNAcentral:
rRNA 12894606
tRNA 10330538
misc_RNA 1720365
snRNA 833596
snoRNA 350816
RNase_P_RNA 307952
ncRNA 199135
SRP_RNA 194578
sRNA 167541
pre_miRNA 138520
lncRNA 114530
tmRNA 104453
miRNA 24008
other 7295
antisense_RNA 6196
RNase_MRP_RNA 5464
ribozyme 3603
precursor_RNA 2893
hammerhead_ribozyme 2556
telomerase_RNA 885
piRNA 445
scRNA 262
scaRNA 195
autocatalytically_spliced_intron 189
siRNA 15
vault_RNA 8
Y_RNA 6
guide_RNA 2
@kltm should I make a new GO rule for entity types?
@pgaudet That would be great.
Hi @mugitty
Here are repairs we should implement:
These have to be added to the GO list of CURIEs:
That should take care of many issues.
However, these types are not in SO:
@mugitty and I propose to continue to change them to 'gene product' and output a warning.
Thanks, Pascale
I need to add tests for the entity types; can we first check the snapshot to see if disallowed types are being reported?
This problem is fixed:
lncRNA amiGO:
lncRNA staging
Source:
what we end up publishing:
Aside for @alexsign (this should probably be its own ticket):
Why don't we get gene symbols for RNA types? This one (Xist) clearly has one https://www.genenames.org/data/gene-symbol-report/#!/hgnc_id/HGNC:12810 - why don't we just propagate across from HGNC?
And not to overstuff this issue, but there are issues with general RNAcentral/HGNC propagation on AGR. Recall that AGR uses HGNC IDs: https://www.alliancegenome.org/gene/HGNC:12810 has no GO annotation
Even though this gene obviously has a known function: https://amigo.geneontology.org/amigo/gene_product/RNAcentral:URS000075D95B_9606