geneontology / go-site

A collection of metadata, tools, and files associated with the Gene Ontology public web presence.
http://geneontology.org
BSD 3-Clause "New" or "Revised" License

RNAC RNA types are getting mangled by the pipeline (tested by gorule-0000001) #2246

Closed cmungall closed 1 month ago

cmungall commented 7 months ago

Source:

✗ curl -L -s https://ftp.ebi.ac.uk/pub/databases/GO/goa/HUMAN/goa_human_rna.gaf.gz | gzip -dc | grep URS000075D95B_9606 | cut -f2,3,9-12
URS00004176D4_9606  URS00004176D4_9606  F   Homo sapiens (human) hsa-miR-185-5p     miRNA
URS000075D95B_9606  URS000075D95B_9606  F   Homo sapiens (human) X inactive specific transcript (XIST)      lncRNA
URS000075D95B_9606  URS000075D95B_9606  F   Homo sapiens (human) X inactive specific transcript (XIST)      lncRNA
URS000075D95B_9606  URS000075D95B_9606  F   Homo sapiens (human) X inactive specific transcript (XIST)      lncRNA
URS000075D95B_9606  URS000075D95B_9606  P   Homo sapiens (human) X inactive specific transcript (XIST)      lncRNA
URS000075D95B_9606  URS000075D95B_9606  P   Homo sapiens (human) X inactive specific transcript (XIST)      lncRNA
URS000075D95B_9606  URS000075D95B_9606  P   Homo sapiens (human) X inactive specific transcript (XIST)      lncRNA
URS000075D95B_9606  URS000075D95B_9606  P   Homo sapiens (human) X inactive specific transcript (XIST)      lncRNA
URS000075D95B_9606  URS000075D95B_9606  P   Homo sapiens (human) X inactive specific transcript (XIST)      lncRNA
URS000075D95B_9606  URS000075D95B_9606  C   Homo sapiens (human) X inactive specific transcript (XIST)      lncRNA
URS000075D95B_9606  URS000075D95B_9606  C   Homo sapiens (human) X inactive specific transcript (XIST)      lncRNA
URS000075D95B_9606  URS000075D95B_9606  C   Homo sapiens (human) X inactive specific transcript (XIST)      lncRNA

what we end up publishing:

✗ curl -L -s http://current.geneontology.org/annotations/goa_human_rna.gaf.gz | gzip -dc | grep URS000075D95B_9606 | cut -f2,3,9-12
URS00004176D4_9606  URS00004176D4_9606  F   Homo sapiens (human) hsa-miR-185-5p     miRNA
URS000075D95B_9606  URS000075D95B_9606  F   Homo sapiens (human) X inactive specific transcript (XIST)      gene_product
URS000075D95B_9606  URS000075D95B_9606  F   Homo sapiens (human) X inactive specific transcript (XIST)      gene_product
URS000075D95B_9606  URS000075D95B_9606  P   Homo sapiens (human) X inactive specific transcript (XIST)      gene_product
URS000075D95B_9606  URS000075D95B_9606  P   Homo sapiens (human) X inactive specific transcript (XIST)      gene_product
URS000075D95B_9606  URS000075D95B_9606  P   Homo sapiens (human) X inactive specific transcript (XIST)      gene_product
URS000075D95B_9606  URS000075D95B_9606  P   Homo sapiens (human) X inactive specific transcript (XIST)      gene_product
URS000075D95B_9606  URS000075D95B_9606  P   Homo sapiens (human) X inactive specific transcript (XIST)      gene_product
URS000075D95B_9606  URS000075D95B_9606  C   Homo sapiens (human) X inactive specific transcript (XIST)      gene_product
URS000075D95B_9606  URS000075D95B_9606  C   Homo sapiens (human) X inactive specific transcript (XIST)      gene_product
URS000075D95B_9606  URS000075D95B_9606  C   Homo sapiens (human) X inactive specific transcript (XIST)      gene_product
  1. The RNA type should be preserved.
  2. We should have a specific QC check on RNAC: anything with an RNAC ID must be an RNA subtype.
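A minimal sketch of the QC check proposed in point 2, assuming the GAF column 1 value for these rows is `RNAcentral` (as in the AmiGO identifier) and that the entity type sits in column 12. The function name and the `ALLOWED_RNA_TYPES` set are hypothetical and only illustrate the idea; they are not the pipeline's actual code or the agreed list.

```python
from typing import Optional

# Illustrative subset only; the real allowed list is discussed later in the thread.
ALLOWED_RNA_TYPES = {"ncRNA", "lncRNA", "miRNA", "rRNA", "tRNA", "snRNA", "snoRNA"}


def check_rnacentral_type(gaf_line: str) -> Optional[str]:
    """Return an error message if an RNAcentral row has a non-RNA type, else None."""
    if gaf_line.startswith("!"):
        return None  # header or comment line
    cols = gaf_line.rstrip("\n").split("\t")
    if len(cols) < 12:
        return None  # malformed row; left to other checks
    db, obj_id, obj_type = cols[0], cols[1], cols[11]
    # Assumption: RNAcentral rows carry "RNAcentral" in column 1.
    if db == "RNAcentral" and obj_type not in ALLOWED_RNA_TYPES:
        return f"{db}:{obj_id} has non-RNA type {obj_type!r}"
    return None
```

Under these assumptions, a published row such as the XIST one above (type `gene_product`) would be reported, while the source row (type `lncRNA`) would pass.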

Aside for @alexsign; this should probably be its own ticket:

Why don't we get gene symbols for RNA types? This one (Xist) clearly has one https://www.genenames.org/data/gene-symbol-report/#!/hgnc_id/HGNC:12810 - why don't we just propagate across from HGNC?

And not to overstuff this issue, but there are issues with general RNAC/HGNC propagation on AGR. Recall AGR uses HGNC IDs: https://www.alliancegenome.org/gene/HGNC:12810 (no GO annotation)

Even though this gene obviously has a known function: https://amigo.geneontology.org/amigo/gene_product/RNAcentral:URS000075D95B_9606

kltm commented 7 months ago

@pgaudet Could we add this to the GORULEs project, as it may be due to a filter?

pgaudet commented 7 months ago

You mean there is a GORULE that changes the entity type? Could you please point to which rule that is?

Thanks, Pascale

pgaudet commented 7 months ago

(I don't have permissions to add this to the GO-rules project; @kltm would you please do it?)

kltm commented 7 months ago

@pgaudet To clarify, I suspect the issue is that there is a "silent rule" that is converting (or dropping and re-adding information) such that field value lncRNA is getting outputted as gene_product. Technically, this may be an incorrect implementation of GORULE:0000001; let's keep in mind the GAF 2.2 doc statement:

DB Object Type will be one of the following: protein_complex; protein; transcript; ncRNA; rRNA; tRNA; snRNA; snoRNA; any subtype of ncRNA in the [Sequence Ontology](http://www.sequenceontology.org/browser/obob.cgi). If the precise product type is unknown, gene_product should be used. (https://geneontology.org/docs/go-annotation-file-gaf-format-2.2/#db-object-type-column-12).

With that definition, this would then include http://www.sequenceontology.org/browser/current_release/term/SO:0000655, which is lncRNA. Looking at annotations in AmiGO, I'd note that we have lncRNA_gene, antisense_lncRNA, and lnc_RNA. The last one there would be a variant of lncRNA, and I believe incorrect--the spec does not specify synonyms, but at the very least we should normalize to the proper term name.

My guess for whatever is going on is that the parser for col 12 is mistakenly bumping lncRNA and mistakenly allowing lnc_RNA in (or not normalizing). Ideally, we normalize to lncRNA; if not, I would at least expect lncRNA to pass in and lnc_RNA to be "fixed" to generic gene_product.

cmungall commented 7 months ago

Can we come up with a fixed static list of types? Saying "any subtype of ncRNA" is not good; there are 20 subtypes of tRNA, and no one should be using these. There is also the issue of labels potentially changing. The number of annotatable, distinct, meaningful ncRNA types should be small.

mugitty commented 6 months ago

I used @cmungall's example and was able to reproduce. The parser is doing a lookup and defaulting to gene_product (as given in the specs). Currently, there is an entry for 'lnc_RNA' mapped to SO:0001877, but not for 'lncRNA'. I can add an entry for 'lncRNA' and map it to SO:0001877. @pgaudet, please create a lookup for the supported types; I want to ensure all allowed types are mapped.
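The lookup-with-default behaviour described above can be sketched as follows. This is an illustration under stated assumptions, not the parser's actual code: `TYPE_TO_CURIE`, `FIXED_LOOKUP`, and `resolve_type` are hypothetical names, and the table holds only a few label/CURIE pairs quoted in this thread.

```python
import logging
from typing import Dict, Tuple

logger = logging.getLogger(__name__)

DEFAULT_TYPE = "gene_product"

# Hypothetical lookup with the gap described above: 'lnc_RNA' is present, 'lncRNA' is not.
TYPE_TO_CURIE: Dict[str, str] = {
    "gene_product": "CHEBI:33695",
    "protein": "PR:000000001",
    "miRNA": "SO:0000276",
    "lnc_RNA": "SO:0001877",
}

# The proposed repair: add 'lncRNA' with the same CURIE.
FIXED_LOOKUP: Dict[str, str] = {**TYPE_TO_CURIE, "lncRNA": "SO:0001877"}


def resolve_type(label: str, lookup: Dict[str, str] = TYPE_TO_CURIE) -> Tuple[str, str]:
    """Return (label, CURIE); unknown labels fall back to gene_product with a warning."""
    if label in lookup:
        return label, lookup[label]
    logger.warning("Invalid subject type %r: defaulting to %r", label, DEFAULT_TYPE)
    return DEFAULT_TYPE, lookup[DEFAULT_TYPE]
```

With the original table, `resolve_type("lncRNA")` falls through to `gene_product`, which matches the mangling reported in this issue; with `FIXED_LOOKUP` the label is preserved.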

kltm commented 6 months ago

@mugitty @pgaudet According to spec, it's a limited list plus a set of entries from the SO. As a compromise (https://github.com/geneontology/go-site/issues/2246), as we're not actively using SO and likely have never done so, let's pull the "used" subset from the current SO and make our used list static for the moment to prevent drift and issues like we're currently having.

pgaudet commented 6 months ago

@mugitty Can you give me all the types you find, and which ones are not mapped? It seems lncRNA should simply be a synonym of lnc_RNA.

I can see if I find matches that are more informative than 'gene product'.

Thanks, Pascale

kltm commented 6 months ago

@pgaudet Noting from here (https://github.com/geneontology/go-site/issues/2246), I think it's technically the opposite?

pgaudet commented 6 months ago

After discussion with @mugitty , I am attaching the allowed entity types and the suggestions for replacement for others. We will first check errors with this list, and we can change the list if needed.

2024-02-26-entities.xlsx

mugitty commented 6 months ago

Thanks @pgaudet , I will update to use this list and output a warning, if defaulting to gene_product.

kltm commented 6 months ago

As part of the "gaf tests", it would be good to add something to make sure that the synonyms are mapping back to the proper ID (i.e. lnc_RNA -> lncRNA).
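A test of the kind suggested here might look like the sketch below, written as plain pytest-style functions. It assumes the direction given in this comment (synonym `lnc_RNA` normalizes to `lncRNA`); the tables and `normalize_type` are hypothetical stand-ins for whatever the parser actually exposes.

```python
from typing import Dict

# Hypothetical tables: synonym -> canonical label, and canonical label -> CURIE.
SYNONYM_TO_LABEL: Dict[str, str] = {"lnc_RNA": "lncRNA"}
LABEL_TO_CURIE: Dict[str, str] = {"lncRNA": "SO:0001877", "miRNA": "SO:0000276"}


def normalize_type(label: str) -> str:
    """Map a known synonym to its canonical label; leave other labels unchanged."""
    return SYNONYM_TO_LABEL.get(label, label)


def test_synonym_maps_to_canonical_label() -> None:
    assert normalize_type("lnc_RNA") == "lncRNA"


def test_synonym_and_label_resolve_to_same_curie() -> None:
    from_synonym = LABEL_TO_CURIE[normalize_type("lnc_RNA")]
    from_label = LABEL_TO_CURIE[normalize_type("lncRNA")]
    assert from_synonym == from_label == "SO:0001877"


def test_every_synonym_target_has_a_curie() -> None:
    # Guards against a synonym pointing at a label that was later removed.
    for target in SYNONYM_TO_LABEL.values():
        assert target in LABEL_TO_CURIE
```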

mugitty commented 6 months ago

@kltm, @pgaudet wants to use only the terms in the attachment. All others will default to gene_product with a gorule-0000001 warning. Based on the number of warnings, the list may be updated.

kltm commented 6 months ago

@pgaudet Clarifying that you're removing lncRNA (SO:0001877), only 13 of those, so mapping to...gene_product as mentioned in the spec? Currently, in AmiGO filters, we also have:

lncRNA_gene (6848)

What are these expected to map to? Without digging in, I think with the list you have lncRNA_gene -> gene_product? Or is the intention to use the ontology to map to biological_region? Perhaps we should add what is currently used?

mugitty commented 6 months ago

@pgaudet , I noticed a test for MGI that was failing with the proposed code update. For example, if there is a GAF line as follows: gaf = ["MGI", "MGI:1923503", "0610006L08Rik", "enables", "GO:0003674", "MGI:MGI:2156816|GO_REF:0000015", "ND", "", "F", "RIKEN cDNA 0610006L08 gene", "", "gene", "taxon:10090", "20120430", "MGI", "", ""]

"gene" will be converted to "gene_product". Is this expected?

pgaudet commented 5 months ago

Hi @mugitty Can you check in this directory in all the files *-src.gaf.gz: http://snapshot.geneontology.org/products/upstream_and_raw_data/index.html

whether entity types OTHER than the following are present:

protein_coding_gene SO:0001217
protein PR:000000001
gene_product CHEBI:33695
snRNA SO:0000274
ncRNA SO:0000655
rRNA SO:0000252
mRNA SO:0000234
lincRNA SO:0001463
tRNA SO:0000253
snoRNA SO:0000275
miRNA SO:0000276
scRNA SO:0000013
piRNA SO:0001035
tmRNA SO:0000584
SRP_RNA SO:0000590
ribozyme SO:0000374
telomerase_RNA SO:0000390
RNase_P_RNA SO:0000386
antisense_RNA SO:0000644
RNase_MRP_RNA SO:0000385
guide_RNA SO:0000602
hammerhead_ribozyme SO:0000380
pseudogene SO:0000336
protein_complex GO:0032991
antisense_lncRNA SO:0001904
gene_segment SO:3000000
genetic_marker SO:0001645
biological region SO:0001411
transposable_element_gene SO:0000111

and spit out any entity type that doesn't match these, on a file-by-file basis.
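One way to produce the requested file-by-file report is sketched below. It assumes the `*-src.gaf.gz` files have already been downloaded into a local directory and that the allowed labels are supplied by the caller; `unexpected_types` and `report_by_file` are hypothetical helper names, not existing pipeline functions.

```python
import gzip
from collections import Counter
from pathlib import Path
from typing import Dict, Iterable, Set


def unexpected_types(lines: Iterable[str], allowed: Set[str]) -> Counter:
    """Count column-12 values that are not in the allowed set."""
    counts: Counter = Counter()
    for line in lines:
        if line.startswith("!") or not line.strip():
            continue  # skip header/comment and blank lines
        cols = line.rstrip("\n").split("\t")
        if len(cols) > 11 and cols[11] not in allowed:
            counts[cols[11]] += 1
    return counts


def report_by_file(directory: str, allowed: Set[str]) -> Dict[str, Counter]:
    """Map each *-src.gaf.gz file name to its disallowed types, omitting clean files."""
    result: Dict[str, Counter] = {}
    for path in sorted(Path(directory).glob("*-src.gaf.gz")):
        with gzip.open(path, "rt", encoding="utf-8") as handle:
            found = unexpected_types(handle, allowed)
        if found:
            result[path.name] = found
    return result
```

Passing the labels from the allowed list as `allowed` yields, per file, only the types that would need a mapping decision.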

pgaudet commented 5 months ago

Alternatively - or in addition, could you give me a count of these different types:

protein_coding_gene SO:0001217
protein PR:000000001
gene_product CHEBI:33695
snRNA SO:0000274
ncRNA SO:0000655
rRNA SO:0000252
mRNA SO:0000234
lincRNA SO:0001463
tRNA SO:0000253
snoRNA SO:0000275
miRNA SO:0000276
scRNA SO:0000013
piRNA SO:0001035
tmRNA SO:0000584
SRP_RNA SO:0000590
ribozyme SO:0000374
telomerase_RNA SO:0000390
RNase_P_RNA SO:0000386
antisense_RNA SO:0000644
RNase_MRP_RNA SO:0000385
guide_RNA SO:0000602
hammerhead_ribozyme SO:0000380
pseudogene SO:0000336
protein_complex GO:0032991
antisense_lncRNA SO:0001904
gene_segment SO:3000000
genetic_marker SO:0001645
biological region SO:0001411
transposable_element_gene SO:0000111
gene SO:0000704
lincRNA_gene SO:0001641
lncRNA_gene SO:0002127
miRNA_gene SO:0001265
mRNA SO:0000234
ncRNA_gene SO:0001263
primary_transcript SO:0000185
RNA SO:0000356
RNase_MRP_RNA_gene SO:0001640
RNase_P_RNA_gene SO:0001639
rRNA_gene SO:0001637
scRNA_gene SO:0001266
sense_intronic_ncRNA_gene SO:0002184
sense_overlap_ncRNA_gene SO:0002183
snoRNA_gene SO:0001267
snRNA_gene SO:0001268
SRP_RNA_gene SO:0001269
telomerase_RNA_gene SO:0001643
transcript SO:0000673
tRNA_gene SO:0001272

Thanks, Pascale

kltm commented 5 months ago

@mugitty @pgaudet I'm running a job to get numbers on col12.
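For reference, a tally of column 12 can be sketched in a few lines. This is not the job that was actually run; `tally_col12` and `format_tally` are hypothetical names, and the output layout only loosely imitates a count-then-label listing.

```python
from collections import Counter
from typing import Iterable


def tally_col12(lines: Iterable[str]) -> Counter:
    """Count every column-12 (entity type) value, ignoring header/comment lines."""
    rows = (line.rstrip("\n").split("\t") for line in lines if not line.startswith("!"))
    return Counter(cols[11] for cols in rows if len(cols) > 11)


def format_tally(counts: Counter) -> str:
    """Render counts right-aligned, one type per line, sorted case-insensitively."""
    width = max((len(str(n)) for n in counts.values()), default=1)
    names = sorted(counts, key=str.lower)
    return "\n".join(f"{counts[name]:>{width}} {name}" for name in names)
```

Feeding it the decompressed lines of all source GAFs (for example by chaining `gzip.open(path, "rt")` handles) gives one combined count per entity type.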

kltm commented 5 months ago

    570 antisense_lncRNA
      1 antisense_lncRNA_gene
   6262 antisense_RNA
    188 autocatalytically_spliced_intron
 543618 gene
 362496 gene_product
    170 gene_segment
      4 guide_RNA
   2562 hammerhead_ribozyme
      2 lincRNA
 132472 lncRNA
     23 lnc_RNA
     18 lncRNA_gene
  32965 miRNA
      2 miRNA_gene
1719451 misc_RNA
  47234 mRNA
 201054 ncRNA
   8172 other
    464 piRNA
   2923 precursor_RNA
 147790 pre_miRNA
1357853612 protein
 406877 protein_coding_gene
  22567 protein_complex
   1005 pseudogene
     72 pseudogenic_transcript
   3606 ribozyme
    434 RNA
   5547 RNase_MRP_RNA
 306437 RNase_P_RNA
12871329 rRNA
    192 scaRNA
    268 scRNA
      1 scRNA_gene
     15 siRNA
      1 sncRNA
 362852 snoRNA
 871408 snRNA
 170884 sRNA
 195433 SRP_RNA
   1053 telomerase_RNA
 103559 tmRNA
      2 transposable_element_gene
10232300 tRNA
      9 tRNA_gene
      2 uORF
      8 vault_RNA
      6 Y_RNA

kltm commented 5 months ago

(Noting that reactome and zfin need to [obviously] fix their GAF.)

mugitty commented 5 months ago

@pgaudet, do you still want me to output the types for each file or is @kltm 's output good enough for now?

pgaudet commented 5 months ago

(Noting that reactome and zfin need to [obviously] fix their GAF.)

Isn't this gorule-0000001 ? It seems it should be a hard error.

pgaudet commented 5 months ago

@mugitty

Yes @kltm 's output is fine to get started.

mugitty commented 5 months ago

@pgaudet, just to confirm: the types you added to this ticket on February 26, 2024 are valid? I have already updated the code.

Can I do a pull request?

pgaudet commented 4 months ago

For reference - these are the types that GOA loads from RNA central

rRNA 12894606
tRNA 10330538
misc_RNA 1720365
snRNA 833596
snoRNA 350816
RNase_P_RNA 307952
ncRNA 199135
SRP_RNA 194578
sRNA 167541
pre_miRNA 138520
lncRNA 114530
tmRNA 104453
miRNA 24008
other 7295
antisense_RNA 6196
RNase_MRP_RNA 5464
ribozyme 3603
precursor_RNA 2893
hammerhead_ribozyme 2556
telomerase_RNA 885
piRNA 445
scRNA 262
scaRNA 195
autocatalytically_spliced_intron 189
siRNA 15
vault_RNA 8
Y_RNA 6
guide_RNA 2

pgaudet commented 4 months ago

@kltm should I make a new GO rule for entity types?

kltm commented 4 months ago

@pgaudet That would be great.

pgaudet commented 3 months ago

Hi @mugitty

Here are repairs we should implement:

These have to be added to the GO list of CURIEs:

That should take care of many issues.

However, these types are not in SO:

@mugitty and I propose to continue to change them to 'gene product' and output a warning.

Thanks, Pascale

pgaudet commented 1 month ago

I need to add tests for the entity types; can we first check snapshot to see if disallowed types are being reported?

pgaudet commented 1 month ago

This problem is fixed:

Image

pgaudet commented 1 month ago

lncRNA AmiGO:

Image

lncRNA staging

Image

pgaudet commented 1 month ago

Correctly fails test in test GAF

WARNING - Invalid subject type:defaulting to 'gene_product'--UniProtKB A1B2F4 aztD acts upstream of GO:0097696 PMID:26468286 IDA F GORULE_TEST:0000001-28 misc_RNA taxon:318586 20230427 GO_Central