redundancy from RNA central

kltm commented 6 years ago

From @ValWood on June 15, 2018 17:34

Having almost sorted out redundancy in proteins, we now have multiple copies of the same RNA from RNA central (at least 5 copies of telomerase RNA here). Who should I tag?

http://amigo.geneontology.org/amigo/search/bioentity?q=*:*&fq=regulates_closure:%22GO:0005697%22&sfq=document_category:%22bioentity%22

Copied from original issue: geneontology/amigo#511

kltm commented 6 years ago

From @ValWood on July 5, 2018 18:3

Who would look into this? I'm not sure...

kltm commented 6 years ago

From @ValWood on July 5, 2018 18:3

@cmungall @kltm ?

kltm commented 6 years ago

Let's start this out on the annotation tracker.

cmungall commented 6 years ago

Can we add @alexsign to this tracker? Assigned @tonysawfordebi for now, although this may be upstream of him at RNA Central

Can you give us two RNA Central IDs you think should be collapsed here @ValWood? Perhaps these two?

ValWood commented 6 years ago

I'm not sure now. I'm sure that last week, when I filtered GO:0005697 telomerase holoenzyme complex for human and product type RNA, I got 5 entries, but now I only see these 2.

As far as I'm aware, there is only one RNA molecule that is part of the telomerase complex. Prof. @RLovering, is that correct?

So yes, these two are either the same entity, or one of them isn't really a telomerase RNA (there are other telomere-derived RNAs).

ValWood commented 6 years ago

both have telomerase_RNA http://www.sequenceontology.org/browser/current_svn/term/SO:0000390 (there is only one of these)

rachhuntley commented 6 years ago

This is probably because unique RNAcentral identifiers are assigned to every distinct sequence. I'm tagging Anton and Blake from RNAcentral: @AntonPetrov @blakesweeney

AntonPetrov commented 6 years ago

Indeed, as @rachhuntley suggests, unique RNAcentral identifiers are assigned to every distinct sequence, and in this case several databases have different sequences for telomerase RNA:

URS00004A7003_9606: Ensembl, GENCODE, LNCipedia, Rfam
URS00004A7003_9606: HGNC, RefSeq, ENA

You can see the overlapping sequences in the genome browser section of any of these sequences:

Depending on the goal, there may be different approaches:

raise this with HGNC/Ensembl/RefSeq
pick one of these databases and always choose its sequences for annotation
go back to the literature to decide which sequence is better supported
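The second option above (always choosing one database's sequences) could be sketched roughly as follows. This is only an illustration: the priority order in `PREFERRED_SOURCES`, the function name `pick_representative`, and the placeholder IDs are all assumptions, not anything RNAcentral or GO provides.

```python
from typing import Dict, List, Optional, Set

# Hypothetical priority order for source databases (an assumption for this sketch).
PREFERRED_SOURCES: List[str] = ["HGNC", "RefSeq", "Ensembl"]


def pick_representative(candidates: Dict[str, Set[str]]) -> Optional[str]:
    """Return the ID backed by the highest-priority source, or None if no source matches."""
    for source in PREFERRED_SOURCES:
        # Sort so the choice is deterministic when several IDs share a source.
        matches = sorted(rna_id for rna_id, dbs in candidates.items() if source in dbs)
        if matches:
            return matches[0]
    return None


# Placeholder IDs, not real RNAcentral accessions.
candidates = {
    "URS_EXAMPLE_A_9606": {"Ensembl", "GENCODE", "LNCipedia", "Rfam"},
    "URS_EXAMPLE_B_9606": {"HGNC", "RefSeq", "ENA"},
}
chosen = pick_representative(candidates)
print(chosen)
```

With this ordering, the entry that lists HGNC among its sources is selected; reordering `PREFERRED_SOURCES` changes which database wins.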

ValWood commented 6 years ago

The most practical/sensible option would be to use HGNC.

ValWood commented 6 years ago

Then we would be sure we were using the correct symbols.

ValWood commented 6 years ago

Hang on, isn't the Rfam entry an HMM of the conserved region and not the gene feature per se?

AntonPetrov commented 6 years ago

Rfam is a source of RNAcentral sequences in 2 ways:

Rfam itself submits genome annotations to RNAcentral
Some entries in other databases originate from Rfam hits. It's possible that the URS00004A7003_9606 sequence got into Ensembl, GENCODE, and LNCipedia on the basis of an Rfam match, but I can't tell for sure.

On top of that, we run Rfam covariance models on all RNAcentral sequences to propagate GO terms and perform some quality checks.

Does that help @ValWood?

RLovering commented 6 years ago

Hi, in case this helps: Nancy annotated 2 TERC RNAcentral IDs because, in her words: Both of the RNAcentral identifiers for TERC were annotated identically in the GO database to ensure coverage of both identifiers; URS00004A7003_9606 (mapped to NCBI GeneID 7012) and URS00004416C5_9606 (mapped to Ensembl GeneID ENSG00000270141.3)

ValWood commented 6 years ago

OK thanks @RLovering @AntonPetrov

Got it, the GO annotations come in via UCL.

So @AntonPetrov's suggestion that we need to choose one ID set to annotate to for human is correct. Other organisms will use MOD IDs, so only human will be affected in the short term (I presume?)

All other resources can propagate from GO, as they do with protein annotation, but it isn't our job to directly annotate every entry in every other database, we should only need to annotate to a single database object.

Basically we need to make a decision about the ID set so that we only get one RNA entry in GO per locus. Everything else should be alternative IDs or db-xrefs.
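To make the "one entry per locus, everything else as alternative IDs" idea concrete, here is a minimal sketch. The locus key `EXAMPLE_LOCUS:TERC`, the mapping `id_to_locus`, and the function `collapse_by_locus` are hypothetical; only the two RNAcentral IDs and the GO term come from this thread.

```python
from typing import Dict, Iterable, Set, Tuple


def collapse_by_locus(
    annotations: Iterable[Tuple[str, str]],
    id_to_locus: Dict[str, str],
) -> Dict[str, Dict[str, Set[str]]]:
    """Merge (rna_id, go_term) pairs into one entry per locus, keeping source IDs as alt_ids."""
    merged: Dict[str, Dict[str, Set[str]]] = {}
    for rna_id, go_term in annotations:
        # IDs without a known locus are kept as their own entry.
        locus = id_to_locus.get(rna_id, rna_id)
        entry = merged.setdefault(locus, {"go_terms": set(), "alt_ids": set()})
        entry["go_terms"].add(go_term)
        if rna_id != locus:
            entry["alt_ids"].add(rna_id)
    return merged


# The same GO term annotated to both TERC identifiers mentioned above.
annotations = [
    ("URS00004A7003_9606", "GO:0005697"),
    ("URS00004416C5_9606", "GO:0005697"),
]
# Hypothetical locus-level key; in practice this would come from an agreed ID set.
id_to_locus = {
    "URS00004A7003_9606": "EXAMPLE_LOCUS:TERC",
    "URS00004416C5_9606": "EXAMPLE_LOCUS:TERC",
}
merged = collapse_by_locus(annotations, id_to_locus)
print(merged)
```

The two duplicated annotations end up as a single entry, with both RNAcentral IDs retained as alternative IDs rather than as separate annotated objects.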

GO managers meeting or GO meeting? @pgaudet @cmungall?

rachhuntley commented 6 years ago

Not just from UCL: RNAcentral make their own IEA annotations too, and other groups have used RNAcentral IDs as well, including MGI and SGD.

It can also be complicated by the different ways in which Ensembl and NCBI do their gene builds. I can't remember the details, but when I was looking into annotating lncRNAs we couldn't just pick one identifier, because that would mean that our annotations wouldn't appear in either Ensembl or NCBI, depending on which one we didn't pick. This is probably why Nancy annotated both. I went to an Ensembl talk a year or so ago where they said they were working on resolving the problem, but I don't know if any progress has been made. (This apparently is not a problem for the miRNA annotations. I can probably dig out the email I got from NCBI.)

ValWood commented 6 years ago

@rachhuntley that would be good.

Ensembl and NCBI should be able to propagate from any unique ID set we decide upon (in much the same way that they do currently for proteins). We don't annotate to Ensembl and NCBI protein IDs to make the data appear; that should happen via a pipeline, and be attributed to the annotation provider.

cmungall commented 6 years ago

All other resources can propagate from GO, as they do with protein annotation, but it isn't our job to directly annotate every entry in every other database, we should only need to annotate to a single database object.

+1

I think it would be good if curators filed a ticket on this tracker as soon as they see a duplicate. I can see the rationale for duplicating the annotations on both IDs, but in addition to the problems Val mentions, it can confound analyses, and it can make interpretation after the fact difficult (e.g. can we tell if the intention was to indicate that two spliceforms have the same function, or just to ensure ID coverage).

@pgaudet let's bring up at a managers' meeting, I think the topic is also something that should be discussed in Montreal.

cmungall commented 6 years ago

Re: use of HGNC. Is coverage sufficient (too lazy to check..)? If so, I like this option. Could be at annotation time or mapping at release time.

@tonysawfordebi / @alexsign - would it be easy to get HGNC xrefs added to ftp://ftp.ebi.ac.uk/pub/databases/GO/goa/HUMAN/goa_human_rna.gz?

ValWood commented 6 years ago

I asked HGNC for input

ukemi commented 6 years ago

I'm pretty sure MGI uses MGI identifiers. I have asked Dmitry to confirm.

rachhuntley commented 6 years ago

@ukemi but the MGI IDs get mapped to RNAcentral sequences for display in QuickGO.

@ValWood Here are the comments I got from Terence Murphy at NCBI when we were getting the miRNA annotations to appear in NCBI. I can't remember who gave the Ensembl talk, but it might be worth getting their comments too.

Dear Rachael,

We’re already in the process of adding this functionality. I’ll check where we’re at with the implementation.

We map GO terms at the level of GeneIDs, so we won’t be able to distinguish between GO annotations attached to 5p and 3p products, but otherwise this should work for cases where the RNAcentral ID is equivalent to a RefSeq transcript (true for the miRNAs). Do you have a plan for how to handle other non-coding transcripts where RefSeq and Ensembl are different? E.g. XIST NR_001564.2 and ENST00000429829.5 are just slightly different at the 3’ end (by 5 bp, plus a polyA tail on the RefSeq), but have different RNAcentral IDs. Would you add GO terms for both RNAcentral IDs?

miRNAs are generally consistent between Ensembl and RefSeq because we both rely on miRBase for the feature definitions. For other ncRNAs, you’re likely to see more variation since we haven’t had any CCDS-like effort to precisely match annotations. In NCBI Gene we do compute and report Ensembl transcripts that are similar to particular RefSeqs. For example: https://www.ncbi.nlm.nih.gov/gene/7503 in the “Reference Sequences” section you’ll see that ENST00000429829.5 is reported as “Related” to NR_001564.2. This is the best match, but allows for some differences in splicing and UTR lengths. We currently don’t report the degree of similarity.

It may be a moot point, if publications typically don’t report which transcript they’re analyzing. For XIST, NR_001564.2 is the major form but there would be a bit of a leap of faith to attach functional annotations to that specific transcript if the publication isn’t specific. It may ultimately be better to use gene-level identifiers, GeneIDs, Ensembl gene identifiers, nomenclature group identifiers, or locus_tags, if that’s the more typical anchor used in publications. I haven’t read enough of the ncRNA literature to say if that will be necessary to be able to annotate lncRNAs.

alexsign commented 6 years ago

@cmungall we added HGNC xrefs for RNAcentral identifiers into our pipeline. They will be available in the GPI files from the next release, which is in about a week's time.
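Once HGNC xrefs are present in a GPI file, redundant entries could in principle be flagged by grouping RNAcentral IDs on a shared HGNC xref. The sketch below assumes a tab-separated GPI 1.2-style layout, with the object ID in column 2 and pipe-separated xrefs in column 9; the sample rows, IDs, and the helper names `shared_hgnc_xrefs` and `row` are made up for illustration and do not reflect the actual file contents.

```python
from collections import defaultdict
from typing import Dict, Iterable, List


def shared_hgnc_xrefs(gpi_lines: Iterable[str]) -> Dict[str, List[str]]:
    """Group object IDs by HGNC xref and return only xrefs shared by more than one ID."""
    by_hgnc: Dict[str, List[str]] = defaultdict(list)
    for line in gpi_lines:
        # Skip blank lines and header/comment lines, which start with "!".
        if not line.strip() or line.startswith("!"):
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 9:
            continue  # row too short to carry an xref column
        object_id, xrefs = cols[1], cols[8]
        for xref in xrefs.split("|"):
            if xref.startswith("HGNC:"):
                by_hgnc[xref].append(object_id)
    return {hgnc: sorted(ids) for hgnc, ids in by_hgnc.items() if len(ids) > 1}


def row(object_id: str, xrefs: str) -> str:
    """Build one made-up GPI-style row (assumed 10-column layout)."""
    return "\t".join(
        ["RNAcentral", object_id, "", "", "", "ncRNA", "taxon:9606", "", xrefs, ""]
    )


# Placeholder rows; the IDs and xrefs are not real accessions.
sample = [
    "!gpi-version: 1.2",
    row("URS_EXAMPLE_A_9606", "HGNC:EXAMPLE1"),
    row("URS_EXAMPLE_B_9606", "HGNC:EXAMPLE1|ENSEMBL:EXAMPLE"),
    row("URS_EXAMPLE_C_9606", "HGNC:EXAMPLE2"),
]
duplicates = shared_hgnc_xrefs(sample)
print(duplicates)
```

Any HGNC xref returned here points at two or more RNAcentral IDs, which would be candidates for collapsing or for a ticket on this tracker.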

pgaudet commented 2 years ago

Can this be closed?

ValWood commented 2 years ago

RNA entity redundancy has not yet been addressed. These are the same telomerase RNA: http://amigo.geneontology.org/amigo/gene_product/RNAcentral:URS00006F4087_9606 http://amigo.geneontology.org/amigo/gene_product/RNAcentral:URS00004A7003_9606

But I don't know if RNAcentral yet provides a non-redundant ID set (it might be in progress).

@blakesweeney

pgaudet commented 1 year ago

@blakesweeney any update on the Rfam redundancy issue?

blakesweeney commented 1 year ago

Nothing yet, I'm afraid.

ValWood commented 10 months ago

There is still redundancy, 4 copies of human telomerase:

Perhaps an issue for the RNACentral meeting @blakesweeney @alexsign

blakesweeney commented 10 months ago

Thanks for the suggestion, we'll look at having a session on this.

suzialeksander commented 1 week ago

Bumping the issue, as I still see the same results in Val's last screenshot in Sep 23. @blakesweeney are there any updates?

ValWood commented 1 week ago

I believe at the moment this is a 'feature' and will be resolved once RNACentral has unique IDs for RNA entities, so we could probably close this GO tracker ticket.

pgaudet commented 7 hours ago

thanks!

geneontology / go-annotation

redundancy from RNA central #2023