The-Sequence-Ontology / SO-Ontologies

Collection of SO Ontologies
Creative Commons Attribution 4.0 International

Remove or review ncRNA terms that mirror Rfam families #546

Open AntonPetrov opened 3 years ago

AntonPetrov commented 3 years ago

There are several ncRNA terms that refer to specific RNA families and mirror the Rfam RNA family classification, for example:

These terms appear to be too specific. For example, class_I_RNA is a small RNA family restricted to Dictyostelium but in SO it is found at the same level as lncRNA which is a broad class of RNAs with a wide distribution.

Here are other terms with the same problem: spot_42_RNA, class_II_RNA, CsrB_RsmB_RNA, DsrA_RNA, GcvB_RNA, RRE_RNA, RprA_RNA.

I think it would be a good idea to remove these "RNA family" terms from Sequence Ontology.

Alternatively, one could review these families to improve consistency and potentially include other Rfam families (which may not work well as Rfam contains ~4,000 families).

These improvements would directly benefit @RNAcentral annotations, as RNAcentral uses Sequence Ontology as the main source of RNA types and relationships between them.

Any help will be greatly appreciated! Please let me know if you need any further information.

CC @blakesweeney @sjm41

AntonPetrov commented 3 years ago

A quick update following today's meeting:

For example, class_II_RNA was created in 2006 (RSC).

sjm41 commented 2 years ago

It's not very scientific, but I just did a quick google search with the SO IDs for the 10 terms mentioned above:

OxyS_RNA (SO:0000384) - only relevant hits are sequenceontology.org & wikipedia entry
MicF_RNA (SO:0000383) - only relevant hits are sequenceontology.org & wikipedia entry
spot_42_RNA (SO:0000389) - only relevant hits are sequenceontology.org & wikipedia entry
class_I_RNA (SO:0000990) - only relevant hits are sequenceontology.org & RSC page
class_II_RNA (SO:0000989) - only relevant hit is sequenceontology.org
CsrB_RsmB_RNA (SO:0000377) - only relevant hits are sequenceontology.org & wikipedia entry & ChickpeaMine entry (which says "This SOTerm isn't in any lists")
DsrA_RNA (SO:0000378) - only relevant hits are sequenceontology.org & wikipedia entry & one Rfam alignment
GcvB_RNA (SO:0000379) - only relevant hits are sequenceontology.org & wikipedia entry & HandWiki entry
RRE_RNA (SO:0000388) - only relevant hits are sequenceontology.org & one Rfam alignment
RprA_RNA (SO:0000387) - only relevant hits are sequenceontology.org & wikipedia entry & one Rfam alignment

So, not really any support from that approach for keeping these terms. (Doing the same search with other ncRNA SO IDs tended to give more hits, often to MODs.)

egchristensen commented 2 years ago

When it comes to a term's history, we really only have the current set of GitHub issues and the deprecated term tracker on SourceForge. There is a wiki, but I don't think it'll be much help here. If we can't find relevant history on GitHub or SourceForge, then we'll probably need to reach out to individual stakeholders to confirm. To prepare for such a review, we'll need to compile a list of all relevant SO terms that we want to look at here. Does the list of terms we need to evaluate go much beyond what you've written above, @sjm41?

sjm41 commented 1 year ago

@egchristensen Yes, the list of 10 SO terms above is the complete list of Rfam-mirrored SO terms we think should be obsoleted (though there are other snoRNA terms suggested for obsoletion in #576).

Do you have any ideas for how to reach out to "individual stakeholders" (beyond the person/group who originally requested the term)? I couldn't think of a good method to try to find current usage of SO terms, other than googling...

sjm41 commented 1 year ago

I just systematically searched through the old sourceforge tracker, the current GitHub tracker and the ontology file for clues to the origin/requester of these 10 SO terms. I only found information for 2 terms:

class_I_RNA (SO:0000990) - Requested by Karen Pilcher - Dictybase (2006) https://sourceforge.net/p/song/term-tracker/2/
class_II_RNA (SO:0000989) - Requested by Karen Pilcher - Dictybase (2006) https://sourceforge.net/p/song/term-tracker/2/

egchristensen commented 1 year ago

Let's go ahead and obsolete these terms and redirect folks to Rfam in a comment. Could I ask you to identify the proper Rfam IDs please, @sjm41?

egchristensen commented 1 year ago

> @egchristensen Yes, the list of 10 SO terms above is the complete list of Rfam-mirrored SO terms we think should be obsoleted (though there are other snoRNA terms suggested for obsoletion in #576).
>
> Do you have any ideas for how to reach out to "individual stakeholders" (beyond the person/group who originally requested the term)? I couldn't think of a good method to try to find current usage of SO terms, other than googling...

@sjm41 Part of what I'd like to do during my time as a PhD student is figure out a good way to represent current usage of terms along with their gradual evolution/refinement over time. I'm not sure if that'll be through NLP of literature, maybe Twitter, or something else, but that would at least get us part of the way. In the meantime, we've just been working with engaged stakeholders (e.g. the main annotation repositories), PubMed, or folks who are submitting GitHub tickets to assess whether or not we can obsolete a term. I would love to be more systematic with obsoleting terms, but we're still working on good ways to go about that.

sjm41 commented 1 year ago

Hi @egchristensen

Most of the corresponding Rfam IDs are already in the SO entries as xrefs/attributions, but here they are for the record:

OxyS_RNA (SO:0000384) = RF00035
MicF_RNA (SO:0000383) = RF00033
spot_42_RNA (SO:0000389) = RF00021
class_I_RNA (SO:0000990) = RF01414
class_II_RNA (SO:0000989) = [not in RFAM]
CsrB_RsmB_RNA (SO:0000377) = RF00018
DsrA_RNA (SO:0000378) = RF00014
GcvB_RNA (SO:0000379) = RF00022
RRE_RNA (SO:0000388) = RF00036
RprA_RNA (SO:0000387) = RF00034
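
In case it helps whoever makes the edit, here is a rough Python sketch of how the mapping above could be turned into the obsoletion comments that redirect to Rfam. The SO and Rfam IDs are copied from this comment. Everything else is an assumption for illustration: the helper names `obsoletion_comment` and `obsolete_tags` are hypothetical, the comment wording is made up, and the sketch only emits the standard OBO `comment:` and `is_obsolete: true` tags without touching the ontology file. It is not a description of how SO edits are actually made.

```python
# SO ID -> (term name, Rfam ID); values copied from the list in this comment.
# class_II_RNA has no Rfam family, so its Rfam ID is None.
SO_TO_RFAM = {
    "SO:0000384": ("OxyS_RNA", "RF00035"),
    "SO:0000383": ("MicF_RNA", "RF00033"),
    "SO:0000389": ("spot_42_RNA", "RF00021"),
    "SO:0000990": ("class_I_RNA", "RF01414"),
    "SO:0000989": ("class_II_RNA", None),
    "SO:0000377": ("CsrB_RsmB_RNA", "RF00018"),
    "SO:0000378": ("DsrA_RNA", "RF00014"),
    "SO:0000379": ("GcvB_RNA", "RF00022"),
    "SO:0000388": ("RRE_RNA", "RF00036"),
    "SO:0000387": ("RprA_RNA", "RF00034"),
}


def obsoletion_comment(so_id: str) -> str:
    """Build a redirect comment for one term (wording is an assumption)."""
    name, rfam_id = SO_TO_RFAM[so_id]
    if rfam_id is None:
        return f"{name} ({so_id}) was obsoleted; it has no corresponding Rfam family."
    return f"{name} ({so_id}) was obsoleted; see Rfam family {rfam_id} instead."


def obsolete_tags(so_id: str) -> list[str]:
    """Return the OBO tag lines that would be added to the term's stanza."""
    return [f"comment: {obsoletion_comment(so_id)}", "is_obsolete: true"]


if __name__ == "__main__":
    # Print the proposed tags for every term so they can be reviewed by hand.
    for so_id in SO_TO_RFAM:
        print(f"id: {so_id}")
        for line in obsolete_tags(so_id):
            print(line)
        print()
```

The output is meant to be reviewed by hand before anything is pasted into the ontology; class_II_RNA gets a comment without a redirect because it has no Rfam family.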