The-Sequence-Ontology / SO-Ontologies

Collection of SO Ontologies
Creative Commons Attribution 4.0 International

Remove 'sncRNA' and 'sncRNA_gene' grouping classes #572

Open sjm41 opened 2 years ago

sjm41 commented 2 years ago

Please remove 'sncRNA' (SO:0002247) and 'sncRNA_gene' (SO:0002342) grouping classes.

These 'small non-coding RNA' terms are defined simply based on length ("a ncRNA less than 200 nucleotides") - essentially, "not a lncRNA" (based on the current def of that term). This isn't a useful/workable definition or grouping, creating lots of exceptions and edge cases.

A secondary problem is that having this grouping term effectively shifts many other core grouping terms (e.g. tRNA, snoRNA, snRNA) down one level compared to 'lncRNA' and 'rRNA', making for an unhelpful visualisation of the ncRNA hierarchy/tree.

This proposal reflects a consensus amongst members of the RNAcentral consortium.

murphyte commented 2 years ago

Note that while I did comment on the RNAcentral doc that I'm not a fan of the sncRNA term, getting rid of it does pose a different problem. The original request for the term came from me, on #485. The problem is that without it, there's a set of short ncRNAs that defy classification. Sure, you can just call them ncRNA, but then you can't tell whether that was simply a matter of laziness (it belongs to a known more specific class, but wasn't classified) or the lack of a suitable child feature type.

And realistically, if sncRNA is a bad term, then lncRNA is equally bad.

We could shift the children up a level, and redefine sncRNA as "small ncRNA less than 200 nucleotides that doesn't belong to another ncRNA class"

sjm41 commented 2 years ago

Hi @murphyte Apologies if I over-interpreted your comments on the discussion doc and libeled your previous #485 ticket! :-(

> We could shift the children up a level, and redefine sncRNA as "small ncRNA less than 200 nucleotides that doesn't belong to another ncRNA class"

I think I'd be happy with that suggestion. Then anything that is directly annotated to 'sncRNA' will remain with that annotation and thus benefit from the extra info that it's an unspecified "short" (rather than a "long") ncRNA - sounds like you have significant examples of this. At the same time, direct annotations to tRNA, snoRNA, snRNA (etc) will be maintained but uncoupled from a strict length dependency, whilst also becoming more 'visible' in the hierarchy (same level as rRNA, lncRNA and sncRNA).

(Note that we are also trying to come up with a better definition of lncRNA than the current "A non-coding RNA over 200 nucleotides in length", which defines them solely by length and creates similar exception/consistency issues...)

murphyte commented 2 years ago

To be honest, I forgot I was even the one who asked for this when I commented on the Google doc! We haven't gotten around to using this term, mostly through laziness. The way our code is set up, there's some risk of us annotating a lncRNA feature that is <200 nt. And honestly I'm fine with that, since the size cutoff annoys me and there's no reason to think that non-descript non-coding RNAs that are generally polII-expressed, poly-adenylated, and often spliced are inherently different if they're longer or shorter than 200 nt.

The whole child and definition issue is also a bit of a problem. First, there are papers like https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6046292/ that use sncRNA to specifically include tRNA, snoRNA, miRNA, etc. Second, 731 experiments come up in SRA with a search for sncRNA. That's not much by any means, but it does tend to get used for referring to size-fractionation sequencing going after the <200 nt class, and presumably includes the specific types in addition to any non-classified sncRNAs.

So really the main issues are how applying a size cutoff in a definition is incredibly unsatisfying (and applies to both sncRNA and lncRNA), and that having different levels of binning makes the more specific types less noticeable in the web view, and could result in some users not pulling the types because all they do is pull one level of child features under ncRNA. Is that a good requirement for an ontology?
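To make the "one level of child features" concern concrete, here is a minimal sketch. It assumes a toy excerpt of the ncRNA branch as described in this thread (not the full SO), and the helper names `direct_children` and `descendants` are hypothetical:

```python
# Toy excerpt of the ncRNA branch as described in this thread.
# This is an assumption for illustration, not the full Sequence Ontology.
PARENT_OF = {
    "lncRNA": "ncRNA",
    "rRNA": "ncRNA",
    "sncRNA": "ncRNA",
    "tRNA": "sncRNA",
    "snoRNA": "sncRNA",
    "snRNA": "sncRNA",
    "small_regulatory_ncRNA": "sncRNA",
}


def direct_children(term):
    """Terms whose immediate parent is `term` (a one-level query)."""
    return sorted(child for child, parent in PARENT_OF.items() if parent == term)


def descendants(term):
    """All terms below `term`, at any depth."""
    found = []
    for child in direct_children(term):
        found.append(child)
        found.extend(descendants(child))
    return sorted(found)


# A one-level query under ncRNA sees sncRNA but not tRNA, snoRNA or snRNA.
print(direct_children("ncRNA"))
# A full traversal picks up every specific type.
print(descendants("ncRNA"))
```

A tool that only runs the one-level query would need to know about the sncRNA grouping class to find tRNA; a tool that walks the whole subtree is unaffected by the extra level.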

If lncRNA wasn't so well established, I'd say get rid of both of them and come up with a different term. I don't know what to do about it. Maybe I'll have some inspiration and weigh in more later.

sjm41 commented 2 years ago

I think the main problem is that the current SO has both these terms (and problematic defs):

sncRNA: A non-coding RNA less than 200 nucleotides in length.
lncRNA: A non-coding RNA over 200 nucleotides in length.

This creates an apparent dichotomy where all ncRNAs could/should be annotated under one of these two terms, leading to consistency problems both within the ontology (parent-child relationships) and/or for annotation of individual ncRNAs. E.g. most/all 28S rRNAs are >200nt and 5S rRNAs <200nt, but I doubt anyone would say the corresponding SO terms should be children of 'lncRNA' and 'sncRNA', respectively. Equally, it would currently be correct to co-annotate a 28S rRNA with both the 28S and the lncRNA SO terms, and a 5S rRNA with the 5S and sncRNA SO terms (though I wouldn't do so!).
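The dichotomy can be sketched by applying the two length-only definitions literally. This is an illustration only: the function name `length_terms` is hypothetical and the example lengths are round-number assumptions, not measured values.

```python
CUTOFF_NT = 200  # cutoff shared by the two original definitions


def length_terms(length_nt):
    """Return the length-based SO terms whose original definition
    is satisfied by an ncRNA of the given length."""
    terms = []
    if length_nt < CUTOFF_NT:  # "less than 200 nucleotides in length"
        terms.append("sncRNA")
    if length_nt > CUTOFF_NT:  # "over 200 nucleotides in length"
        terms.append("lncRNA")
    return terms


# Illustrative lengths (assumed round numbers).
examples = {"28S rRNA": 5000, "5S rRNA": 120, "boundary case": 200}
for name, length_nt in examples.items():
    print(name, length_nt, length_terms(length_nt))
```

Read literally, the definitions make the 28S rRNA an lncRNA and the 5S rRNA an sncRNA, and an ncRNA of exactly 200 nt falls under neither term.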

As you say, we can't get rid of the lncRNA term, but a better definition would help, perhaps: "A non-coding RNA over 200 nucleotides in length that cannot be classified as any other ncRNA subtype. Similar to mRNAs, lncRNAs are mainly transcribed by RNA polymerase II, are often capped by 7-methyl guanosine at their 5' ends, polyadenylated at their 3' ends and may be spliced. (PMID: 33353982)"

Given 'sncRNA' hasn't been used by NCBI, meaning the SO term is currently a grouping class in the SO rather than used for direct annotation, I think the original suggestion of getting rid of that term may still be the best (least worst) solution here. Having just a single length-based term in the SO (lncRNA), with the new definition above, should solve the current false dichotomy and multiple parentage/co-annotation issues, I think.

Only downside I see is that something that could be annotated as 'sncRNA' would have to be annotated to the parent 'ncRNA', but I don't see that as a significant issue.

==

FYI, here are two quotes from recent-ish papers on the classification problem:

https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3851203/ (2013)

> Attempts to untangle the complex landscape of ncRNAs have led to crude classification of ncRNAs based on their length (small, 18-31nt; medium, 31-200nt; and long, >200nt) [11], function (housekeeping ncRNAs such as ribosomal (rRNAs), transfer RNAs (tRNAs)), regulatory potential (microRNAs (miRNAs), long non-coding RNAs (lncRNAs)) [12], and subcellular localization (small nuclear RNAs (snRNAs), small nucleolar RNAs (snoRNAs), cytoplasm-located piwi-interacting RNAs (piRNAs), and short interfering RNAs (siRNAs)). Other unusual ncRNA species such as trans-spliced transcripts, macroRNAs that encompass enormous genomic distances, and multi-gene transcripts that encompass several genes or even the whole chromosome further confound efforts for systematic classification [13-15]. In reality, however, clear categorization of ncRNA classes has been quite difficult, as many ncRNA transcripts often share the properties of multiple categories.

https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6497742/ (2019)

> Currently ncRNAs can be defined by length – small 18–200 nts and long >200nts – or functionality with housekeeping ncRNAs such as ribosomal RNAs (rRNAs) and transfer RNAs (tRNAs) or regulatory ncRNAs like microRNAs (miRNAs), small nuclear RNAs (snRNAs), piwi-interacting RNA (piRNAs), tRNA derived small RNAs (tsRNAs) and long non-coding RNAs (lncRNAs) (Dozmorov et al., 2013). Nonetheless, difficulty distinguishing categories persists due to the crossover of properties.

Interesting (frustrating) that both these quotes give 'lncRNA' as an example of an ncRNA class based on function rather than length, but then don't go on to provide a 'functional definition' of the term...

murphyte commented 2 years ago

How about describing lncRNAs as "generally longer than 200 nt", leaving the door open to use it for shorter ncRNAs that nonetheless fall in the same bucket of ncRNAs not classified as anything else?

sjm41 commented 2 years ago

Works for me. Then we can annotate that pesky 199nt ncRNA as an lncRNA after all :-)

egchristensen commented 2 years ago

@keilbeck any comments on the current size cut-off for snc/lncRNAs?

keilbeck commented 2 years ago

It seems like sncRNA is still a thing - https://www.nature.com/articles/srep20126. Could we remove the size from the definition to remove the false dichotomy? Or say something like: typically <100 nucleotides long?

Happy to change the definition of lncRNA.

sjm41 commented 2 years ago

Note that we have already updated the lncRNA definition in #575 to this:

> A non-coding RNA generally longer than 200 nucleotides that cannot be classified as any other ncRNA subtype. Similar to mRNAs, lncRNAs are mainly transcribed by RNA polymerase II, are often capped by 7-methyl guanosine at their 5' ends, polyadenylated at their 3' ends and may be spliced.

So I think that aspect is done.

sjm41 commented 2 years ago

Current SO defs for sncRNA/sncRNA_gene are:

sncRNA (SO:0002247): A non-coding RNA less than 200 nucleotides in length.
sncRNA_gene (SO:0002342): A ncRNA_gene that encodes an ncRNA less than 200 nucleotides in length.

So we could change these to the following to match the current lncRNA/lncRNA_gene defs:

sncRNA (SO:0002247): A non-coding RNA generally shorter than 200 nucleotides.
sncRNA_gene (SO:0002342): A gene that encodes a short non-coding RNA.

Outstanding question (mentioned above) is whether the sncRNA def should include the "...that cannot be classified as any other ncRNA subtype" bit or not. That is, should sncRNA continue to be the parent of snRNA, snoRNA, tRNA and small_regulatory_ncRNA, or not?
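For comparison, the "shift the children up a level" option can be sketched as a re-parenting step. This is a minimal sketch on a toy excerpt of the hierarchy taken from this thread; `promote_children` and `depth` are hypothetical helpers, not part of any SO tooling.

```python
# Toy excerpt of the current ncRNA branch (assumption, not the full SO).
CURRENT = {
    "lncRNA": "ncRNA",
    "rRNA": "ncRNA",
    "sncRNA": "ncRNA",
    "tRNA": "sncRNA",
    "snoRNA": "sncRNA",
    "snRNA": "sncRNA",
    "small_regulatory_ncRNA": "sncRNA",
}


def promote_children(parent_of, term):
    """Return a new child-to-parent map in which the children of `term`
    are attached to the parent of `term` instead; `term` itself is kept."""
    grandparent = parent_of[term]
    return {
        child: (grandparent if parent == term else parent)
        for child, parent in parent_of.items()
    }


def depth(parent_of, term, root="ncRNA"):
    """Number of parent links between `term` and `root`."""
    steps = 0
    while term != root:
        term = parent_of[term]
        steps += 1
    return steps


proposed = promote_children(CURRENT, "sncRNA")
for term in ("tRNA", "snoRNA", "lncRNA", "sncRNA"):
    print(term, depth(CURRENT, term), depth(proposed, term))
```

Under this option sncRNA stays in the ontology as a childless sibling of tRNA, snoRNA, snRNA, rRNA and lncRNA, which matches the "cannot be classified as any other ncRNA subtype" wording; keeping the current map corresponds to leaving sncRNA as a grouping class.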