The-Sequence-Ontology / SO-Ontologies

Collection of SO Ontologies
Creative Commons Attribution 4.0 International
fix def SO:0001877 lncRNA #611

Open ValWood opened 1 year ago

ValWood commented 1 year ago

What is the SO term name and accession?

SO:0001877 lncRNA

non-coding RNA generally longer than 200 nucleotides that cannot be classified as any other ncRNA subtype.

Describe what you would like to change.

Please remove the word "generally".

In contrast, sncRNA is defined as:

A non-coding RNA less than 200 nucleotides in length.

This means that, logically, an lncRNA can never be less than 200 nt (or it would be an sncRNA), and the word "generally" just makes the definition imprecise.
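To make the logical point concrete, here is a minimal Python sketch. It assumes that sncRNA counts as one of the "other ncRNA subtypes" excluded by the lncRNA definition; the function names and the `specific_subtype` parameter are hypothetical and are not part of SO.

```python
from typing import Optional

# From the sncRNA definition quoted above: "less than 200 nucleotides in length".
SNCRNA_MAX_NT = 200


def is_sncrna(length_nt: int) -> bool:
    """True if the length alone satisfies the quoted sncRNA definition."""
    return length_nt < SNCRNA_MAX_NT


def is_lncrna(length_nt: int, specific_subtype: Optional[str] = None) -> bool:
    """lncRNA as 'cannot be classified as any other ncRNA subtype'.

    Assumption (not stated in SO): sncRNA itself counts as another subtype,
    so anything under 200 nt is already classifiable and is excluded.
    """
    if specific_subtype is not None:
        return False  # e.g. a known snoRNA or tRNA is never an lncRNA
    return not is_sncrna(length_nt)


# Every length that this reading allows to be called an lncRNA.
lengths_called_lncrna = [n for n in range(1, 1000) if is_lncrna(n)]
```

Under these assumptions no length below 200 nt ends up in `lengths_called_lncrna`, so the word "generally" has nothing left to qualify.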

@sjm41 FYI

murphyte commented 1 year ago

See the discussion on https://github.com/The-Sequence-Ontology/SO-Ontologies/issues/572. The proposal there is to remove sncRNA completely, and include the word "generally" in the lncRNA definition because the lncRNA is used more for the "cannot be classified as any other ncRNA subtype" aspect of the definition and it just so happens that most (but not all) RNAs that can't be otherwise categorized are above 200 nt. The size threshold introduces complexity in the SO hierarchy that isn't manageable or useful.

ValWood commented 1 year ago

I am not seeing that conclusion in the ticket?

I am not a fan of the arbitrary snc/lnc size distinction, but:

  1. the snc/lncRNA distinction does seem to be adopted by the ncRNA community.
  2. This solution is problematic in that it leaves the issue of small non-coding RNAs that do not fit sn/snoRNA, which it would seem very strange to call "lncRNAs".
  3. Including the word 'generally' in a term seems problematic because the definition is not precise. It would be better to exclude any size qualifier and just say "a non-coding RNA that cannot be classified as any other ncRNA subtype" because in this scenario the information about size becomes irrelevant. Or to add this after the differentia as a comment to explain clearly that lncRNAs include ncRNAs under 200 nt that cannot be classified as a type of sncRNA (but that sounds a bit bonkers)

Anyway, the community classification is not good, but the solution here isn't great because putative sn and sno RNAs will be classified as lncRNAs.

sjm41 commented 1 year ago

(original ticket to improve lncRNA definition is #575.)

@blakesweeney just alerted me to this new 'consensus statement' paper that tries to define and classify lncRNAs better: https://www.nature.com/articles/s41580-022-00566-8

Relevant passages:

lncRNAs have the unfortunate distinction of being named for what they are not, rather than what they are.

lncRNAs have been arbitrarily defined as non-coding transcripts of more than 200 nucleotides (200 nt), which is a convenient size cut-off in biochemical and biophysical RNA purification protocols that deplete most infrastructural RNAs, such as 5S rRNAs, tRNAs, snRNAs and snoRNAs, as well as miRNAs, siRNAs and piRNAs. This definition also excludes some other well-known short RNAs such as the primate-specific snaRs (~80–120 nt), which associate with nuclear factor 90; Y RNAs (~100 nt), which act as scaffolds for ribonucleoprotein (RNP) complexes; vault RNAs (88–140 nt), which are involved in transferring extracellular stimuli into intracellular signals; and promoter-associated RNAs and non-canonical small RNAs produced by post-transcriptional processing. Other non-coding RNAs lie close to the 200-nt border, such as 7SK (~330 nt in vertebrates), which controls transcription poising and termination, including at enhancers, and 7SL (~300 nt), which is an integral component of the signal recognition particle that targets proteins to cell membranes and the evolutionary ancestor of the widespread primate Alu (~280 nt) and rodent B1 (~135 nt) small interspersed nuclear elements.

Given this grey zone of sizes, we support the suggestion that non-coding RNAs be divided into three categories: (1) small RNAs (less than 50 nt); (2) RNA polymerase III (Pol III) transcripts (such as tRNAs, 5S rRNA, 7SK, 7SL, and Alu, vault and Y RNAs), Pol V transcripts in plants and small Pol II transcripts such as (most) snRNAs and intron-derived snoRNAs; (3) lncRNAs (more than 500 nt), which are mostly generated by Pol II.

In the absence of more specific categorization, we recommend retention of the general descriptor ‘lncRNA’ for non-coding RNAs greater than 500 nt in length.

I'm not sure yet if some/all of these proposals would work well within the SO....
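As a rough illustration of how the size cut-offs in the passages quoted above compare with the current 200 nt convention, here is a short Python sketch. The function names and the "intermediate" label are hypothetical. The paper defines its second category mainly by polymerase, so treating everything from 50 to 500 nt as a single middle bin is an assumption made only for this example.

```python
def consensus_size_bin(length_nt: int) -> str:
    """Bin a transcript length using the cut-offs in the quoted passage."""
    if length_nt < 50:
        return "small RNA"      # category (1): less than 50 nt
    if length_nt > 500:
        return "lncRNA"         # category (3): more than 500 nt
    return "intermediate"       # assumed middle bin, 50 to 500 nt inclusive


def is_lncrna_by_200_cutoff(length_nt: int) -> bool:
    """The current convention: longer than 200 nt."""
    return length_nt > 200


# Approximate lengths taken from the quoted passage.
examples = {"Y RNA": 100, "7SL": 300, "7SK": 330}
bins = {name: consensus_size_bin(length) for name, length in examples.items()}
```

With these numbers, 7SL and 7SK pass the 200 nt cut-off but land in the middle bin under the 500 nt cut-off, and an otherwise unclassified 300 nt transcript would likewise stop being an lncRNA.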

ValWood commented 1 year ago

It also has the potential to muddy the waters by bringing in two more arbitrary size classifiers. Why 500? Why 50? What about those in between?

Fission yeast researchers routinely use a 200 nt cut-off for lncRNA. I'm all for better classification, but I don't think we should jump to this one.

Classifying by polymerase might be OK, but aren't some (or all) 5S RNA, snRNA and snoRNA transcribed by Pol II?

murphyte commented 1 year ago

Thanks for that reference!

The wording is fuzzy, but I think bin (2) is intended to be defined by size 50-500 nt, including what's listed but also (not clearly said) other small Pol II transcripts in that size range. Try adding a comma: "and small Pol II transcripts ,comma, such as (most) snRNAs and intron-derived snoRNAs." Moving the threshold up to 500 does work to include nearly all of the shortish ncRNAs of well-characterized types, but increases the issue for lncRNA vs not-as-long lncRNA (nalncRNA?), and probably creates some headaches for RNA types that fall in the 40-60 nt range.

Trying to fit this into the SO hierarchy, with 1-50, 50-500, and 501+ nt parent terms, would still be challenging. There would be some benefit to having a parent term for the set of known short ncRNA types one is likely to isolate with a short size fractionation, but some types would need to have multiple size parents which complicates traversing the hierarchy. So I don't know if there's really a benefit.
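The multiple-parent problem can be sketched in a few lines of Python. This is only an illustration: the bin boundaries are made non-overlapping (1-50, 51-500, 501+) as an assumption, `size_parents` is a hypothetical helper, and the 40-60 nt type is invented for the example. The vault RNA range is the one given in the quoted paper.

```python
import math

# Hypothetical size-based parent terms, made non-overlapping for the example.
SIZE_BINS = [
    ("1-50 nt", 1, 50),
    ("51-500 nt", 51, 500),
    ("501+ nt", 501, math.inf),
]


def size_parents(min_nt: int, max_nt: int) -> list:
    """Return every size bin that overlaps the length range of an RNA type."""
    return [
        name
        for name, low, high in SIZE_BINS
        if min_nt <= high and max_nt >= low
    ]


# Vault RNAs are quoted as 88-140 nt, so they fit under one parent.
vault_parents = size_parents(88, 140)

# An invented type spanning 40-60 nt would need two size parents.
straddling_parents = size_parents(40, 60)
```

Here `vault_parents` contains a single bin, while `straddling_parents` contains both "1-50 nt" and "51-500 nt", which is the situation that complicates traversing the hierarchy.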

The term lncRNA is very entrenched in the literature, and there isn't much in the <200-nt range that isn't defined by other terms. I continue to think that dropping sncRNA entirely and allowing the minimum size for lncRNA to be fuzzy comes close to typical usage while de-emphasizing size as the defining characteristic, when lncRNAs are really just a catch-all for "not otherwise defined".