The-Sequence-Ontology / SO-Ontologies

Collection of SO Ontologies
Creative Commons Attribution 4.0 International

Clarification on definition of gene #495

Open cmungall opened 4 years ago

cmungall commented 4 years ago

Currently the definition of gene (http://purl.obolibrary.org/obo/SO_0000704) is: "A region (or regions) that includes all of the sequence elements necessary to encode a functional transcript. A gene may include regulatory regions, transcribed regions and/or other functional sequence regions." http://www.sequenceontology.org/browser/current_svn/term/SO:immuno_workshop

The URL doesn't resolve.

I realize this is a whole can of worms and that finding the perfect definition is hard, but it may be useful to provide additional examples of what is and is not meant.

I seem to recall SO previously having a definition that was more precise and operational, and that specifically mentioned that in the case of coding genes there would be no in-frame overlaps. Did I hallucinate this?

I don't want to re-tread old ground if the definition was agreed upon by a wide variety of people, but it would be useful to have links out to other documentation (e.g. how gene is applied in GFF3, how start/end is determined, how uniqueness is determined).

E.g. https://github.com/The-Sequence-Ontology/Specifications/blob/master/gff3.md#the-canonical-gene

(which specifically excludes TFBSs)

I'm particularly interested in how to apply SO's definition of gene in the case of RNA viruses such as SARS-CoV-2. Do nsp1, nsp2, etc. come from distinct genes? Are pps 1a and 1ab distinct genes? Or is there a single gene that encompasses 1a and 1ab? More on how this genome is treated by gene/protein databases here: https://douroucouli.wordpress.com/2020/08/05/what-is-the-sars-cov-2-molecular-parts-list/

davidwsant commented 3 years ago

Hi @cmungall,

I can understand wanting to update the definition of gene. As discussed with Colin Logie for the BBA paper, the SO gene includes the regulatory regions that are inherited as well. The phrase 'all of the sequence elements necessary' does sound ambiguous, as it might include proteins (transcription factors). Is there a specific wording that you would like to use in place of this?

Of note, I am not familiar enough with the GFF3 format to change this. However, the definition in SO has included regulatory regions of a gene for several years and this was covered in the BBA manuscript, so I believe the definition should continue to include regulatory regions. We will need another discussion to determine whether GFF3 needs updates. I will bring it up to the group, but if changes are needed they are likely to go into the GFF3 GitHub page and will be addressed by someone more familiar with updating the GFF3 format.

Thanks,

Dave

murphyte commented 3 years ago

I wouldn't say that usage in GFF3 precludes inclusion of regulatory sequences. While common usage is to define the gene range based on the maximal transcript range, that's partially a matter of convenience and lack of information about promoter elements. promoter (SO:0000167) and regulatory_region (SO:0005836) are children of gene, so it's currently valid to include them. And fun tidbit: there's nothing in the GFF3 spec that says a parent feature location needs to encompass the location(s) of its children.
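To make the last point concrete, here is a minimal Python sketch that lists child features whose coordinates fall outside their parent's range. The records in `EXAMPLE` are hypothetical (they are not taken from the GFF3 spec or any real annotation), and the helper names `parse_gff3` and `children_outside_parent` are made up for illustration. It assumes all features sit on the same sequence and ignores strand and multi-line features.

```python
# Hypothetical GFF3 records: a gene whose promoter child lies upstream
# of the gene's own start coordinate.
EXAMPLE = "\n".join([
    "##gff-version 3",
    "\t".join(["chr1", ".", "gene", "1000", "9000", ".", "+", ".", "ID=gene1"]),
    "\t".join(["chr1", ".", "promoter", "800", "999", ".", "+", ".",
               "ID=prom1;Parent=gene1"]),
    "\t".join(["chr1", ".", "mRNA", "1050", "9000", ".", "+", ".",
               "ID=mrna1;Parent=gene1"]),
])


def parse_gff3(text):
    """Parse 9-column GFF3 feature lines into dicts (comments are skipped)."""
    features = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        cols = line.split("\t")
        if len(cols) != 9:
            continue
        # Column 9 holds semicolon-separated tag=value attributes.
        attrs = dict(p.split("=", 1) for p in cols[8].split(";") if "=" in p)
        features.append({
            "type": cols[2],
            "start": int(cols[3]),
            "end": int(cols[4]),
            "attrs": attrs,
        })
    return features


def children_outside_parent(features):
    """Return (child_id, parent_id) pairs where the child extends past the parent."""
    by_id = {f["attrs"]["ID"]: f for f in features if "ID" in f["attrs"]}
    outside = []
    for f in features:
        # Parent may list several comma-separated IDs.
        for pid in f["attrs"].get("Parent", "").split(","):
            parent = by_id.get(pid)
            if parent and (f["start"] < parent["start"] or f["end"] > parent["end"]):
                outside.append((f["attrs"].get("ID"), pid))
    return outside


print(children_outside_parent(parse_gff3(EXAMPLE)))
```

A check like this only reports such cases; whether they should count as errors depends on the convention a given annotation group follows.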

> (which specifically excludes TFBSs)

The canonical gene example includes a TF_binding_site feature (line 3).

I don't have any suggestions on how to improve the definition, other than to say it's a thorny subject and gene has such varied use that any changes to make it more restrictive or specific are likely to be incompatible with current usage.