Gene and Gene Group / Locus Glyph

shyambhakta commented 3 years ago

It has historically been common to simplify the depiction of an entire gene, operon, or gene cluster as a single glyph that is equivalent in form to that of the current SBOL CDS glyph.

Wild-type resistance genes, e.g., Kanᴿ and Chlᴿ (kanamycin and chloramphenicol resistance) are labeled with the phenotype imparted, but popularly shown with just a CDS glyph (Fig A). The inclusion of more detailed promoter, RBS, and terminator glyphs makes for undesired clutter, and I find it very rare that these control elements are annotated precisely even in sequence files. As resistance is a phenotype, it can't technically be the label for just a CDS, which would instead require the CDS names, e.g., aphA1/nptI/aphA2/nptII or cat for these resistance genes. Many possible CDSs may encode proteins that confer the same resistance phenotype, but because these CDS names are less familiar, often not even known by the users, it's often preferred to write the more practical resistance phenotype.
This concept often extends to transcription factor genes, e.g. lacI^q which is a stronger promoter mutant of the wild-type lacI gene, and thus cannot describe just a CDS (Fig A). Another case is a constitutive arabinose transcriptional activator + transporter operon araCE, which may be represented by just a single CDS glyph labeled ara or araCE. It's not clear that an SBOL CDS glyph can be used more broadly for such whole genes.
The concept also often extends to larger operons and gene clusters, e.g., native cytochrome c maturation gene cluster cloned into a plasmid under synthetic control (Fig A). But instead of taking a bunch of space with eight CDSs in order labeled ccmA ccmB ccmC ccmD ccmE ccmF ccmG ccmH, one may want to simplify it to the equivalent of a single CDS glyph labeled ccm or ccmA–H or ccmABCDEFGH, which represents the whole translational unit, or possibly the whole operon / gene cluster, from promoter to terminator, especially when using the native control elements. Another example: sets of plasmid partitioning genes used to stabilize plasmid inheritance are often named with the prefix par, and their operon(s) (e.g. parABCD) and their associated promoters/terminators are often simplified to the equivalent of just one CDS glyph labelled par, in the primary direction of transcription if there is one (Fig A). Users see it and know "this is the partitioning locus" without caring about the finer details.
Even more abstractly, the CDS glyph is used to represent a genetic locus, with the direction of the arrow representing the direction of transcription of a key gene, as in the sole replicator protein in many replication origins (Fig B–D), or perhaps the direction of DNA replication if unidirectional from an origin, or even an arbitrary direction, which is at least useful in establishing directionality of the corresponding defined sequence, something that a rotationally-symmetric origin glyph "○" doesn't capture. For any arbitrary locus, one ought to be able to refer to nucleotide indices within it, with a corresponding SBOL glyph making evident where the beginning and end is, and the glyph length ought to be able to be scaled for longer and shorter loci, things that SBOL glyphs ought to have options for in representing, for example, a replication origin or future glyphs for other specific kinds of genetic loci. If I were to say the helicase-binding box of pSC101 is within the first 100 nt of the sequence, Fig B doesn't make it clear relatively where that is, whereas Fig D does and Fig C would if it were described as in the oriV.
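To make the indexing point concrete, here is a minimal Python sketch. It is not part of SBOL Visual; the `Locus` class and every coordinate in it are hypothetical and chosen only for illustration. It shows that "the first 100 nt" of a locus maps to different positions on the host sequence depending on the orientation the glyph conveys.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Locus:
    """A directional locus placed on a larger sequence (hypothetical helper)."""
    name: str
    start: int     # 0-based top-strand index of the leftmost base of the locus
    length: int    # locus length in nt
    forward: bool  # True if the locus reads left-to-right on the top strand

    def to_absolute(self, local_index: int) -> int:
        """Map a 0-based index within the locus to a top-strand index."""
        if not 0 <= local_index < self.length:
            raise IndexError("index outside locus")
        if self.forward:
            return self.start + local_index
        # For a reversed locus, local position 0 is its rightmost base.
        return self.start + self.length - 1 - local_index

    def window(self, n: int) -> Tuple[int, int]:
        """Half-open top-strand interval covering the first n nt of the locus."""
        a, b = self.to_absolute(0), self.to_absolute(n - 1)
        return (min(a, b), max(a, b) + 1)


# Made-up coordinates: the same locus drawn in either orientation.
ori_fwd = Locus("ori", start=1000, length=700, forward=True)
ori_rev = Locus("ori", start=1000, length=700, forward=False)
print(ori_fwd.window(100))  # (1000, 1100)
print(ori_rev.window(100))  # (1600, 1700)
```

A rotationally symmetric glyph gives no way to pick between the two intervals, which is the ambiguity described for Fig B.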

So what can we decide is the recommendation for a glyph of a whole gene (single gene or operon), and/or an entire gene cluster / "locus" with all its tx/tl initiation/termination elements? Some people write it in a box instead of an arrow, but this isn't a general solution as gene direction is still useful to describe in many cases, hence the other traditional use of the CDS arrow/pentagon. I'm not sure such genes necessarily qualify as "engineered regions" to be able to use a box; they're often natural sequences used as a whole for a phenotype, which is part of why one wants to abstract them into a single glyph, in the same way the many genes with distinct promoters, CDSs, ncRNAs, etc. mediating replication_origin function are abstracted into a single origin glyph ○.

The specs of the CDS glyph show association with only CDS (SO:0000316): "A contiguous sequence which begins with, and includes, a start codon and ends with, and includes, a stop codon." The far broader term "gene" is my proposal for what the popular usage of the CDS glyph is also describing in the earlier examples: gene (SO:0000704), "A region (or regions) that includes all of the sequence elements necessary to encode a functional transcript. A gene may include regulatory regions, transcribed regions and/or other functional sequence regions." Its children include gene_with_polycistronic_transcript and gene_cassette, and so cover some of the examples above. However, the usage encompassing multiple genes in multiple operons/directions would require another level up, gene_group (SO:0005855), "A collection of related genes."

A dual-use of the CDS glyph for both a real start-stop codon CDS and entire gene or gene_group is seemingly contradictory, and I'd expect contention on the matter if the consensus is to allow it. And if not, the design of a new gene / gene_group glyph could also be controversial since it violates a very old tradition, which might suggest that a new glyph would have to be at least quite similar to the CDS glyph.

jakebeal commented 3 years ago

There is a very similar issue for the Promoter glyph. At one extreme, promoter is sometimes used more abstractly to refer to the whole region, including regulatory binding sites, spacer material, TATA box / -35 site, and transcriptional start site. At the opposite extreme, I have seen promoter sometimes used to indicate precisely and only the TSS. And, of course, there are various levels of detail in between, when somebody wants to talk about some of these aspects but not others.

In the case of promoters, SO has at least partially resolved this for us, because it has defined that, for example, the -35 site (http://www.sequenceontology.org/browser/current_release/term/SO:0000176) is a child of Promoter (http://www.sequenceontology.org/miso/current_release/term/SO:0000167). The basic problem is still there, however.

As I think about this, I find that for myself at least, the big divide is between transcripts and things that control transcription. The reason is that this tends to be a critical point when qualitatively changing the function of a biological system. Thus, the most core glyphs being used are:

A thing that controls transcription: Promoter
A thing that gets transcribed: Coding Sequence --> generalize to Transcript (SO:0000673)?
A thing that includes both: Engineered region --> generalize to Region (SO:0000001)?

This would put the resistance genes using the rectangle for Region and the polycistronic cytochrome using a pentagon for Transcript. This would also be consistent with the recent work on introns (#63) and 2A regions (#78).
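As a rough sketch of that three-way split, the choice could be written as a small decision function. The names `Glyph` and `choose_glyph` are hypothetical and not part of any SBOL library; the two boolean questions are just my reading of the list above.

```python
from enum import Enum


class Glyph(Enum):
    PROMOTER = "bent arrow"  # a thing that controls transcription
    TRANSCRIPT = "pentagon"  # a thing that gets transcribed
    REGION = "rectangle"     # a thing that includes both


def choose_glyph(controls_transcription: bool, is_transcribed: bool) -> Glyph:
    """Pick the most general core glyph for a feature (hypothetical helper)."""
    if controls_transcription and is_transcribed:
        return Glyph.REGION
    if controls_transcription:
        return Glyph.PROMOTER
    if is_transcribed:
        return Glyph.TRANSCRIPT
    raise ValueError("feature neither controls transcription nor is transcribed")


# A resistance gene drawn with its own promoter folded in: rectangle.
print(choose_glyph(controls_transcription=True, is_transcribed=True))
# The polycistronic ccm cluster under separately drawn synthetic control: pentagon.
print(choose_glyph(controls_transcription=False, is_transcribed=True))
```

Choosing a more specific glyph where one exists would still be allowed; this only captures the fallback.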

Some additional notes:

Engineered Region shouldn't be generalized all the way to Sequence Feature (SO:0000110) because that includes zero-length elements and also because that is used for the unspecified glyph, to denote things that are missing information.
Transcript would mean the pentagon also covers ncRNA, which has its own glyph. I think that may be OK, however, since choosing to use a more specific glyph is always allowed.

jakebeal commented 3 years ago

Looking at this again, I'm noting that CDS (SO:0000316) isn't actually a subtype of Transcript (SO:0000673), which it actually has a part_of relationship with. I'd thus propose that CDS get Transcript as a second semantic value rather than replacing CDS as the value.

I would still suggest that Engineered region be replaced by Region since that is a strict is_a relation, and the glyph is useful for indicating regions that aren't engineered.
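The distinction between the two relation types can be sketched as follows. The two tiny relation tables encode only the relationships stated in this comment, are not loaded from the Sequence Ontology, and the helper name `is_subtype` is hypothetical.

```python
from typing import Dict, Optional

# Relations as described in this thread only (not an SO import).
IS_A: Dict[str, str] = {"engineered_region": "region"}
PART_OF: Dict[str, str] = {"CDS": "transcript"}


def is_subtype(term: str, ancestor: str) -> bool:
    """Follow is_a links upward; part_of links are deliberately ignored."""
    current: Optional[str] = term
    while current is not None:
        if current == ancestor:
            return True
        current = IS_A.get(current)
    return False


# Engineered region can be generalized to Region: every instance is still valid.
print(is_subtype("engineered_region", "region"))  # True
# CDS is only part_of Transcript, so Transcript cannot simply replace it.
print(is_subtype("CDS", "transcript"))            # False
```

This is why Transcript is proposed as a second semantic value alongside CDS, while Region can replace Engineered Region outright.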

jakebeal commented 3 years ago

Note: several other glyphs also have multiple semantic values, e.g., https://github.com/SynBioDex/SBOL-visual/tree/master/Glyphs/stop-site

toddslaby commented 3 years ago

I propose a sawtooth version of the Recommended CDS glyph for a polycistronic (single transcript) coding sequence mimicking head-tail-head-tail CDS arrangement and to work with a singular promoter and singular terminator.

The test case in my mind is something like a purely synthetic operon like the isobutanol-producing operon from the Atsumi Lab. In this example you can’t name it like parABCD or araCE to bastardize the CDS glyph because every coding sequence (CDS) has a unique gene name.

As for markers and the like, scientists have a bad history of poor documentation of markers. Every time I go to a new company/lab I have to work to define selection markers/gene cassettes to include Promoter-CDS-Terminator in the scientist’s annotation. It’s pretty common to have different terminal sequences across companies and labs.

As for the promoter glyph being inconsistently used, in the original days it meant a TSS. Then very early synbio folks appropriated it for promoters. I think we just have ourselves to blame :)

I think I’m seeing some conflation of DNA function with protein or cassette function in the glyphs. These annotation layers in my opinion need to be functionally separated in VSBOL more. If I’m a scientist I’m designing at different levels at different stages of my planning/execution; a machine might be able to design across all three or four levels better than me. But maybe I’m not up to date on my VSBOL usage.

I’ve been out of the loop so please throw out anything irrelevant or problematic.

rsc3 commented 3 years ago

This issue is common for every symbol that has complex underlying structure. I don’t think creating a new glyph is the way to go here. I think the CDS glyph should be able to be used to mean gene, depending on the context. I also think the promoter symbol should mean a promoter or (one of) the transcriptional start sites, depending on context. I know context isn’t what you want to worry about in a design language but this is how scientists have used these symbols in the literature. Just compiling more glyphs isn’t going to make the language more expressive—it’s going to prevent anyone from using it. Add a glyph that can also mean “this is a gene and NOT a CDS”. But allow the pentagon (or any of the glyph variants) to also mean gene.

toddslaby commented 3 years ago

Why would one want to specify a “gene NOT CDS” when there’s CDS and ncRNA that cover all situations?

I firmly disagree that “how scientists have done it” is a valid justification for confusion in a visual ontology. “Gene” is the most nefarious designation, as every college-level genetics program will point out: a “gene” means different things to different educated people and has historically been defined functionally, which leads to imprecise boundaries and discontinuities at the sequence level about what a gene actually is.

I suppose that is why we vote :)

jakebeal commented 3 years ago

I'd be interested to hear what people think about the question of splicing and introns. I think this might be a revealing example to discuss to clarify the question.

When I work with a "coding sequence" for a product in a eukaryote, I'm often in a state of ignorance about whether there's any introns and whether there's splicing going on to produce the protein that I get out. So I'm thinking about it like a coding sequence (since I'm focused on the product), but really, I'm working with a transcript that may include things other than pure coding sequence.

This, plus the comments that you've both made, motivates me toward thinking about making the following changes:

Add Transcription Start Site (SO:0000315) as a second semantic value for the promoter / bent-arrow glyph.
Add Transcript (SO:0000673) as a second semantic value for the CDS / pentagon glyph
Change Engineered Region to Region (SO:0000001)

Notice that this combination would let us avoid having to explicitly engage with ambiguous notions like "gene" and "gene cluster".
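For illustration, the proposed combination could be written down as a glyph-to-terms table with more than one semantic value per glyph. This is only a sketch of the proposal, not the current SBOL Visual specification; the dictionary keys and the helper `glyphs_for` are hypothetical names.

```python
from typing import Dict, List

# Proposed semantic values per glyph (SO identifiers as quoted in this thread).
PROPOSED_SEMANTICS: Dict[str, List[str]] = {
    "promoter (bent arrow)": ["SO:0000167", "SO:0000315"],  # promoter, TSS
    "CDS (pentagon)": ["SO:0000316", "SO:0000673"],         # CDS, transcript
    "region (rectangle)": ["SO:0000001"],                   # region
}


def glyphs_for(so_id: str) -> List[str]:
    """Return every glyph whose proposed semantic values include so_id."""
    return [glyph for glyph, terms in PROPOSED_SEMANTICS.items() if so_id in terms]


print(glyphs_for("SO:0000315"))  # ['promoter (bent arrow)']
print(glyphs_for("SO:0000673"))  # ['CDS (pentagon)']
print(glyphs_for("SO:0000704"))  # [] : "gene" is intentionally left unmapped
```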

jakebeal commented 3 years ago

Bumping this issue; @rsc3 @toddslaby @shyambhakta - thoughts on introns or the proposal I've made as a potential way to navigate the issues that you all have raised?

toddslaby commented 3 years ago

It could work but I still think as a scientist we need a new glyph for polycistronic:intron-containing transcripts. TSS is a necessary annotation to design guide RNA expression cassettes so I’m surprised it hasn’t come up yet.

When I was working in C. elegans I was designing plasmids that had both coding sequences with introns (for expression in C. elegans for the experiment) and coding sequences without introns (for expression in bacteria for cloning operations). Some yeast constructs and fungal constructs also require introns for some components.

The original assumption seemed to have been that synthetic devices would be species-specific despite (shuttle) plasmids being the first synthetic devices and also multi-specific.

Since it looks like SBOL Visual is moving towards representing all parts of a plasmid/vector, visually representing both situations with a new glyph would still be incredibly clarifying. Otherwise it would be incredibly difficult to design a shuttle vector with SBOL Visual. And that makes sense: throughout my years, most scientists were just reusing other scientists' devices and taking the construction materials used to enable experimentation for granted.

jakebeal commented 3 years ago

If we use the current glyphs, then the approach to represent polycistronic and intron-containing regions would be like these, respectively:

Now, both of these approaches require being specific about the number of coding sequences or introns. What are the shortcomings that you see in this representation?

toddslaby commented 3 years ago

Some naturally occurring microbial operons have a mix of partially overlapping coding sequences and non-overlapping coding sequences.

Synthetic microbial operons tend not to have these overlapping bits because we don’t understand how to control the expression of the downstream gene as well IMHO. So you have ncRNA bits between CDSs with RBSs in them.

The first glyph misrepresents the naturally occurring operons some researchers and companies routinely utilize.

Eukaryotic introns for enhanced protein expression are commonly engineered at some point and their presence here is implied. There also could be zero association between exons. What’s the rationale for not showing splice junctions in the glyph? Also, is the intron-containing glyph also how CRISPR repeats would be documented at the DNA level?

jakebeal commented 3 years ago

Thanks, @toddslaby, those are definitely some good points! The cases that you raise are simply things that nobody has yet worked out how to represent well with SBOL Visual.

Do you want to make some suggestions for how these might be effectively diagrammed?

rsc3 commented 3 years ago

I agree with everything @toddslaby said about introns and operons.

I think the “transcriptional unit” is the most fundamental unit of expression. If you can represent TU along with CDS (and exon) simultaneously the problem is mostly solved. You also need to represent stacks of multiple TSS’s (digging into promoter structure) since this is how new sequencing methods are measuring them.

The genome browsers have gone a long way towards building this symbolic logic. Why don’t we complete a survey (IGV, ucsc, etc) and use the results to inform this choice?

jakebeal commented 3 years ago

@rsc3 Would you like to organize that? I think that if you're up for it, it would be an excellent resource.

In the meantime, what do you think about the changes that I propose as a near-term, partial, and likely-compatible solution? Reminder, those are:

Add Transcription Start Site (SO:0000315) as a second semantic value for the promoter / bent-arrow glyph.
Add Transcript (SO:0000673) as a second semantic value for the CDS / pentagon glyph
Change Engineered Region to Region (SO:0000001)

Gonza10V commented 8 months ago

Hi, I would like to revive this discussion. From @jakebeal's last comment:

Add Transcription Start Site (SO:0000315) as a second semantic value for the promoter / bent-arrow glyph. I agree with this. I would also like to know whether it can be expanded. For example, in my new promoter designs, to better insulate expression I add a terminator, an upstream sequence, the promoter, and a ribozyme; all of that is a promoter for me now. Similar to B0015, which is a terminator composed of two terminators, can a promoter be composed of accessory parts like those mentioned above?
Add Transcript (SO:0000673) as a second semantic value for the CDS / pentagon glyph. I disagree with this; maybe this is too old or I'm not understanding it. The transcript is an RNA, whereas the CDS represents a DNA sequence; therefore a transcript can be represented as an ssNA or a macromolecule. What is the representation of a transcript now?
Change Engineered Region to Region (SO:0000001). This would be very useful for me as I want to represent antibiotic resistance encoded in phenes. Genes and phenes are both commonly represented as CDS glyphs. To leave the CDS as is, I propose to use the alternative CDS glyph, the arrow, to represent genes and phenes. We had a brief discussion about this with @shyambhakta on the SBOL Slack channel. The use of the engineering region rectangle to represent regions in general would include phenes and genes, and solve my problem, but I think it would be difficult for researchers with a background in genomics to understand it. Are there more alternatives? Any thoughts?

fxbuson commented 8 months ago

Some thoughts on this:

Wouldn't a DNA location (SO:0000699) be more precise to represent a TSS?
I would argue the promoter glyph already encompasses uses such as what @Gonza10V suggested, since a composite promoter glyph would correctly indicate a promoter region with other elements.
For extensions of the CDS glyph, I see a problem in adopting Transcript (SO:0000673), because it describes an RNA sequence, or Transcription Unit (SO:0002301), because it only defines the region between transcription start and termination. This way a resistance gene in a plasmid wouldn't be accurately represented because it should also encompass the promoter region (before the transcription start site).

jakebeal commented 8 months ago

Wouldn't a DNA location (SO:0000699) be more precise to represent a TSS?

I agree that DNA location could be used to represent a TSS. However, there are many diagrams that specifically use a promoter-style bent arrow to represent the TSS. Indeed, one could argue that historically the promoter glyph was really a co-option of the TSS usage. That doesn't mean we have to go that way, if we find it incoherent, but it means it would not be unreasonable.

The use of the engineering region rectangle to represent regions in general would include phenes and genes, and solve my problem, but I think it would be difficult for researchers with a background in genomics to understand it.

Why do you think that it would be difficult? "Rectangle can be used for anything that doesn't have a more specific glyph" seems pretty simple to me.

shyambhakta commented 8 months ago

I mentioned in the main post, point #4 that the direction of the transcriptional unit matters a lot, but it also applies to all the other examples in (A). It would be incomplete to just see "engineered region" boxes with transcription factor names, resistance marker names, and locus names without some sort of associated orientation. I mentioned in the post that they're often not "engineered", hence abstracting the promoter, RBS, CDS details as they were not selectively composed to care about their identities.

Also, a gene/phene glyph must generally include a promoter, as it encompasses a transcriptional unit (however, in (A), I showed how it's also used to denote a series of separate CDSs). Phenes don't manifest without the promoter. The transcriptional unit / gene lacI^q I show in (A) means "constitutive mutant of lacI", and what is constitutive? The promoter, which is included in the abstraction of the glyph. Terminators are not always present in the natural span or synthetically added, as transcriptional insulation is a contextual need in natural and synthetic contexts. So a gene/phene glyph would be promoter through CDS, ±terminator. If we care to include the usage in the ccm operon example in (A), the promoter could also be optional, at which point the gene/phene glyph would be something like "transcriptional and/or translational unit" = all cis-regulatory parts contributing to transcription or translation at a mono- or polycistronic locus.

In (B)–(D), I show how glyphs like the replication origin lacking orientation information have ambiguity that is perhaps unacceptable in the long run: I believe it should be an expected goal of SBOL Visual to be able to have a one-to-one correspondence of sequences to labeled glyphs, but then shouldn't one be able to look at a diagram and match part sequences associated with glyphs back to recreate the DNA? If glyphs are rotationally symmetric, you wouldn't know what orientation to recompose parts or genes/phenes, in actual practice or in one's mind when trying to take in what the DNA program is behind the glyphs that abstract it… what are the courses of RNA polymerases and ribosomes across the circuit?
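The recomposition problem can be sketched in a few lines of Python. Everything here is hypothetical (the `Part` tuple, the function names, and the toy sequences); the point is only that each glyph lacking orientation doubles the number of DNA sequences a diagram could stand for.

```python
from itertools import product
from typing import List, Optional, Tuple

# (sequence, forward?) where None means the glyph conveys no orientation.
Part = Tuple[str, Optional[bool]]

_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def candidate_assemblies(parts: List[Part]) -> List[str]:
    """List every top-strand sequence consistent with the drawn glyphs."""
    options = []
    for seq, forward in parts:
        if forward is None:
            # Rotationally symmetric glyph: either orientation is possible.
            options.append([seq, reverse_complement(seq)])
        elif forward:
            options.append([seq])
        else:
            options.append([reverse_complement(seq)])
    # A set removes duplicates that arise from palindromic parts.
    return sorted({"".join(choice) for choice in product(*options)})


# Toy sequences: a directional pentagon followed by a symmetric origin glyph.
print(candidate_assemblies([("ATGC", True), ("AAGG", None)]))
# Same parts, but the origin is drawn with a direction.
print(candidate_assemblies([("ATGC", True), ("AAGG", False)]))
```

With every glyph oriented, the list always has exactly one entry, which is the one-to-one correspondence argued for above.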

One reason I like to have the CDS pentagon sit atop the DNA backbone instead of the middle is to emphasize the orientation. This can also be done with the engineered region box, but the pentagon's arrow contributes a lot to seeing the direction of transcription, and a box doesn't. It doesn't help that diagrams often do a horizontal reflection to show orientation, which often works given context clues of the typical promoter-RBS-CDS-terminator sequence of glyphs. But, say, a terminator between a replication origin and a resistance marker in the same format gives no clues as to which orientation the terminator is — which part is being insulated from which? one may then ask.

I think the reason the CDS pentagon/arrow has historically been used so broadly, for CDSs, genes, phenes, operons, loci, and really any feature, is that it hasn't been confusing. Benchling shows every feature as "CDS" pentagons proportional to sequence length, but really, it's just a scalable labeled box made directional/arrow-like. The start–stop span CDS has been only the narrowest of usages. If you look at my example (A), which has three distinct non-CDS uses of the pentagon, it's still not confusing. Furthermore, it perfectly allows the one-to-one glyph-to-sequence requirement I proposed, one that an engineered region box and a replication origin circle (B)–(C) fail unless represented by something directional like a pentagon (D), which doesn't look the slightest bit wrong or misleading even though it's a replication origin, not a CDS.

All I know is that pentagons are way more than CDSs in the world outside SBOL. It probably(?) won't be a viable option to expand the role of the pentagon and make CDSs more distinct. Arrows are too often synonymous with CDSs to perhaps formalize as gene/phene/locus. A new locus glyph would need to be box-like, as the pentagon is, to make it scalable to sequence length; and it needs to clearly show and evoke direction in its shape. Perhaps different box ends can be explored: round ended box, inward pointed pentagon…

I need to write my thesis and try not to respond for two weeks 😅

Gonza10V commented 8 months ago

I think that we should deal with some parts of the discussion in separate issues. Let's try to solve the gene, gene group, and locus glyph here, and if you want, @shyambhakta, you could later open the origin of replication issue and the CDS-glyph-on-top-of-the-DNA-line issue, which might be related to #167. @jakebeal, is your point "Add Transcription Start Site (SO:0000315) as a second semantic value for the promoter / bent-arrow glyph" still needed? If it is, could you open an issue for it or point me to one?

The CDS currently has two glyphs: the box with one side bent out like an arrow (the pentagon) and the block arrow. In my background as a biochemist, I often saw both of these CDS glyphs used to represent genes and transcriptional units.

Using the following image from the practitioners paper: [image: Screenshot 2023-11-07 at 13 45 26]

In it, the block arrow is used to represent genes or transcriptional units, and groups of genes are shown with boxes similar to engineered-region, as that part has no single direction.

A fair solution could be to remove the block arrow as an alternative CDS representation and use it to cover gene (SO:0000704) and unit of gene expression (SO:0002300) with known directionality, and to use the box for region (SO:0000001) for parts with unknown or multiple directionality. If the box includes region, as @jakebeal mentioned before, then it can be used to represent biological regions and therefore a gene group.

Would something like this help in this regard? Any thoughts?

I'm back to my thesis writing as well 😅 cya

SynBioDex / SBOL-visual

Gene and Gene Group / Locus Glyph #113