SynBioDex / SBOL-visual

The reference implementation of the SBOL Visual standard

Usage of protein feature glyphs in combination with CDSs #167

Open shyambhakta opened 11 months ago

shyambhakta commented 11 months ago

How are protein-stemmed glyphs to be used in conjunction with CDS/CDS domain glyphs? Should CDS glyphs be required to represent the full start->stop codon open reading frame?

The polypeptide-stemmed glyph indicates a feature that manifests in the polypeptide form. Extended from this stem, the protein cleavage site glyph might represent a TEV protease site and a protein stability element glyph might represent a solubilization domain or(?) a degradation tag². Polypeptides are encoded by CDSs (Coding DNA Sequences), a DNA feature¹. The clash naturally arises in how to indicate a protein-stemmed feature within the CDS it is encoded within. The ambiguity might be why they have been some of the least used glyphs.

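[image: image] https://user-images.githubusercontent.com/5035245/279859749-f713ba27-accf-45b5-848d-afda0dbd9f0d.png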

Option 1a): require CDS/domain glyphs to cover the contiguous translation unit, i.e. start to stop codons, AND³ 1b) allow superposition of any protein-stem glyphs with the CDS glyph or any of its domains, as shown in example (A). Rationale: The SO term "CDS" is defined as "A contiguous sequence which begins with, and includes, a start codon and ends with, and includes, a stop codon." Furthermore, the CDS pentagon/arrow glyph (or a rectangle) has historically been used to faithfully encompass start->stop codon translated reading frame spans in biology and syn bio, well before SBOL's formalization of it and to this day. The translational unit is evidently of extreme importance to denote with a single glyph without interruptions, save for domain indicators that subdivide the glyph without breaking it. Surely we cannot violate such basic definitions and norms. In fact, we sort of already decided on the sanctity of the contiguous CDS glyph when we deliberated on the 2A peptide glyph in Issue 78 https://github.com/SynBioDex/SBOL-visual/issues/78, where we chose dashed lines that don't interrupt the CDS pentagon shape. I actually brought up the present issue and arguments back then: comment (#1) https://github.com/SynBioDex/SBOL-visual/issues/78#issuecomment-625079617 (#2) https://github.com/SynBioDex/SBOL-visual/issues/78#issuecomment-703262724

Option 2): Allow protein-stem glyphs to substitute for domain glyphs as in example (B), and thus allow CDS/domain glyphs to stand for protein-coding segments of DNA without implication of a full open reading frame, i.e., without the implication of beginning/ending with start/stop codons. Rationale: Maybe someone thinks glyph superposition must be avoided and that the CDS definition and norms are better revised than respected. I think example (B) is misleading: the instinct to see the CDS glyph as a translational unit makes the diagram suggest that the CDS wrongly ends after the purple domain, and that the stability element is a feature that follows the CDS rather than being part of it. Also, the cleavage site in the middle of (B) interrupts the interlocking domain shape, which is aesthetically unpleasing.

¹ CDSs may be DNA features, perhaps rationalized as information-storage parts, but CDSs truly only manifest their coding function in the RNA, since that's what the ribosome/tRNA read. 🤔

² The stability-top in general is perhaps so rarely used because the +/– direction of stabilization is hugely important to understanding the function of the part and the reason it is used in a circuit. Positive-stability domains are quite rare in syn bio; I can only imagine enzymes being stabilized by, e.g., an MBP or GST tag. It's counterintuitive that a shield glyph can represent negative stabilization when degradation tags would be the predominant use of the glyph in syn bio. Furthermore, it is pretty easy to misuse/misinterpret the protein cleavage site glyph as a degradation tag, as the X top evokes degradation as well as cleavage. Not to mention, technically, the proteasomal degradation process is a series of many proteolytic cleavages. This matter is for a separate issue.

³ Option 1b need not necessarily be in conjunction with 1a. But this would mean that either protein-stem glyphs would have to be deprecated or such glyphs could be used only in isolation, outside genes/CDSs, e.g., in part plasmids. There must be some implied SBOL rule that prevents glyph composition from invalidating the usage of a glyph, as would happen when, say, a deg tag part and the protein-stemmed glyph that represents it get used to build a CDS in a gene: the glyph would become invalid in the composition with other CDS domains, where the CDS domains would take precedence. Hence the option to permit superposition of the glyphs.
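To make the SO definition quoted under Option 1a concrete, here is a minimal Python sketch of what "covers the full translation unit" would mean for a sequence. It is only an illustration: the function name `is_full_cds` is hypothetical, and it assumes the standard genetic code with ATG as the only start codon (alternative start codons are ignored).

```python
# Illustrative sketch only; names and codon sets are assumptions, not part of SBOL.
START_CODONS = {"ATG"}                 # assumption: alternative starts are ignored
STOP_CODONS = {"TAA", "TAG", "TGA"}    # standard genetic code stop codons


def is_full_cds(seq: str) -> bool:
    """Return True if seq begins with a start codon, ends with a stop codon,
    and has no in-frame stop codon before the final one."""
    seq = seq.upper()
    # Need at least a start and a stop codon, and a whole number of codons.
    if len(seq) < 6 or len(seq) % 3 != 0:
        return False
    codons = [seq[i:i + 3] for i in range(0, len(seq), 3)]
    if codons[0] not in START_CODONS or codons[-1] not in STOP_CODONS:
        return False
    # An earlier in-frame stop would end translation before the last codon.
    return not any(codon in STOP_CODONS for codon in codons[:-1])
```

Under this reading, a standalone tag or protease-site part (a protein-coding segment with no start or stop codon) would not qualify as a CDS on its own, which is the distinction Options 1 and 2 disagree about.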

graik commented 11 months ago

Hi Shyam,

good question! The main use case we had in mind for the protein glyphs was for them to sit on their own "protein line" above / below a DNA. I think in many cases that should look best and is fairly intuitive.

What I could suggest alternatively is to put the protein glyphs on the top line of the CDS box. Technically speaking, protein glyphs should never occur on a DNA baseline (contrary to your examples) as they can only describe things that are actually translated. So there should always be a CDS box to decorate. I have never seen this done anywhere, though, so it wouldn't score high in terms of intuitive pattern recognition.

Greetings Raik


fxbuson commented 11 months ago

@graik Would this be what you're suggesting with the glyphs on top of the CDS box? This would be my intuition if I were to represent all my features (DNA, RNA, protein) on the same diagram.

image

In that case, would the glyphs necessarily sit on top of each peptide region (A), assuming a 1:1 relationship, or could they sit anywhere along the line (B), including between peptides?

Gonza10V commented 11 months ago

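[image: CDS_parts_representation] https://user-images.githubusercontent.com/35148159/280417568-84461a81-b178-423b-b65d-f10709aeb986.png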

I like @shyambhakta's alternative A, but if we implement it I have some questions. Should the lines inside a TU be arrows, or can they be straight lines? In my examples I use straight lines, as they take up less space along the X axis.

A is how I show my designs using the current standard.

B To represent the internal components of a part, the trivial way would be to show the part as a component with a line inside separating the different parts. This is also compatible with how it is represented in the data standard.

C Including @shyambhakta's (A) ideas, I could represent them in this way, but then how should an assembly scar be represented?

D is a mixture of B and C, where you represent only the sequence features relevant to the image and show that the part is a composite; all the other details are omitted.

graik commented 11 months ago

Hi Gonzalo,

as a general comment, it looks to me as if we are running here into the quite regular issue of non-practitioners trying to set standards without looking at the current community practices in literature and seminar slides, however confusing they may be. A bit off-topic but, for example, as a protein biochemist / engineer, I really dislike this "helix stem" for designating protein features. I don't think any protein engineer would ever use it or even recognize it as something related to proteins. It also is unnecessarily crowding the space and, besides, most protein features cover a range of residues rather than single sites. And a word of caution: the dimensions in your example are extremely off. Cloning scars and even RBS regions are tiny compared to the average length of a CDS.

Fully on-topic: the most common protein annotation will not be tags or protease "sites" but domains, that is, longer functional regions. IMO the best depiction for that is a "pill-box" shape. Whatever you come up with here needs to start from that. Single-residue "site annotations" should be a secondary concern. First you have to figure out how to draw two domain regions into your protein, perhaps with a protease site in between and two catalytic sites on top of one of them to make it more fun. I think you end up with a situation where it is in fact cleaner, definitely more intuitive, and not much more space-consuming to draw a separate protein line above the CDS and start populating this line with the protein features. You could then also have the line zoom in (be longer than the CDS symbol below, with dashed connectors back etc). And then you could just leave out this pesky helix stem :) the symbols can sit directly on this protein line.

Anyway, that would be my suggestion. Starting from that, one could later see whether things could optionally be compressed back onto the top line of the CDS as described in Felipe's image. My guess is that this could work sometimes but, as soon as you start putting exon and cloning scar annotations into the mix, I think it may quickly get confusing.

Just my two cents of course... Raik



graik commented 11 months ago

PS: Here is a slide that I presented last week for our Bioengineering lecture at KAUST:

[image: image.png] This one is emphasizing the whole central dogma "stack". RNA would typically be left out. The domain pill-boxes could also be centered on top of the protein line as the CDS is on top of the DNA line.

Hope that helps somewhat. Greetings Raik



Gonza10V commented 11 months ago

Hi Raik,

Could you please edit your post to re-upload the image? It is not visible to me. Also, do you have more examples of how practitioners describe features of a protein in DNA? That would be very useful for aligning with current practices.

graik commented 11 months ago

Hi Gonzalo,

image re-attached...

I would guess that there are in fact not many examples of protein features annotated on graphical depictions of DNA constructs. Protein features are typically annotated separately on protein representations. If you want to have both levels of detail in the same figure, you would by default do it as I described, with DNA and protein displayed next to (atop) each other. In a protein engineering-focused study, the DNA level is typically not shown at all.

Greetings Raik


shyambhakta commented 11 months ago

Hi @graik, still no image on GitHub or the email thread. I do like the concept of a peptide backbone that could mesh with a protein language. What you're describing is perhaps something like (E) or (F), going off your 2019 comment and the published example in issue 68. I think these issues may need to merge. I'm not sure if switching the backbone shape is important. Curves like in (E) might evoke linkers / less-structured peptides and make it more intuitively proteinaceous, since straight might "feel" too rigid to be protein, but I can't say how long a curved backbone needs to be aesthetic like in (I). Then again, a long segment of straight protein backbone would also look weird. Maybe the N- and C-termini can be marked if the backbone ends are exposed. Anyhow, so long as there are pill boxes relatively close together on even a straight peptide backbone (F), and perhaps if the ends aren't visible, it wouldn't be confused for a protein superposed on DNA, like I recall being shown in the specs to show protein:promoter binding. I think it would be great to work toward formalizing a protein backbone for protein glyphs.

In terms of the protein function glyphs, I think the X for cleavage is still intuitive, but like I mentioned earlier, a shield for what is typically a degradation tag is bad; we need a separate glyph for negative stability/degradation, eventually. Also a secretion/localization glyph and maybe an affinity tag function glyph. But any novel functional glyphs are of lower priority since they are slow to catch on, as people need to become familiar with them. The very basics of the protein language, with pill boxes on a backbone, are open-ended and intuitive.

Superposing protein-stem glyphs with domains as in (G) doesn't look too awful aesthetically, but it feels a bit weird to put glyphs where it seems text belongs. CDS/domains seem to "want" text like in (E) and (H). But perhaps the glyphs could stand where abstraction is desired. Placing protein-stem glyphs on top of the CDS leaves space for text in the CDS, but I'm not sure I like the aesthetics of things dangling off. It can nicely provide abstraction, though, as the TEV site can be shown compactly between the two CDS domains in (H) without needing a separate domain as in (G).

And I agree, cloning scars would not in practice be shown so prominently. @Gonza10V, I show part boundaries/scars, when important, with just dots like in (G) and (H). They don't steal the show this way. If you really want to use the scar glyph, I think the specs may require a white box around it like in (E), so that it interrupts and stands out from the DNA backbone. At least that's how I've seen it in examples. Also, in your (B)–(D), CDS subdivisions have to be indicated with angled lines like below, not straight ones, I believe.

image

graik commented 11 months ago

Sorry, I kept using the Reply-to from GMail but forgot that this never works for images. Here it is: image

graik commented 11 months ago

@shyambhakta Your F and E definitely look most natural to me. Curved linkers are not really something I have seen but it doesn't look bad either. H and G look weird IMO... (I) also... so I guess the simple straight line is best. And I agree that the pillbox+text can for now be used for lots of things before expanding into a whole list of custom symbols.

Gonza10V commented 11 months ago

As I see in @graik's image, protein engineers show protein features on the protein and not on the CDS. But in SBOL we still don't have a protein language #68 or an RNA language #79. The solution of indicating a part as a composite and then showing its components in RNA or protein requires the development of the latter two. Developing the protein language would be enough to represent something similar to the example provided. Now focusing on DNA, if we want to represent the details there, G from @shyambhakta's example is the most intuitive for me.

graik commented 11 months ago

I am not sure G is the best template. Doesn't look intuitive to me (I would never guess there is any protein related info there). Plus, if you scale things down to a normal size, things get crowded and difficult to read very quickly. A more general solution would be to clearly separate protein from DNA features by always having each on its own line. Above each other if you want. Same goes for RNA, IMO. Definition would be easy: (1) RNA features are to be displayed on a wavy line above the corresponding DNA feature / location. (2) Protein features are to be displayed on a straight line above the corresponding DNA or RNA feature. (3) If the RNA or protein line shows the complete molecule, this can be indicated by a line terminator such as ---| or ---o

I think this would be intuitive, visually appealing and still very compact.
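The three rules above could be sketched as a small layout routine. This is only an illustration under stated assumptions: the `Feature` class, the `layout` function, and the `TRACK_STYLE` table are hypothetical names and not part of any SBOL Visual tool; positions are assumed to be given in DNA coordinates so that stacked tracks line up.

```python
from dataclasses import dataclass

# Hypothetical track table: (stacking order from the bottom, line style).
# Rule 1: RNA features sit on a wavy line above the DNA.
# Rule 2: protein features sit on a straight line above the DNA or RNA.
TRACK_STYLE = {
    "DNA": (0, "straight"),
    "RNA": (1, "wavy"),
    "protein": (2, "straight"),
}


@dataclass
class Feature:
    name: str
    molecule: str  # "DNA", "RNA", or "protein"
    start: int     # assumed to be in DNA coordinates
    end: int


def layout(features, complete=()):
    """Group features into one track per molecule type, bottom to top.

    Molecule types listed in `complete` get a line terminator (rule 3).
    """
    by_molecule = {}
    for feature in features:
        if feature.molecule not in TRACK_STYLE:
            raise ValueError(f"unknown molecule type: {feature.molecule}")
        by_molecule.setdefault(feature.molecule, []).append(feature)

    tracks = []
    for molecule in sorted(by_molecule, key=lambda m: TRACK_STYLE[m][0]):
        tracks.append({
            "molecule": molecule,
            "line": TRACK_STYLE[molecule][1],
            "terminator": "|" if molecule in complete else None,
            "features": sorted(by_molecule[molecule], key=lambda f: f.start),
        })
    return tracks
```

For example, a CDS on the DNA track plus two domains and a TEV site on the protein track would come back as two stacked tracks, with the protein line drawn above the CDS and no glyph ever placed directly on the DNA baseline.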

jakebeal commented 11 months ago

When I've been marking protease cleavage sites in the past, I've tended to use something like the H figure, so that I can indicate the DNA location of the encoding for the protease cleavage site.