SynBioDex / SBOL-visual

The reference implementation of the SBOL Visual standard

Recommended representation for sgRNA/dCas9 interaction #110

Closed Gonza10V closed 3 years ago

Gonza10V commented 3 years ago

Which one is a better representation of the process:

We used to do this:

Screen Shot 2020-10-13 at 13 01 20

but now I think this could be more accurate:

Screen Shot 2020-10-13 at 13 00 14

It just needs the Process interaction node (https://github.com/SynBioDex/SBOL-visual/tree/master/Glyphs/InteractionNodes/process) at the junction between the sgRNA and dCas9. I drew the images in SBOLCanvas and opened an issue requesting that interaction-node feature (https://github.com/SynBioDex/SBOLCanvas/issues/135).

Please give me any feedback on the way I add issues or on the workflow, in case I'm not following something.

jakebeal commented 3 years ago

Yes, it does definitely need the interaction node at the junction.

Once that is there, however, the second is correct while the first is problematic. The reason is that the interaction node is an SBO process (https://www.ebi.ac.uk/sbo/main/SBO:0000375) whose inputs are its reactants (SBO:0000010 Reactant). Coding sequences and ncRNA aren't reactants, though; instead, their products are reactants. Thus, the semantics of the top version aren't currently supported.

It's possible we could end up with a semantics that supports implicit nodes, but that's still being worked out on SEP V018 (#73).
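
The reactant rule described above can be sketched in a few lines of Python. This is only an illustration, not part of any SBOL library: the `Participant` class, the `kind` labels, and the `invalid_reactants` helper are hypothetical names. It also assumes that the first diagram connects the sequence features directly to the process node while the second connects their products.

```python
from dataclasses import dataclass

# Hypothetical labels for this sketch; they are not identifiers from any SBOL tool.
# Only molecular species can play the SBO:0000010 (reactant) role in a process.
MOLECULAR_SPECIES = {"protein", "RNA", "complex", "small molecule"}


@dataclass(frozen=True)
class Participant:
    name: str
    kind: str  # e.g. "CDS", "ncRNA gene", "protein", "RNA"


def invalid_reactants(inputs):
    """Return the inputs that cannot act as reactants of a process node."""
    return [p for p in inputs if p.kind not in MOLECULAR_SPECIES]


# Assumed reading of the first diagram: sequence features feed the node directly.
first = [Participant("sgRNA gene", "ncRNA gene"), Participant("dCas9 CDS", "CDS")]

# Assumed reading of the second diagram: the expressed products feed the node.
second = [Participant("sgRNA", "RNA"), Participant("dCas9", "protein")]

print([p.name for p in invalid_reactants(first)])
print([p.name for p in invalid_reactants(second)])
```

Under these assumptions, both inputs of the first diagram are flagged and none of the second are, which matches the reasoning given in the comment.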

Gonza10V commented 3 years ago

Sorry for the previous close and reopen; I misclicked. I just wanted to share my current implementation, which I think conforms to SBOL3, in case it can serve as an example. dCas9-sgRNA-TMAINU

jakebeal commented 3 years ago

@Gonza10V This is correct, but there are a few best-practice recommendations in the specification that would improve it further:

You might also want to consider whether to show the genetic production of LacI and TetR from their CDSes, but that's optional. You might, however, want to change the color for lacI/LacI/IPTG to look less like venus.

Gonza10V commented 3 years ago

Thanks for the feedback @jakebeal !

jakebeal commented 3 years ago

  1. Yes, please make an issue for differentiating the Process example. We can use some other form of reaction instead.
  2. Yes, keeping the reverse complement is still valid, just not recommended (i.e., SHOULD rather than MUST, per section 5.2.2 and Figure 10 in the specification).

shyambhakta commented 3 years ago

It's common to want to simplify an entire gene as a single glyph: in the diagrams above Kanᴿ and Carᴿ (kanamycin and carbenicillin resistance genes) are shown as CDS arrows. I myself like to use the CDS glyph to represent constitutive resistance and transcription factor genes for which inclusion of more detailed promoter, RBS, and terminator glyphs makes for undesired clutter. Yet resistance is a phenotype, so can't technically be the label for just a CDS, which would have to have CDS names, e.g., aphA1/nptI/aphA2/nptII or bla for these resistance genes. But because these CDS names are less familiar, it's preferable to write the resistance phenotype. This extends to transcription factors, e.g. lacI^q which is a stronger promoter mutant of the wild-type lacI gene, and thus cannot describe just a CDS.

So what is the recommendation for a glyph of an entire gene, with its tx/tl initiation/termination elements? Perhaps this should be a new issue if not already described. Some people write the gene in a box instead of an arrow, but this isn't a general solution, as gene direction is still useful to describe, hence the use of the CDS glyph. Also, I'm not sure such genes necessarily qualify as engineered regions; they're often natural, hence wanting to abstract them into a single gene, in the same way that the many genes with distinct promoters, CDSs, ncRNAs, etc. mediating replication origin function are abstracted into a single origin glyph.

An aside: I like to use the American Society for Microbiology's prescribed abbreviations for antibiotics / resistances: https://aac.asm.org/content/abbreviations-and-conventions

jakebeal commented 3 years ago

@shyambhakta In mammalian system engineering, I often just don't see the ribosome entry site and terminator represented at all. Given the differences in eukaryotic transcription, it seems they are often taken as implicit by the presence of a CDS. We are never required to draw glyphs for every base, so why not just omit those portions if they aren't of interest?

shyambhakta commented 3 years ago

The issue isn't the possibility of glyph omission; it's that a whole gene/locus name be placed on / assigned to a CDS glyph when not describing just the CDS alone but rather more parts than the CDS, even more than one gene's worth if an operon.

For example, an arabinose transcriptional activator + transporter operon araCE may popularly be represented by just a single CDS glyph labeled ara or araCE. It's not clear that an SBOL CDS glyph can be used more broadly for such genes, gene clusters, loci, some described by phenotype like the resistance genes I mentioned earlier, which can't impart resistance/other functions as just a CDS. I think it needs to be made explicit if it's somehow already permitted, because I felt like I was willfully violating SBOL when choosing to do so, despite it being common in the literature.

Another case: I use a native cytochrome C maturation gene cluster cloned into a plasmid. But instead of taking a bunch of space with eight CDSs in order labeled ccmA ccmB ccmC ccmD ccmE ccmF ccmG ccmH, I and others in the literature simplify it to (the equivalent of) a single CDS glyph labeled ccm which encapsulates the whole operon / gene cluster, from promoter to terminator.

Another example: sets of plasmid partitioning genes used to stabilize plasmid inheritance are often named with the prefix par, and their operon(s) (e.g. parABCD) and their associated promoters/terminators are often simplified to the equivalent of just one CDS glyph labelled par, in the primary direction of transcription if there is one.

These sorts of common and popular usage cases, to me, clearly violate the specs of the CDS glyph, which is shown associated with only CDS SO:0000316: "A contiguous sequence which begins with, and includes, a start codon and ends with, and includes, a stop codon." The far broader term "gene" is what the CDS glyph is also being asked to describe in these examples: gene SO:0000704, "A region (or regions) that includes all of the sequence elements necessary to encode a functional transcript. A gene may include regulatory regions, transcribed regions and/or other functional sequence regions." Its children include gene_with_polycistronic_transcript and gene_cassette, and so may cover the examples above, though usage encompassing multiple genes in multiple operons/directions would require another level up: gene_group SO:0005855, "A collection of related genes."

A dual-use of the CDS glyph for both a CDS and entire gene cluster is seemingly contradictory, and I'd expect contention on the matter if the consensus is to allow it. And if not, the design of a new gene / gene cluster glyph attempting to replace the traditional could also be contentious.

By the way, I assumed RBSs are perceived to be a bacterial feature, given that ribosomes don't require sequence-specific association with mammalian mRNA 5′ UTRs and a Kozak sequence but rather associate mostly via the 5′ mRNA cap and scan to a start codon.

jakebeal commented 3 years ago

I think the potential to address this by expanding the set of associated SO terms is a good one. Can you please open a specific issue for discussion of this?
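
As a rough sketch of what "expanding the set of associated SO terms" could mean, the snippet below compares the current association quoted in this thread with a hypothetical expanded one. The expanded set, the `unsupported_labels` helper, and the SO term assigned to each example label are illustrative assumptions, not anything taken from the specification.

```python
# SO identifiers and names as quoted earlier in this thread.
SO_TERMS = {
    "SO:0000316": "CDS",
    "SO:0000704": "gene",
    "SO:0005855": "gene_group",
}

# Per the discussion, the CDS glyph is currently associated only with SO:0000316.
CURRENT_CDS_GLYPH_TERMS = frozenset({"SO:0000316"})

# Hypothetical expansion under discussion; NOT what the specification says today.
PROPOSED_CDS_GLYPH_TERMS = CURRENT_CDS_GLYPH_TERMS | {"SO:0000704", "SO:0005855"}

# Illustrative assignment of example labels to the SO term they really describe.
EXAMPLE_LABELS = {
    "bla": "SO:0000316",   # a CDS name
    "KanR": "SO:0000704",  # a resistance phenotype standing in for a whole gene
    "par": "SO:0005855",   # partitioning genes, possibly spanning several operons
}


def unsupported_labels(labels, associated_terms):
    """Return, sorted, the labels whose SO term the glyph is not associated with."""
    return sorted(name for name, term in labels.items() if term not in associated_terms)


print(unsupported_labels(EXAMPLE_LABELS, CURRENT_CDS_GLYPH_TERMS))
print(unsupported_labels(EXAMPLE_LABELS, PROPOSED_CDS_GLYPH_TERMS))
```

With only SO:0000316 associated, the whole-gene and gene-group labels fall outside the glyph's meaning; with the hypothetical expanded set, all three labels are covered.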

jakebeal commented 3 years ago

@Gonza10V Are there still questions open that you want to address regarding this specific issue, or can it be closed in favor of the new issues spun out from it?

Gonza10V commented 3 years ago

Thanks @shyambhakta for the prescribed abbreviations for antibiotics / resistances; I will implement them. Do they need the R superscript? I usually add it in whiteboard discussions, but I don't know if it should be added in paper figures.

Screen Shot 2020-10-14 at 23 38 53

@jakebeal This is my latest implementation of the figure. Despite the problems with sequence overlap and the reverse-complement display of parts, it should be OK. But now the Association node looks similar to the origin of replication glyph; is that a problem? I made it smaller and drew all the interactions in gray to distinguish them. Is it recommended to show interactions in other colors, or is black the recommended color?

I liked the point @shyambhakta makes about the resistances and about the best way to represent genes or transcriptional units without showing all their components. We opted for the arrow, but that is the second glyph for CDS, which, as you said, is not the correct way to represent a whole gene or gene cassette.

shyambhakta commented 3 years ago

Phenotypes are systematically written with only an initial capital letter, so Kanᴿ and Carᴿ/Ampᴿ. The ASM abbreviations are for the antibiotics alone, which I still don't like to write in all-caps; draws unnecessary attention I feel. Also, I'd prefer replication origins being labeled; gives an idea as to the gene dosage.

Gonza10V commented 3 years ago

I think we have discussed everything relevant to this topic; now we will follow the issues generated from it. Thanks a lot @jakebeal and @shyambhakta.