BEP 001 -- Representation of Parts and Devices for Build Planning

jakebeal commented 2 years ago

This SEP proposes a set of terminology and practices for representing genetic parts and functional devices at various stages of design, synthesis, and assembly. These practices are intended to represent any of the wide array of approaches based on embedding parts in carrier vectors, such as BioBricks, Gateway, MoClo, GoldenBraid, PhytoBricks, and other Type IIS methods.

Draft at: https://github.com/SynBioDex/SEPs/blob/master/sep_055.md

vinoo-igem commented 2 years ago

I'm seeing "construct" along with some modifiers (final, simpler, assembled, DNA, etc.). I don't know the SBOL terms, but does construct = designed nucleotide sequence?

jakebeal commented 2 years ago

Yes, I was using "construct" to mean "designed nucleotide sequence", but it is not an SBOL term --- I was specifically trying to avoid using "part", "device", or the SBOL term "Component" in the English definitions.

ethanj801 commented 2 years ago

Three questions.

How are genomically integrated constructs represented? In certain organisms (e.g. B. subtilis), it is standard practice to integrate devices directly into the genome (sometimes without even an intermediate plasmid amplification step). Is that out of the scope of this document? One could perhaps imagine a scheme where the parts or devices are associated with a genomic position or with a "genomic insertion site" vector (this would also allow representation of "native" devices).
How are multi-plasmid devices represented?
How is the hasMeasure property represented? Is it an object as well (presumably with measurement type, context validity/confidence, etc.)? What about multiple measurements for a given component? What about measurements that are given in reference to another part? I'm not quite sure I understand the schema. Here are a few examples of measurements that one would think our system would want to be able to handle.

A promoter has a measured "strength", measured in absolute units (e.g. PoPS or Transcription Rate)
A promoter has a measured "strength" measured in relative units (e.g. fluorescence in a mRFP producing device as compared to a standard reference device with J23100)
A device works in many organisms, but produces different amounts of a product in each.
A complicated device has multiple protein components whose concentrations are measured over time using a series of related devices with fluorescent fusion proteins for each expressed protein in the device.
The part-junction interference of simple devices that constitutively express a single protein product is measured.
The turnover rate or catalytic efficiency of a protein-coding sequence is measured in relative or absolute units.
A functional RNA device (e.g. an RNA thermometer) has multiple possible conformations. The precise conformation that the device takes is measured or calculated across a variety of different conditions.
The editing efficiency of a CRISPR/Cas base editor is measured on a particular DNA base in a fixed sequence. The efficiency differs depending on which gRNAs are used and what vector the fixed sequence is found in.

It just feels like there is such a large space of what measurement can mean, so I'm having trouble wrapping my head around how exactly we are representing it in a way that will enable useful tools/predictions.

As an example, one could imagine that if you were combining a promoter, 5' UTR and coding sequence, you could do some basic math and determine a predicted amount of protein product. You could then imagine that if there is some sort of known interaction between the UTR and CDS, that could be found and then factored into the calculation. A question I would have is what is the best way to store this interaction? Should it be associated with each component? Should it be associated with a SubComponent (are there issues if SubComponents are not annotated in the same fashion)? Should it be stored separately as a different type of object, with annotation then done by software at the time of prediction? I think intuitively we'd want to store the information and references to other components on the component (or a linked measurement object) itself to avoid searching. Although that might lead to bloat for commonly characterized parts, we could easily imagine a system that grabs only the relevant measurements (whether that be by category or reliability).
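To make the "basic math" idea concrete, here is a minimal sketch, assuming a purely multiplicative model in which each part contributes a unitless relative strength and known pairwise interactions are stored separately from the parts. The part names, the numbers, and the `INTERACTIONS` table are all hypothetical; nothing here is taken from SBOL or from measured data.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Part:
    name: str
    role: str        # "promoter", "utr", or "cds" (informal labels, not SO terms)
    strength: float  # hypothetical unitless relative strength


# Hypothetical pairwise interactions, stored apart from the parts themselves
# and keyed by the unordered pair of part names.
INTERACTIONS = {
    frozenset({"utrA", "cdsX"}): 0.5,  # assumed: this UTR/CDS pair halves output
}


def predict_output(parts, interactions=INTERACTIONS):
    """Multiply part strengths, then apply any known pairwise corrections."""
    result = 1.0
    for part in parts:
        result *= part.strength
    for i, a in enumerate(parts):
        for b in parts[i + 1:]:
            result *= interactions.get(frozenset({a.name, b.name}), 1.0)
    return result


promoter = Part("pStrong", "promoter", 2.0)
utr = Part("utrA", "utr", 1.5)
cds_x = Part("cdsX", "cds", 1.0)
cds_y = Part("cdsY", "cds", 1.0)

with_interaction = predict_output([promoter, utr, cds_x])
without_interaction = predict_output([promoter, utr, cds_y])
```

Keeping the interaction table separate corresponds to the "different type of object" option in the question above; attaching the same factor to each part instead would trade lookup cost for duplication.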

jakebeal commented 2 years ago

@ethanj801 Good questions; I'm adding my thoughts here and have updated the SEP with clarifications.

How are genomically integrated constructs represented. [snip]

I think in many cases it can be basically the same as for insertion into a plasmid backbone. The only question is whether there are better ways to represent an insertion locus than an index into a genome sequence. I don't know the answer to that.

I've added the following to the discussion section:

Although representation of genomic integration is not explicitly within scope of this SEP, in many cases genomic integration can be represented in the same way as insertion of a part into a backbone. This can be done simply by substituting the genome for the backbone.

How are multi-plasmid devices represented?

Most multi-part devices will be agnostic about whether they end up on one plasmid or multiple plasmids. For example, a TetR/pTet repressor device could end up with TetR on either the same plasmid or a different plasmid from pTet, depending on the design. We allow this simply by not specifying anything about the plasmid at the level of the device. When it gets incorporated into a larger system, we indicate it with constraints of locations in the composite part or parts that include the device.

I added the following to the draft:

A device with multiple parts might end up with those parts being placed at different locations within a single plasmid or being placed on more than one different plasmid. In order to allow flexibility in how a device is used, the device Component SHOULD allow for either option unless there is a functional reason to constrain designs otherwise (e.g., some recombinase devices require parts to be on the same strand).

How is the hasMeasure property represented?

The SBOL specification has an explanation of the hasMeasure property and some examples. As for broader usage recommendations on what, exactly, to record and in what contexts, I think we're not yet ready to commit on that and thus believe it's out of scope for this document. I've added a note to this effect in the discussion section.

ethanj801 commented 2 years ago

Most multi-part devices will be agnostic about whether they end up on one plasmid or multiple plasmids. For example, a TetR/pTet repressor device could end up with TetR on either the same plasmid or a different plasmid from pTet, depending on the design. We allow this simply by not specifying anything about the plasmid at the level of the device. When it gets incorporated into a larger system, we indicate it with constraints of locations in the composite part or parts that include the device.

Perhaps I am misinterpreting what you are saying, but I'm not sure this is true. How many plasmids you use (and which plasmid origins you use) has a big impact on the functionality of the device. One could imagine quite different looking transfer functions between a pTet GFP that is being repressed by a constitutively driven TetR on a medium-low copy number plasmid vs a high copy number plasmid. Using one vs two plasmids could also impact what other devices are able to be integrated along with it (due to origin or antibiotic resistance incompatibility). By my reckoning, even the simple choice of using two different plasmids should fundamentally increase the noise of the device due to plasmid copy number fluctuations (though maybe this isn't a problem for the more tightly regulated plasmids).

jakebeal commented 2 years ago

That's a very good point, and comes back to the question of what models and measurements we will find useful to associate with the devices, and how sensitive the devices are to particular variations. For example, in transient transfection of mammalian systems, we've found that one plasmid vs. multiple plasmids has little effect on device behavior (the plasmids aren't replicating and are delivered in high numbers).

I think the right way to deal with this is likely to be to have devices include whatever context information we believe is important in as abstract a form as possible. For example, if the TetR/pTet repressor device is defined in terms of a high-copy plasmid only, then it might include two abstract plasmids, each marked with information about copy-count and containing one of the parts, but without its identity specified. Then when the device is used, the abstract plasmids would be given an identity constraint with either one or two real plasmids. In a one-plasmid system, they get identified with the same real plasmid; in a two-plasmid system, they get identified with different plasmids.
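The abstract-plasmid idea can be sketched in plain Python. This is only an illustration under stated assumptions: the class names, the `bind` helper, and the coarse `copy_number` string are hypothetical and are not SBOL classes or constraints.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AbstractPlasmid:
    label: str
    copy_number: str  # coarse annotation such as "high" (hypothetical)
    carries: str      # the part placed on this abstract plasmid


@dataclass(frozen=True)
class RealPlasmid:
    name: str
    copy_number: str


def bind(abstract_plasmids, assignment):
    """Identify each abstract plasmid with a real one and check copy number.

    Returns a mapping from real plasmid name to the list of parts it carries.
    """
    layout = {}
    for ap in abstract_plasmids:
        real = assignment[ap.label]
        if real.copy_number != ap.copy_number:
            raise ValueError(
                f"{ap.label} requires a {ap.copy_number}-copy plasmid, "
                f"but {real.name} is {real.copy_number}-copy"
            )
        layout.setdefault(real.name, []).append(ap.carries)
    return layout


# The device only states that each part sits on some high-copy plasmid.
device = [
    AbstractPlasmid("slot_a", "high", "TetR"),
    AbstractPlasmid("slot_b", "high", "pTet"),
]
p1 = RealPlasmid("pHigh1", "high")
p2 = RealPlasmid("pHigh2", "high")

one_plasmid = bind(device, {"slot_a": p1, "slot_b": p1})
two_plasmid = bind(device, {"slot_a": p1, "slot_b": p2})
```

The same device definition yields both layouts; only the assignment supplied at use time differs.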

Using one vs two plasmids could also impact what other devices are able to be integrated along with it [snip]

Absolutely, and that's why I want to leave the plasmid locations agnostic when possible, to allow flexibility in choosing how to organize functional units in a larger system that uses the device.

GC-repeat commented 2 years ago

Where would a linear fragment (PCR or synthesis) composed of a unitary part with 5' and 3' flanking sequences for restriction-digest-based or Gibson assembly fall? Based on the previous discussion I think it would go in 'part in backbone': "Part in carrying vector - this is a sample in the distribution. The vector might be null (e.g., a linear fragment)". If so, should we add 'The vector might be null (e.g., a linear fragment)' to the SEP text too?

jakebeal commented 2 years ago

@GC-repeat I believe it depends on the specifics of the flanking sequences.

If the flanking sequences have been added for synthesis into a backbone, then it would be a part insert.
If the flanking sequences are double-stranded and waiting to be digested, then it would be a part in vector (where the vector happens to be linear and pretty small).
If the flanking sequences are single-stranded overhangs ready for ligation, then it would be a part extract.

Were you thinking of one of those scenarios, or something else?
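The three cases above amount to a small lookup from the state of the flanking sequences to a category. A minimal sketch, assuming the state of the flanks is already known; the `FlankState` enum and function name are hypothetical, while the three category names come from the list above.

```python
from enum import Enum


class FlankState(Enum):
    FOR_SYNTHESIS_INTO_BACKBONE = "added for synthesis into a backbone"
    DOUBLE_STRANDED_UNDIGESTED = "double-stranded, waiting to be digested"
    SINGLE_STRANDED_OVERHANGS = "single-stranded overhangs ready for ligation"


# Category names as used in the discussion; the mapping itself is the rule.
CATEGORY = {
    FlankState.FOR_SYNTHESIS_INTO_BACKBONE: "part insert",
    FlankState.DOUBLE_STRANDED_UNDIGESTED: "part in vector",
    FlankState.SINGLE_STRANDED_OVERHANGS: "part extract",
}


def classify_linear_fragment(flanks: FlankState) -> str:
    """Return the category for a linear fragment given its flank state."""
    return CATEGORY[flanks]
```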

GC-repeat commented 2 years ago

That makes sense. Yes, I was thinking about those scenarios. A part for Gibson assembly would correspond to scenario 1, and one for restriction-digest-based assembly to scenario 2. Concerning scenario 2, the vector may be non-replicative, so it would not be a backbone, as a backbone takes the role SO:vector_replicon, right?

jakebeal commented 2 years ago

If I'm understanding correctly, an example of such a construct for scenario 2 could be a 1000-bp linear double-stranded DNA construct with the following structure: 5'padding-BioBrickprefix-CDS-BioBricksuffix-3'padding.

In this case, then you are right: it wouldn't be a replicon because it has no origin of replication --- it's just something that we've produced an aliquot of via synthesis. It's a vector in backbone, but the backbone is not a replicon. We would then need to fall back to the generic SO:engineered_region, since neither parent applies (SO:replicon and SO:clone). Does that sound correct to you?
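The fallback described here can be sketched as a simple check for an origin of replication. This is an illustration only: the `Segment` class and the `kind` tags are hypothetical, and the role strings are the names used in this thread rather than resolved Sequence Ontology accession numbers.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Segment:
    label: str
    kind: str  # informal tag such as "padding", "prefix", "cds", "origin_of_replication"


def carrier_role(segments):
    """Pick the replicon role only when an origin of replication is present."""
    has_origin = any(s.kind == "origin_of_replication" for s in segments)
    return "SO:vector_replicon" if has_origin else "SO:engineered_region"


# The scenario 2 example: 5'padding-BioBrickprefix-CDS-BioBricksuffix-3'padding
linear_fragment = [
    Segment("5' padding", "padding"),
    Segment("BioBrick prefix", "prefix"),
    Segment("CDS", "cds"),
    Segment("BioBrick suffix", "suffix"),
    Segment("3' padding", "padding"),
]

# A hypothetical plasmid backbone for comparison.
plasmid_backbone = [
    Segment("ori", "origin_of_replication"),
    Segment("resistance marker", "cds"),
]

linear_role = carrier_role(linear_fragment)
plasmid_role = carrier_role(plasmid_backbone)
```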

jakebeal commented 2 years ago

@GC-repeat I've put an update in a pull request; can you please take a look and see if this handles this case well? https://github.com/SynBioDex/SEPs/pull/114

GC-repeat commented 2 years ago

If I'm understanding correctly, an example of such a construct for scenario 2 could be a 1000-bp linear double-stranded DNA construct with the following structure: 5'padding-BioBrickprefix-CDS-BioBricksuffix-3'padding.

In this case, then you are right: it wouldn't be a replicon because it has no origin of replication --- it's just something that we've produced an aliquot of via synthesis. It's a vector in backbone, but the backbone is not a replicon. We would then need to fall back to the generic SO:engineered_region, since neither parent applies (SO:replicon and SO:clone). Does that sound correct to you?

That sounds good to me.

jakebeal commented 2 years ago

Thank you; I've merged the update in.

pengbingyin commented 2 years ago

Yes, I was using "construct" to mean "designed nucleotide sequence", but it is not an SBOL term --- I was specifically trying to avoid using "part", "device", or the SBOL term "Component" in the English definitions.

I prefer to use 'construct' to refer to something that has been fully constructed. I commonly use 'fragment' to refer to the 'parts'. I used 'segment' before, but I am not really sure about the meaning of 'segment'.

From a genetic engineering point of view, there should be attribution properties: HasParentID (this can be a single ID or a combination of IDs) and GeneratedbyAction (like 'digestion of HasParentID using enzyme B, purified size = xxx bp', 'PCR amplification from HasParentID using oligo ID A and oligo ID B', 'Assembly from HasParentID').

If these are not relevant, please pardon me.

jakebeal commented 2 years ago

@pengbingyin In many cases, a part may actually be a fully constructed system as well (particularly composite parts): it all depends on the design goals and whether a person later chooses to combine it with something else.

With regards to the representation of attribution: this document recommends a representation using the prov:wasGeneratedBy property to link to a prov:Activity representing an assembly plan, with the assembly plan represented by a network of reactions (e.g., digestion and ligation). I believe that this model can represent the sort of structures that you are describing. Can you please take a look more deeply and say if you see aspects that you believe are important that are unable to be represented under the proposal?
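For readers thinking in terms of HasParentID, here is a minimal sketch of how parent lookup could fall out of the provenance link, written as plain subject-predicate-object tuples rather than with an RDF library. Assumptions: the `https://example.org/build/` identifiers are hypothetical, and `prov:used` (a standard PROV-O property) stands in for the reaction network that the SEP actually uses to describe the assembly plan, which is omitted here.

```python
PROV = "http://www.w3.org/ns/prov#"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
EX = "https://example.org/build/"  # hypothetical namespace for this example

# Each triple is (subject, predicate, object).
triples = {
    (EX + "composite_part", PROV + "wasGeneratedBy", EX + "assembly_plan"),
    (EX + "assembly_plan", RDF_TYPE, PROV + "Activity"),
    # Simplification: inputs attached directly to the activity via prov:used.
    (EX + "assembly_plan", PROV + "used", EX + "part_A_in_backbone"),
    (EX + "assembly_plan", PROV + "used", EX + "part_B_in_backbone"),
}


def objects(subject, predicate):
    """Return all objects of triples matching the subject and predicate."""
    return {o for s, p, o in triples if s == subject and p == predicate}


def parents(product):
    """Follow wasGeneratedBy to the activity, then collect what it used."""
    found = set()
    for activity in objects(product, PROV + "wasGeneratedBy"):
        found |= objects(activity, PROV + "used")
    return found


composite_parents = parents(EX + "composite_part")
```

A "GeneratedbyAction" description would correspond to the content of the activity itself, which in the proposal is the network of reactions.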

pengbingyin commented 2 years ago

@pengbingyin In many cases, a part may actually be a fully constructed system as well (particularly composite parts): it all depends on the design goals and whether a person later chooses to combine it with something else.

With regards to the representation of attribution: this document recommends a representation using the prov:wasGeneratedBy property to link to a prov:Activity representing an assembly plan, with the assembly plan represented by a network of reactions (e.g., digestion and ligation). I believe that this model can represent the sort of structures that you are describing. Can you please take a look more deeply and say if you see aspects that you believe are important that are unable to be represented under the proposal?

There are several ways to do cloning.

(1) Non-Golden Gate methods: the backbone is usually digested into an intermediate form, a linear DNA fragment. In this case, 'Insertion Sites and Drop-Out Sequences' cannot be pre-defined, unless the backbone comes from a commercial cloning kit and is used for one-step cloning. Most cloning work requires a process of sequence analysis and decision making. This could be challenging.

(1.1) The part can be a PCR fragment: this requires defining the oligos (annealing sequence + overhang sequence) and the template.

(2) Golden Gate method: the parts and backbones are in the form of circular plasmids. In this case, it is necessary to define the Golden Gate levels. See https://www.researchgate.net/publication/310780764_Editing_of_the_urease_gene_by_CRISPR-Cas_in_the_diatom_Thalassiosira_pseudonana/figures?lo=1

(3) A new method for plasmid cloning has been reported: https://www.biorxiv.org/content/10.1101/2021.12.31.474679v1.full.pdf. Maybe this should be considered as well, given its novelty.
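For the PCR case in (1.1), the information needed is two oligos, each split into an annealing sequence and an overhang, plus a template. A minimal sketch of how those pieces determine the product, assuming exact matching, a linear template, and a single binding site per primer; the function names and the short sequences are hypothetical and far shorter than real primers.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def revcomp(seq):
    """Reverse complement of an uppercase DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def pcr_product(template, fwd_anneal, rev_anneal, fwd_overhang="", rev_overhang=""):
    """Return the top strand of the amplicon, written 5' to 3'.

    fwd_anneal matches the template's top strand; rev_anneal is written
    5' to 3' and matches the bottom strand. Overhangs are the 5' tails
    of each primer and end up at the two ends of the product.
    """
    start = template.find(fwd_anneal)
    if start == -1:
        raise ValueError("forward primer does not anneal to the template")
    rev_site = revcomp(rev_anneal)  # where the reverse primer sits on the top strand
    end = template.find(rev_site, start + len(fwd_anneal))
    if end == -1:
        raise ValueError("reverse primer does not anneal downstream of the forward primer")
    amplified = template[start:end + len(rev_site)]
    return fwd_overhang + amplified + revcomp(rev_overhang)


template = "TTATGGCACCCCTGATTCAA"
product = pcr_product(
    template,
    fwd_anneal="ATGGCA",
    rev_anneal="GAATCA",
    fwd_overhang="GAATTC",
    rev_overhang="AAGG",
)
```

In the proposal's terms, a record like this would be one reaction in the assembly plan, with the template and oligos as inputs and the amplicon as the output.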

jakebeal commented 2 years ago

@pengbingyin You are not really answering my question. Are any of these unable to be represented with the current proposal?

pengbingyin commented 2 years ago

@pengbingyin You are not really answering my question. Are any of these unable to be represented with the current proposal?

With regards to the representation of attribution: this document recommends a representation using the prov:wasGeneratedBy property to link to a prov:Activity representing an assembly plan, with the assembly plan represented by a network of reactions (e.g., digestion and ligation). I believe that this model can represent the sort of structures that you are describing. Can you please take a look more deeply and say if you see aspects that you believe are important that are unable to be represented under the proposal?

prov:wasGeneratedBy and prov:Activity should be sufficient.

jakebeal commented 2 years ago

@pengbingyin Thank you for your contribution and for taking the time to make a careful assessment! If you want to contribute examples for inclusion, a pull request including such would be welcomed as well!

jakebeal commented 2 years ago

@nroehner has added a bunch of diagrams to the examples section of the SEP to illustrate the representations.

ethanj801 commented 2 years ago

Thank you @nroehner this is much needed I think.

jakebeal commented 2 years ago

I realized that the current proposal doesn't take advantage of the Interface option on Component to explicitly indicate the inputs and outputs of an assembly reaction. I have also added an explicit statement that it's OK to have build plans that produce multiple different composite parts (e.g. sharing intermediates, or considering some intermediates as finished products).

I have set up a pull request for this change: https://github.com/SynBioDex/SEPs/pull/117 , and would appreciate folks commenting there if they like or dislike this change.
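The idea of explicit inputs and outputs, and of one build plan yielding several products, can be illustrated with a small sketch. It assumes each reaction simply lists what it consumes and produces; the `Reaction` class and `plan_interface` helper are hypothetical and are not the SBOL Interface class or its API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Reaction:
    name: str
    inputs: frozenset
    outputs: frozenset


def plan_interface(reactions, also_products=frozenset()):
    """Derive the overall inputs and outputs of a build plan.

    Inputs are consumed but never produced. Outputs are produced but never
    consumed, plus any intermediates explicitly listed in also_products.
    """
    produced = set().union(*(r.outputs for r in reactions))
    consumed = set().union(*(r.inputs for r in reactions))
    inputs = consumed - produced
    outputs = (produced - consumed) | (set(also_products) & produced)
    return inputs, outputs


# Two composites that share the intermediate AB.
plan = [
    Reaction("assemble_AB", frozenset({"A", "B"}), frozenset({"AB"})),
    Reaction("assemble_ABC", frozenset({"AB", "C"}), frozenset({"ABC"})),
    Reaction("assemble_ABD", frozenset({"AB", "D"}), frozenset({"ABD"})),
]

inputs, outputs = plan_interface(plan)
_, outputs_with_intermediate = plan_interface(plan, also_products={"AB"})
```

Listing AB in `also_products` corresponds to treating an intermediate as a finished product in its own right.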
