Analytical PCR "recommended oligo" code needs to be made generic or removed from v1.0 CC

bthuronyi commented 1 month ago

Analytical PCR tab currently has "recommended oligo" code (for 2 primer pairs) that works ok for our specific Golden Gate construct set but doesn't scale to other GG sets or handle GGs that don't have 8 parts of typical sizes.

The current algorithm hard-codes a response for 8-part GGs and it selects:

F primer for part 2 and R primer for part 5
F primer for part 7 and R primer for part 2

The logic for this hard-coding is that these primer pairs work ok for the typical part lengths and types we use.

We currently store F and R primers for many (but possibly not all) GG parts. Those fields are optional and probably many CC users won't want to always provide an analytical PCR primer for every part, so we should account for situations where some parts don't have primers available.

A general approach will need to use the information in Registry_part_long_short for each rID in the assembly, along with these considerations:

We want to suggest the minimum number of analytical PCRs that will verify all parts in an assembly
We want to tag each suggested PCR with the part(s) that it will verify
A short part MUST have either its forward or reverse primer used in a PCR to be verified present
A long part can be verified present if EITHER its forward or reverse primer is used OR if a PCR spans the part
- e.g. if a PCR uses primers from parts 1 F and 3 R, then if part 2 is long, that PCR verifies 1, 2, and 3
Ideally, but probably not important for a first implementation, the PCR sizes will not be too long (e.g. avoid PCRing across the entire construct) and will be different from each other within the same construct -- but things still work ok if both these are not true
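
The verification rules above can be pinned down in a few lines. This is a minimal sketch, not CloneCoordinate code: the function name `parts_verified` and the "S"/"L" string encoding of an assembly are assumptions made for illustration, and Python is used only to state the logic (the real implementation would presumably be spreadsheet formulas). It treats the plasmid as circular, so a PCR may run past part n back into part 1.

```python
def parts_verified(f_part, r_part, kinds):
    """Return the part numbers (1-based) that one PCR verifies.

    kinds: one letter per part, "S" for short or "L" for long, e.g. "SLS"
    (a hypothetical encoding). The amplicon runs forward from f_part to
    r_part, wrapping from the last part back to part 1 if r_part < f_part.
    """
    n = len(kinds)
    if not (1 <= f_part <= n and 1 <= r_part <= n):
        raise ValueError("part numbers must be between 1 and n")
    if f_part == r_part:
        raise ValueError("F and R primers must come from different parts")
    verified = [f_part]              # the F-primer part is always verified
    p = f_part % n + 1               # next part on the circle (n -> 1)
    while p != r_part:
        if kinds[p - 1] == "L":      # a long part spanned by the PCR counts
            verified.append(p)
        p = p % n + 1
    verified.append(r_part)          # the R-primer part is always verified
    return verified


print(parts_verified(1, 3, "SLS"))   # [1, 2, 3]
print(parts_verified(1, 3, "SSS"))   # [1, 3]
```

The first call is the example from the list: primers 1 F and 3 R with a long part 2 verify parts 1, 2, and 3. With a short part 2, only the two primer parts are verified.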

bthuronyi commented 1 month ago

Thinking through possible implementation considerations:

An assembly with n parts, all of them short, will use n/2 PCRs (rounded up). This is also a trivial solution for any assembly if we ignore the ability to verify long parts by PCRing across them.
We can start from this trivial solution and then see which PCRs we can drop based on part(s) being long
PCR 1 could always use 1 F
- The R primer could be from the part AFTER the next long part
- e.g. if part 2 is long, PCR 1 uses 3 R
- if part 2 is short and 3 is long, PCR 1 uses 4 R
- if parts 2 through n inclusive are all short, then default to n R
Then determine which parts are verified by PCR 1 and output the result to a cell
- part 1 is always verified
- the R primer part is always verified
- any long parts between 1 and the R part are verified
Next PCR looks at the PCR 1 verification list
- use the F primer from the first not-verified-yet part
  - if all parts are verified already, then return blank (but see below)
- same approach as PCR 1 for the R primer -- use the R primer from the part AFTER the next long not-yet-verified part
- if all not-yet-verified parts are short, use the largest number one for R
- output verified parts -- not clear if this should be cumulative or specific to PCR 2; cumulative makes formulas easier but specific to PCR 2 is useful for users to see
- if there are no more not-yet-verified parts besides the F primer one, then
  - use the R primer from the part AFTER the next long part (even if it's already verified)
  - if there is no next long part up to part n, then start again with part 1 and continue on
    - if you get back to the same part we picked the F primer for, then just use the R primer from the part before this one
Further PCRs can continue to use this same approach I think
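
The selection steps above can be sketched as a loop. This is one possible reading, written in Python only to make the logic concrete; the names (`suggest_pcrs`, `_choose_r`, and so on) and the "S"/"L" string encoding are hypothetical. Two assumptions are not in the description: when the next long part is part n itself, the sketch falls back to n R, and assemblies with fewer than two parts are rejected. It only tracks which parts are verified and does nothing about junction order or PCR length.

```python
def _after(p, n):
    return p % n + 1                 # next part on the circle (n -> 1)


def _verified(f, r, kinds):
    """F part, R part, and every long part spanned between them."""
    out, p = [f], _after(f, len(kinds))
    while p != r:
        if kinds[p - 1] == "L":
            out.append(p)
        p = _after(p, len(kinds))
    return out + [r]


def _choose_r(f, kinds, verified):
    n = len(kinds)
    todo = [p for p in range(f + 1, n + 1) if p not in verified]
    if todo:
        for p in todo:
            # R primer from the part AFTER the next long unverified part.
            # Assumption: if that long part is part n, fall through to n R.
            if kinds[p - 1] == "L" and p < n:
                return p + 1
        return todo[-1]              # otherwise: highest-numbered unverified
    # Nothing left to verify except f: look for any long part, wrapping.
    p = _after(f, n)
    while p != f:
        if kinds[p - 1] == "L" and _after(p, n) != f:
            return _after(p, n)
        p = _after(p, n)
    return (f - 2) % n + 1           # back at f: use the part before it


def suggest_pcrs(kinds):
    """Return a list of (f_part, r_part, parts_verified) tuples."""
    n = len(kinds)
    if n < 2:
        raise ValueError("need at least two parts")
    verified, pcrs = set(), []
    while len(verified) < n:
        # F primer comes from the first part not verified yet (part 1 first).
        f = min(p for p in range(1, n + 1) if p not in verified)
        r = _choose_r(f, kinds, verified | {f})
        hits = _verified(f, r, kinds)
        pcrs.append((f, r, hits))
        verified.update(hits)
    return pcrs


print(suggest_pcrs("SSLSLL"))
# [(1, 4, [1, 3, 4]), (2, 6, [2, 3, 5, 6])]
```

Each tuple reports the parts verified by that PCR alone rather than a cumulative list. For the all-short fallback the sketch applies the "part before this one" rule literally, so other readings of that rule may pick a different R primer.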

Testing this algorithm by hand: SSLSLL (n=6) >
1F 4R verifies 1,3,4
2F 6R verifies 2,3,5,6

Whoops - even though we verified all 6 parts, we didn't confirm that 1 connects to 6. That is, what we actually need to verify is the part-to-part junctions and their order, not the parts themselves. It's tricky to reframe the "verified" terminology this way -- we don't actually verify the junctions between short parts, even by using primers on each one, because there could be some short part in-between and we won't see it -- so instead just do a final check for some PCR going across the 6-1 junction.

Add to PCRs 2+ logic, IN ADDITION TO checking whether all parts are already verified:

If the previous PCR verification list wraps around -- the final part number it verifies is smaller than the initial one -- then we're done.
If it doesn't, do one more PCR according to the existing criteria. Presumably if we already verified all the parts but just didn't meet the wrap-around criterion, the F primer will be selected for whatever part corresponds to the PCR number, and the R primer will end up crossing the n to 1 border because all parts up to n are already verified, and we'll end up using the R primer for the part before the F primer.
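
A minimal sketch of the wrap-around check, in Python for illustration only. It reads the criterion as "the R-primer part has a lower number than the F-primer part", which is the only way a forward-running PCR can cross the n-to-1 junction. The function names are hypothetical, and unlike the wording above it looks at every PCR suggested so far rather than only the previous one.

```python
def wraps_around(f_part, r_part):
    """True if the amplicon crosses the n-to-1 junction.

    Going forward from the F part, the R part can only have a smaller
    number if the PCR passed part n and continued into part 1.
    """
    return r_part < f_part


def needs_closing_pcr(pcrs):
    """pcrs: list of (f_part, r_part) pairs suggested so far."""
    return not any(wraps_around(f, r) for f, r in pcrs)


print(needs_closing_pcr([(1, 4), (2, 6)]))          # True
print(needs_closing_pcr([(1, 5), (2, 4), (3, 1)]))  # False
```

The first call is the SSLSLL case, where 1F 4R and 2F 6R cover all six parts but never cross the 6-1 junction, so one more PCR is needed.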

Test for SSSSS (n=5) >
1F 5R verifies 1,5
2F 4R verifies 2,4
3F 1R verifies 3,1
wraparound criterion met -- done

For PCR 1 we can't use 1R because it might be a reverse complement of 1F, and 2R would give a very short band plus one that wraps around so it's not ideal. In fact we need to make sure we don't ever use the F and R primers from the same part. Actually using F and R from the same part is ok for long parts, but it also seems unimportant to allow it.

Here, we were saved from doing a too-short PCR by the odd number of total parts. What about SSSS (n=4)?
1F 4R verifies 1,4
2F 3R verifies 2,3 but is likely to be way too short (the primers are supposed to be at the edges of the parts, so these might be only a few bases apart).

We need a new criterion, which is that F and R primers from adjacent part numbers aren't used together (that is, the R primer never comes from the part immediately after the F primer's part). This is bad even for long parts. So we would do: SSSS (n=4) >
1F 4R verifies 1,4
2F 1R verifies 2,1 (wraparound)
3F 2R verifies 3,2 (wraparound)

Looks good!

LLLL (n=4) >
1F 3R verifies 1,2,3
4F 2R verifies 4,1,2 (wraparound)
Good!

LLLLSSSS (n=8) >
1F 3R verifies 1,2,3
4F 8R verifies 4,8
5F 7R verifies 5,7 -- is this ok? maybe too short... Should our criterion be about having >1 short part "inside" each PCR?
6F 1R -- I think, depending on how it's coded -- might also be 6F 2R, but either way 6 gets verified and we wrap around.

Open question: reject PCRs that have only 1 short part inside? Might not be too hard to do. We already need to check that the part numbers are not adjacent or the same, and in the case that the part numbers are separated by 1 part, we can check whether that part is short -- but need to handle wrapping.
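
The wrap handling is mostly a matter of counting the parts "inside" a PCR with modular arithmetic. Below is a minimal sketch in Python, for illustration only; `inner_parts`, `pair_ok`, and the `reject_single_short` switch are hypothetical names, and the "S"/"L" string encoding is an assumption. Since the single-short-part rule is still an open question, it is left as an option.

```python
def inner_parts(f_part, r_part, n):
    """Parts strictly between F and R, going forward and wrapping n -> 1."""
    count = (r_part - f_part - 1) % n
    return [(f_part + k - 1) % n + 1 for k in range(1, count + 1)]


def pair_ok(f_part, r_part, kinds, reject_single_short=True):
    """Apply the 'not the same part, not adjacent' checks to one PCR.

    kinds: one letter per part, "S" (short) or "L" (long).
    """
    if f_part == r_part:
        return False                 # F and R from the same part
    inside = inner_parts(f_part, r_part, len(kinds))
    if not inside:
        return False                 # R part directly follows the F part
    if reject_single_short and len(inside) == 1:
        if kinds[inside[0] - 1] == "S":
            return False             # only one short part inside
    return True


print(inner_parts(4, 2, 4))              # [1]
print(pair_ok(5, 7, "LLLLSSSS"))         # False
print(pair_ok(4, 8, "LLLLSSSS"))         # True
print(pair_ok(5, 7, "LLLLSSSS", False))  # True
```

With the option on, 5F 7R in LLLLSSSS is rejected because only the short part 6 lies inside it, while 4F 8R passes with three parts inside.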

bthuronyi commented 1 month ago

The above code seems doable so I think we should try to implement this general approach for v1.0.

bthuronyi commented 1 month ago

A trickier case is if some of the selected primers don't exist. We could just punt on that one: if we would be recommending a given primer but find that the Registry entry for it is empty, then just replace it with "Forward primer for part r### - not listed". If the user wants to use our algorithmic analytical PCR recommendation they can queue an appropriate primer and register it, and if not, they can do it by hand for that case.
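
A minimal sketch of that fallback, in Python for illustration only. How primers are actually stored in the Registry is not shown here, so the dictionary keyed by (rID, direction), the function name, and the example IDs (`r101`, `o55`) are all hypothetical.

```python
def primer_or_placeholder(primers, rid, direction):
    """Return the registered primer, or a 'not listed' note if it is empty.

    primers: dict mapping (rID, "F" or "R") to a primer name; a missing
    key, None, or a blank string all count as "not listed".
    """
    name = (primers.get((rid, direction)) or "").strip()
    if name:
        return name
    word = "Forward" if direction == "F" else "Reverse"
    return f"{word} primer for part {rid} - not listed"


registry = {("r101", "F"): "o55", ("r101", "R"): ""}   # made-up entries
print(primer_or_placeholder(registry, "r101", "F"))
# o55
print(primer_or_placeholder(registry, "r101", "R"))
# Reverse primer for part r101 - not listed
```

The recommendation is still produced in full this way; the user sees exactly which primer is missing and can register one or design that PCR by hand.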

bthuronyi commented 1 month ago

A more sophisticated approach if some primer(s) aren't available is to skip over those parts and keep going to the next choice you would make, but then you need to define a failure criterion where you can't design that PCR... and I'm afraid our overall algorithm as I designed it is too fragile to deal with that.

bthuronyi commented 1 month ago

"Parts verified" column would be automatically populated by formulas for each suggested PCR. If we don't implement automatic primer recommendations, we should make that column optional (blue header), give users write access, and make it a Named Range.

shen2333333 commented 1 month ago

Sorry it took so long for me to get to this; here are some of my thoughts.

Thinking about the big picture a little, and refreshing myself on the progress I made in my senior year: analytical PCR is part of the "build" step and acts as a QC for verifying the construction of the plasmid. The best way is obviously to sequence, especially whole-plasmid sequencing by nanopore (e.g. plasmidsaurus) for large plasmids like Golden Gates with multiple parts. It's getting relatively cheap, $15 per plasmid, but it's still a good idea to check that the purified plasmid from a miniprep is at least likely to be the construct we want rather than just dump $60 on all 4 minipreps, for example. That's where analytical PCR comes in.

Although we typically do 8-part Golden Gate because of the Marburg system we are working with (correct me if I'm wrong), NEB says Golden Gate itself is possible with many more parts (up to 50+), as you already know. I don't think the goal of analytical PCR is necessarily to verify every single part before we send them to sequencing; that would be overkill. My feeling is that diminishing returns will kick in really fast (see if you agree with me). As a hypothetical example, for an 8-part construct, doing 1 PCR and having it verified by gel might give an 80% chance that the entire plasmid is built correctly, doing a second one increases the chance to 90%, and a third one to 99%. Then it's about balancing the # of PCRs that need to be done against being confident enough that it is a correct construct.

The central question is: doing at least 1 PCR is probably a good idea for a construct with lots of parts, but how many PCRs do we need to do to feel confident enough before we send it to sequencing?

Bunch of approaches I envisioned

1. No work on our side. Remove the recommended oligo code. We let the user do everything: select and/or design primers on parts of interest to increase confidence before sending off to sequencing (if deemed necessary).
2. Some work on our side. Suggest one (or two) PCRs to do, then let the user decide if that's good enough to move on to sequencing; if not, the user will select/design further primer pairs to keep testing.
3. Lots of work on our side. The approach you suggested, which provides a general way to determine the minimum number of PCRs needed to verify all the parts. I'm not sure I have the bandwidth on my side, but it's certainly an interesting coding project for students who are interested. Even better, since we have some past data, we can get an idea of the # of PCRs that were done and whether sequencing verified the construct at the end, and maybe show a confidence metric, something like "we are now 90% confident that the plasmid is correctly put together after the 2 primer pairs we tested were verified by gel".

My current feeling is leaning toward 1 or 2, but I don't mind 3 (I just think it's a lot of work and I might not be able to flesh it out in a short period of time); it's a very interesting project though.

bthuronyi commented 1 month ago

"I don't think the goal of analytical PCR is necessarily to verify every single part before we send them to sequencing; that would be overkill. My feeling is that diminishing returns will kick in really fast (see if you agree with me). As a hypothetical example, for an 8-part construct, doing 1 PCR and having it verified by gel might give an 80% chance that the entire plasmid is built correctly, doing a second one increases the chance to 90%, and a third one to 99%. Then it's about balancing the # of PCRs that need to be done against being confident enough that it is a correct construct."

Yeah, this is correct if analytical PCR is primarily used to increase the success rate of sequencing. One other thing people might use it for is to bypass sequencing and instead adopt an mID based on analytical PCR confirmation only. This could be helpful if there's a large number of constructs to make; PCR is quite easily scaled and gels can be somewhat easy to scale as well. It can also be faster than sequencing (though this is changing with nanopore availability) in terms of overall turnaround time.

bthuronyi commented 1 month ago

That said, with where v1.0 is at the moment, I think (2) is a good compromise for that release and if it seems like a lot of work we could go to (1). I would love to include approach (3) in v2.0!

bthuronyi / CloneCoordinate

Analytical PCR "recommended oligo" code needs to be made generic or removed from v1.0 CC #133