ericmjl / protein-systematic-characterization

All our protocols, data, analysis, and papers related to this project are stored here.
MIT License

Tile DNA Synthesis Simulation #65

Open ericmjl opened 7 years ago

ericmjl commented 7 years ago

@Starnite: all the best for your finals! I'm going to leave here one research question and one coding challenge that will be useful for this work.

  1. Say you're given a set of n full-length PB2 sequences that are each split into 150 nucleotide fragments with 30 nucleotide overlaps. Assuming that sequences require perfect matches in order to anneal, how many different combinations of tiled fragments can be assembled? What fraction of the total is the original population?
  2. Can you design and implement an API (application programming interface), complete with documentation, that lets a user treat the code as a black box? You can choose either a functional or an object-oriented interface; both should work well.
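One possible starting point for both questions is sketched below. It is a minimal sketch under stated assumptions, not a settled design: the function names (`tile`, `count_assemblies`, `fraction_original`) are hypothetical, all sequences are assumed to be aligned and of equal length, and two tiles are assumed to join only when their overlaps match exactly. Under those assumptions each assembly is a path through the unique tiles at successive positions, so the total can be counted position by position without enumerating every product.

```python
from collections import Counter


def tile(seq, size=150, overlap=30):
    """Split seq into tiles of `size` nt that overlap by `overlap` nt.

    The last tile may be shorter than `size`.
    """
    step = size - overlap
    tiles = []
    start = 0
    while True:
        tiles.append(seq[start:start + size])
        if start + size >= len(seq):
            break
        start += step
    return tiles


def count_assemblies(seqs, size=150, overlap=30):
    """Count distinct full-length assemblies of the pooled tiles.

    Assumption: sequences are equal-length and tiles anneal only on a
    perfect match across the whole overlap.
    """
    tiled = [tile(s, size, overlap) for s in seqs]
    n_positions = len(tiled[0])
    # Unique tiles available at each tile position.
    layers = [{t[i] for t in tiled} for i in range(n_positions)]
    # paths[t] = number of partial assemblies that end in tile t.
    paths = {t: 1 for t in layers[0]}
    for i in range(1, n_positions):
        by_suffix = Counter()
        for t, n in paths.items():
            by_suffix[t[-overlap:]] += n
        paths = {t: by_suffix[t[:overlap]] for t in layers[i]}
    return sum(paths.values())


def fraction_original(seqs, size=150, overlap=30):
    """Fraction of all possible assemblies that are original sequences."""
    return len(set(seqs)) / count_assemblies(seqs, size, overlap)
```

For example, with toy parameters `size=6, overlap=2`, the two sequences `"AAAAAAAAAA"` and `"CAAAAAAAAC"` share the same overlap, so `count_assemblies` returns 4 and `fraction_original` returns 0.5.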

I'm off for the next two weeks, and I know you're off for a few weeks during IAP as well, so let me know when you're back, and we can interface again then. All the best for your finals, and have fun in China too!

Starnite commented 7 years ago

Thanks Eric! I’ll be back Jan 7th for a bit more than a week, and can get more lab work done during that time.

Could I come use the 3D printer sometime tomorrow?

ericmjl commented 7 years ago

No problem to print your stuff on the 3D printer! Though do give Wendy priority as she is printing supplies for seal sampling excursions in December. From what I know, she has prints going first thing in the morning (~8 am) and then another one ~2 pm, before she takes off to go home, and each print should take ~2.5 hours.

The manuals for operating the printer are available online - search MakerBot Replicator 2 - and near my MacBook Air. Get comfortable using the Mac! 😛

Starnite commented 7 years ago

Got it, thanks :)

Starnite commented 7 years ago

@ericmjl I'm not sure if this is the company you're looking at for tiled DNA synthesis, but this is the only one I found online and it has a fragment size minimum of 400bp: https://sgidna.com/dna_tiles.html

ericmjl commented 7 years ago

@starnite: great search, didn't know about them before. I'll have to compare their pricing to the internal capabilities at the Broad.

Starnite commented 7 years ago

@ericmjl

Greetings from Beijing!

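My current understanding of the tiled synthesis process we're attempting is the following:

  • generate an OLS pool: the 5' end and 3' end fragments will include, from inside out, a pCI overhang and a barcode sequence
  • mix and assemble fragments
  • divide up the mixture of assembled DNA into a large array of PCR tubes
  • for each tube, amplify using primers targeted to a unique pair of barcode sequences
  • remove barcodes and clone into pCI backbone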

With regards to the barcode, there are technically 2969^11 combinations of the fragments as my algorithm currently calculates it, but each of the different PB2 sequences only differs from each other by 1 nucleotide on average. There are about 200 unique 5' end fragments, so a 5-base barcode should be sufficient.

Would it be feasible to only barcode the 5' end fragments with known barcodes and then amplify each assembly with just the forward primer? It would be linear instead of exponential, but we'd still get a lot more of the targeted sequence than the rest.

ericmjl commented 7 years ago

Greetings, Beijinger! This is a great update; I'll go section by section for responses.

My current understanding of the tiled synthesis process we're attempting is the following:

  • generate an OLS pool: the 5' end and 3' end fragments will include, from inside out, a pCI overhang and a barcode sequence
  • mix and assemble fragments
  • divide up the mixture of assembled DNA into a large array of PCR tubes
  • for each tube, amplify using primers targeted to a unique pair of barcode sequences
  • remove barcodes and clone into pCI backbone

Yup, this would be right.

With regards to the barcode, there are technically 2969^11 combinations of the fragments as my algorithm currently calculates it, but each of the different PB2 sequences only differs from each other by 1 nucleotide on average. There are about 200 unique 5' end fragments, so a 5-base barcode should be sufficient.

This is good news. 5-base barcodes can easily be encoded on any primer, since primers are usually at least 20-30 nt long.
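As a quick check on the barcode arithmetic: a barcode of length k over A, C, G, T can take 4^k values. The sketch below (function names are hypothetical) computes the shortest length that covers a given number of fragments. It only counts distinct barcodes and ignores practical design constraints, such as keeping barcodes several mismatches apart, which would push the required length up.

```python
def barcode_capacity(length, alphabet_size=4):
    """Number of distinct barcodes of the given length."""
    return alphabet_size ** length


def min_barcode_length(n_targets, alphabet_size=4):
    """Shortest barcode length with at least n_targets distinct barcodes."""
    length = 0
    while barcode_capacity(length, alphabet_size) < n_targets:
        length += 1
    return length


print(min_barcode_length(200))  # shortest length that covers ~200 fragments
print(barcode_capacity(5))      # distinct barcodes available with 5 bases
```

For about 200 unique 5' end fragments, 4 bases (256 barcodes) is the bare minimum, and 5 bases (1024 barcodes) leaves roughly five-fold headroom.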

Would it be feasible to only barcode the 5' end fragments with known barcodes and then amplify each assembly with just the forward primer? It would be linear instead of exponential, but we'd still get a lot more of the targeted sequence than the rest.

I might need a bit of help visualizing this (right now my brain is a bit muddled). Can you sketch a figure and post it here? How would doing only 5' end barcoding make things more specific? Are there other technical reasons why it would be better to barcode only the 5' end and not both ends?

Starnite commented 7 years ago

If we 1) barcode both ends with a pre-determined barcode, then when we amplify by targeting with a pair of primers, there’s no guarantee that there will be an assembly that has that precise pair of barcodes. If we 2) barcode/target both ends with degenerate bases, then there’s an even smaller chance that the primers will hybridize.

Unless I’m going about this all wrong?

ericmjl commented 7 years ago

If we 1) barcode both ends with a pre-determined barcode, then when we amplify by targeting with a pair of primers, there’s no guarantee that there will be an assembly that has that precise pair of barcodes. If we 2) barcode/target both ends with degenerate bases, then there’s an even smaller chance that the primers will hybridize.

Hmmm, maybe the following example will clarify.

Suppose I give you the following two sequences, annotated below.

seq1 = [cttagaggactcgtaagc]CAATCGGAGGCTCTCAGAGCGATCGA[ttagcgggactcgcgcga]
seq2 = [tcaggtacggcatagag]CAATCGAAGGCTCTCAGAGCGATCGA[cgcgttagactctctag]

The random nucleotide barcodes, in the 5'->3' direction, are placed in the square brackets.

If we wanted to retrieve seq2, we can use primers that end in the barcode sequence for seq2, and if we wanted to retrieve seq1, we can use the corresponding primers.
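The retrieval idea can be mimicked in a toy model, using the two example sequences above. This is only a sketch: `retrieve` and `reverse_complement` are hypothetical helpers, and "amplification" is reduced to an exact string match of the forward primer at the 5' end and of the reverse primer's complement at the 3' end, with no mismatch tolerance or annealing kinetics.

```python
def reverse_complement(seq):
    """Reverse complement of a DNA string (case preserved)."""
    table = str.maketrans("ACGTacgt", "TGCAtgca")
    return seq.translate(table)[::-1]


def retrieve(pool, fwd_primer, rev_primer):
    """Names of templates that both primers match exactly (toy model)."""
    fwd = fwd_primer.upper()
    rev_site = reverse_complement(rev_primer).upper()
    return [
        name
        for name, seq in pool.items()
        if seq.upper().startswith(fwd) and seq.upper().endswith(rev_site)
    ]


pool = {
    "seq1": "cttagaggactcgtaagc" + "CAATCGGAGGCTCTCAGAGCGATCGA" + "ttagcgggactcgcgcga",
    "seq2": "tcaggtacggcatagag" + "CAATCGAAGGCTCTCAGAGCGATCGA" + "cgcgttagactctctag",
}

# Primers for seq2: forward = 5' barcode, reverse = revcomp of the 3' barcode.
fwd_seq2 = "tcaggtacggcatagag"
rev_seq2 = reverse_complement("cgcgttagactctctag")
print(retrieve(pool, fwd_seq2, rev_seq2))
```

With the seq2 primer pair only `seq2` is returned; mixing the seq1 forward primer with the seq2 reverse primer returns nothing, because no template in this pool carries that barcode pair.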

How would we know which primers were paired with which polymorphism? In this case, we would know if we had done a PacBio sequencing run over the entire pool of re-assembled nucleotides, including seq1, seq2, and the thousands of others present.

Does this clarify it a bit?

The assembly may be random, but we can back out which primers went with which unique sequence by doing PacBio sequencing. At least, that's the theory; it must be tested in practice! 😄

Starnite commented 7 years ago

So we can send in the different sequences all mixed together, and PacBio can sequence every single one individually?

ericmjl commented 7 years ago

That's the thought.

ericmjl commented 7 years ago

If you recall what our original plan was with barcoding each randomly-mutated PB2 sequence, it's roughly the same idea. PacBio sequencing only requires linearized double-stranded DNA as an input, and over half of its reads (160K reads) are 20-60 kb in length, which means that on a library of 2 kb linear fragments, we can get 10-30X read coverage for about 160K individual variants.
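The coverage figure is just read length divided by fragment length. A small sketch of that arithmetic, using the numbers quoted above (the function name is hypothetical, and adapter sequence and per-pass error are ignored):

```python
def passes_over_insert(read_length_bp, insert_length_bp):
    """Approximate number of full passes over an insert in one long read."""
    return read_length_bp // insert_length_bp


for read_length in (20_000, 60_000):
    print(read_length, passes_over_insert(read_length, 2_000))
```

A 20 kb read covers a 2 kb fragment about 10 times and a 60 kb read about 30 times, which is where the 10-30X range comes from.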

Starnite commented 7 years ago

How do they separate out one strand from the other though? Is it just from the fact that only one sequence can be bound to each ZMW well?

I checked the melting temperatures (calculated from GC content) of the tiles, and they’re all within about 4C from each other.
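The thread does not say which Tm formula was used. As an illustration only, the sketch below assumes a common GC-content approximation for oligos longer than about 13 nt, Tm = 64.9 + 41 (G + C - 16.4) / N, and reports the spread across a set of tiles. The function names are hypothetical, and the formula ignores salt and oligo concentration.

```python
def tm_gc(seq):
    """Approximate melting temperature (°C) from GC content.

    Assumed formula: 64.9 + 41 * (G + C - 16.4) / N, for oligos > 13 nt.
    """
    seq = seq.upper()
    gc = seq.count("G") + seq.count("C")
    return 64.9 + 41.0 * (gc - 16.4) / len(seq)


def tm_spread(seqs):
    """Difference between the highest and lowest Tm in a set of sequences."""
    tms = [tm_gc(s) for s in seqs]
    return max(tms) - min(tms)


tiles = ["AC" * 15, "ACGC" * 7 + "AC", "AT" * 10 + "GC" * 5]
print([round(tm_gc(t), 1) for t in tiles])
print(round(tm_spread(tiles), 1))
```

The same `tm_spread` call can be applied to the overlap regions alone rather than to whole tiles.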

ericmjl commented 7 years ago

How do they separate out one strand from the other though? Is it just from the fact that only one sequence can be bound to each ZMW well?

Hmmm... Here are three exercises that might help you.

Try drawing the tiled DNA assembly yourself using the case of a shorter nucleotide sequence (barcode not necessary) broken up into 3 overlapping tiles, first subjected to PCR. As long as both strands are synthesized (as fragmented ssDNA), there'll be an opportunity for the strands to prime one another and eventually form the full-length fragment.

Now consider the case of two completely different sequences, fragmented into three fragments each, and re-synthesized as dsDNA. They should be able to find each other, and re-synthesize fully following PCR. (I've done this one before in a small scale experiment just after my undergrad.)

Finally, try considering the case where there are two similar (but non-identical) sequences. Random nucleotides should be able to help?
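The second and third exercises can also be tried in silico before drawing them out. The sketch below is a toy model with hypothetical helper names: tiles are joined only when their overlaps match exactly, and every valid chain of tiles is enumerated. It uses 6 nt tiles with 2 nt overlaps purely to keep the example readable.

```python
from itertools import product


def tiles_of(seqs, starts, size):
    """Unique tiles found at each start coordinate across all sequences."""
    return [sorted({s[i:i + size] for s in seqs}) for i in starts]


def assemble(tiles_by_position, overlap):
    """Enumerate full-length products from tiles with exactly matching overlaps."""
    products = set()
    for combo in product(*tiles_by_position):
        joins_ok = all(
            a[-overlap:] == b[:overlap] for a, b in zip(combo, combo[1:])
        )
        if joins_ok:
            # Keep the first tile whole, then append the non-overlapping tails.
            products.add(combo[0] + "".join(t[overlap:] for t in combo[1:]))
    return products


starts, size, overlap = (0, 4, 8), 6, 2

different = ["AAAAAAAAAAAAAA", "CCCCCCCCCCCCCC"]
similar = ["ACGTACGTACGTAC", "TCGTACGTACGTAG"]

print(sorted(assemble(tiles_of(different, starts, size), overlap)))
print(sorted(assemble(tiles_of(similar, starts, size), overlap)))
```

In this model the two completely different sequences reassemble only into themselves, while the two similar sequences, which share identical overlaps, give four products: the two originals plus two chimeras.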

I checked the melting temperatures (calculated from GC content) of the tiles, and they’re all within about 4C from each other.

Great stuff. Did you check that the annealing portions are within 4-5ºC of one another as well? (You can get the º symbol by doing Alt+0).

Starnite commented 7 years ago

Re: Tm of annealing portions – I’ve made it so that the Tms of all the annealing tile overlaps will be within 4ºC of the highest Tm from 30 bp overlaps. It hasn’t changed the number of unique 5’ or 3’ end fragments.