fhalab / evSeq

Computational tools for extremely low-cost, massively parallel amplicon-based sequencing of every variant in protein mutant libraries.
https://fhalab.github.io/evSeq/
Other
28 stars 9 forks source link

Random questions #35

Open rchurt opened 2 years ago

rchurt commented 2 years ago

Hey guys,

I have a few questions from my run with the updated evSeq scripts:

  1. Is there an uninstall script for the local version, in particular for the GUI? Could be worth adding info to docs.
  2. For the BpIndStart and AaIndStart arguments, it gives the range as [0,inf], but I'm not sure how the value could be 0. It seems like that could only happen if there were a "-1" (or "-2" etc.) nt/codon included before the ATG, but in that case it seems like the range should be [-inf,inf].
  3. pip install evSeq not working
  4. Regarding the way that mutations are displayed in the plate map: I'm wondering if there's a better way of showing when there are multiple species in the same well. For example, look at DI01 well A07 here: if I'm interpreting it correctly, it seems like there are two populations in that well (?28V?31W and ?52E?56G) in a roughly 60:40 ratio (though correct me if that's the wrong interpretation). (Something similar happening in well D05). But in the plate map it just shows ?28V?31W?52E_?56G with a 0.38128 alignment frequency, which I interpreted as saying that 38% of reads show all four mutations, and the other 62% don't show all four. Or is the alignment frequency score more an indication of confidence than abundance? If the first interpretation is the right one, can we come up with a better way to plot the data so it's clear that there are multiple populations and what they are? I think you might be able to have multiple tooltips appear when you hover over a square.
  5. I don't think I really understand the Coupled/Decoupled definition, or the relationship between two mutations occurring together and paired-end reads if my reads don't overlap. Could you expand on that a bit more?
  6. In the docs about output, it says of the plate map: "Also note that, because the position information is not given, the output csv files in the previous section should be used for downstream processing," but it seems like the position information for each mutated residue is listed—what did you guys mean by this?

Thanks!

palmhjell commented 2 years ago

Hey Rob,

I think @brucejwittmann can provide some more insight on the underlying process, but I just wanted clarify a few things that might be causing some of the misunderstandings in point 4. This should also clear up point 5, which I think is causing your issues in point 4 to begin with. I had to learn this (since Bruce wrote the code), so hopefully it's helpful.

TL;DR

The TL;DR version is that Decoupled outputs are a way of saying "here's all the mutations we found, but not how they relate to one another" (i.e., no info is given on how mutations occurred within the same DNA molecule that was sequenced, kind of similar to Sanger traces), while the Coupled outputs keep that interdependence, which is important in actual protein function. When you're looking for mixed populations, you should be looking at the Coupled_All file, not the Decoupled_Max. (In fact, Decoupled_Max might be the most useless file we provide, so we should update that in the docs...). The plate maps already represent the information you want in the outline color, and we likely will not add more information to the tooltips because that will quickly get out of hand. It's up to you to recognize where there are mixed populations and adjust accordingly!

Okay, more details below.

On the Decoupled file meanings:

As stated in the docs, the decoupled outputs are independent. Think of the decoupled outputs as all of the sequences that pass QC which are then analyzed for any variability in their sequences relative to the provided reference sequence.

What this means is that, for the case of ?28V?31W, from the alignment frequencies of 0.69857 and 0.68245, ~70% (not 60%) of all of the reads that correspond to well DI01-A07 (as determined by evSeq barcodes) which pass QC found the variability that corresponds to the mutations ?28V?31W. But, again, the decoupled reads are treated independently, so from looking at well A07 in the example you provided, it's not that ~70% have the double mutant ?28V_?31W; it's that ~70% have 28V, and ~70% have 31W.

The other two mutations, 52E and 56G, are found at 45% and 42%. So, in clarifying these percentages, we see they aren't 60:40 and sum to unity – they're ~70:45, and far off from unity. They're not two different populations, they are (mostly) the same population of a 4-site mutant.

The entire read dataset is located in the Decoupled_All output file, which will provide all variability found in a given well. You should find that, for each AaPosition in a given well, these frequencies sum to unity. So what the Decoupled AlignmentFrequency output means is that, independently considering reads, these are the positions that we found that are different from the reference sequence (above a given threshold), and the specific differences. The WellSeqDepth is the total number of reads with the given variability at each position, and you should see that for each well they are mostly the same for a given well, with slight (<10%) differences due to QC/base calling differences.

Coupled reads are what you want:

Now, if you want to consider the physical DNA molecules and those populations, you cannot consider the decoupled reads. That's basically just Sanger sequencing. It can be useful info, so we provide it, but it's not the point of evSeq. You need to give context to each mutation, and that comes from the Coupled files.

The AminoAcids_Coupled_All file provides every DNA sequence found that was not the reference sequence in each well. So if we go to this file and look at well DI01-A07, you see the actual mutation combinations that were picked up from paired-end sequencing for each molecule of DNA that was sequenced. You'll see ?28V?31W?52E_?56G (we'll call it VWEG) with a frequency of 38%. You'll also see TTLL at 22%, and VWVN at 20%. You'll then see some other ones at lower percentages, which eventually sum to 100%. These are the real populations of DNA that were sequenced, and mutations that occur in the same molecule will be kept together here. Well DI01-A07 contained three primary variants, with VWEG, TTLL, and VWVN at 80% of the DNA population (the rest likely being noise).

Representation in a plate map:

Finally, we only pull the variant that occurred with the highest frequency and provide that in the AminoAcids_Coupled_Max file, which is used for the plate maps. 90% of the time you'll find that the "max" variant is a variant with >90% AlignmentFrequency, and thus is the dominant (or only) variant in the well. Sometimes, however, in the case of mixed populations, this value will be much less than 90%. For the example above, VWEG was only at 38%, yet it still is the max. We use the Max file for the plate maps, and thus well A07 will show up as VWEG. However, the alignment frequency (and WellSeqDepth) are also encoded in the plate map, so you should immediately see that this well is outlined in magenta/orange, signifying that this well needs more attention due to low alignment frequency. You can then go to the Coupled_All file and find what else was in there. This is how we account for this.

Hopefully this helps!

brucejwittmann commented 2 years ago

Hi Rob,

Answers to your questions one by one:

  1. evSeq is entirely installed within the context of the conda environment active during installation (most likely the "evSeq" conda environment). To uninstall it, just delete that conda environment. For instance conda env remove -n evSeq. The desktop icon for the GUI is just a shortcut to commands contained in the conda environment, so once the conda environment is gone you can just delete that as you would anything else from your desktop.
  2. The range starts at 0 to account for different indexing systems within protein sequences. Convention for cytochromes P450, for instance, is to ignore the leading methionine when indexing (effectively giving it an index of "0"). Note that the AaIndStart and BpIndStart fields are completely inconsequential to evSeq calculations, and are present purely to provide more detailed outputs.
  3. We need to fix this one and will get on it.
  4. See the answer to 5.
  5. I think Patrick gives a good answer to this question and question 4. My own two cents: The difference between coupled and decoupled is only really significant when there are multiple mutated positions in a sequence. Say, for instance, I have a 50-50 mix of two variants: one has mutations at positions w and x, the other has mutations at positions y and z. In the decoupled output, those mutations all show up as their own entry; because there are only two options at each position -- either mutated or unmutated -- all 4 have a 50% alignment frequency. So how do I know the actual variant sequences? It could, for instance, be one parent sequence with another sequence that has 4 mutations, it could be one sequence with 1 mutation and another with 3, etc. By just counting the frequency of mutation at each position, I cannot determine the true sequences in the population. Rather than this decoupled output, we thus instead need a coupled output that tells us what mutations occurred together on the same reads in the returned sequencing data. This would be able to tell me that there are two variants each with two mutations.
  6. That line is from an old version of the output where we presented combinations without their position information (i.e. the "SimpleCombo" output), you can ignore it. I'll update the docs accordingly. Thanks for the catch :)
rchurt commented 2 years ago

Hey guys,

Thanks for such detailed responses. This helps a lot. Do you think it would be worth adding your response about the uninstall and maybe a technical note about coupled/decoupled reads to the docs? Also, do you trust the ratios of the mixed species you see in the coupled reads, or do you think the library prep PCRs shift them?

The last thing I'm trying to understand is the relationship between coupled/decoupled reads and single-end vs. paired-end reads. In the docs it says: "Decoupled files are the result of counting bases independent of reads (i.e., they do not capture information about how frequently two mutations occur together when considering pair-end sequencing) while Coupled files contain the results of counting bases considering paired reads." To me, it sounds like that's saying that decoupled files represent pseudo-single-end reads and coupled represent paired-end, but based on your explanation, it doesn't seem like the single- or paired-end nature of the read is related to the coupling. Regardless of whether you used single- or paired-end sequencing, coupled always represents single-molecule data and decoupled represents population-level data (meaning that even if two mutations showed up in the same single-end read, they wouldn't show up together in the decoupled data), right? So in that case, it could read: "Decoupled files are the result of counting bases on a population level (i.e., they do not capture information about how frequently multiple mutations occur together in a given DNA molecule), while Coupled files contain the results of counting bases read from a single molecule." Does that seem right?

brucejwittmann commented 2 years ago

Sure we can add some notes to the docs. The uninstall command is already in there, albeit hidden under instructions for updating. I can make it more explicit.

I can also add a little more on coupled/decoupled. Is there something in particular that is confusing/could do with extra clarification?

I would trust the ratios of mixed species. PCR cannot change the frequency of the different mutations, only introduce new combinations. So if you're looking at the decoupled output the only noise impacting ratios is the frequency of base-calling errors during sequencing, which will be low. If you're looking at the coupled output, you might see a few spurious new combinations caused by either base-calling errors or PCR-induced recombination, but in total these will typically make up <5% of the population of returned reads. Except in some contrived situations, the ratio between the dominant species should be maintained despite PCR.

As for paired-end vs not: under the hood, one of the first steps of evSeq is to take paired end reads and combine them to produce a single composite read. You effectively end up with an N x M matrix, where "N" is the number of reads and "M" is the read length. If you just count the frequency of mutations down the columns of this matrix, then you get the output for decoupled; if you identify unique rows and then tally them, then you get the coupled output. I would rephrase your statement to "Decoupled files are the result of counting bases independently (i.e., they do not capture information about how frequently multiple mutations occur together in a given DNA molecule) while Coupled files are the result of counting combinations of bases (i.e., they capture information about the unique DNA molecules returned and their prevalence within the population)."

rchurt commented 2 years ago

That sounds great--I would probably change that statement to how you've written it here ("Decoupled files are the result of counting bases..."), which I think would explain it well enough for 99% of users. The only other things I would consider adding are the examples that Patrick gave (based on the example data in the repo) including specific percentages, why they sum to 100% or not, and how to interpret them physically, as well as an explicit statement that the AminoAcids_Coupled_Max data is what's used for the plate map.

Regarding the ratios, I think that in NGS experiments there's usually a concern that the exponential amplification of PCR can change the ratios of the amplified species. They call it "amplification bias" in the literature--you can check it out and see what you think, but it seems like the only concern that applies in this case would be variability in the starting copy numbers of the species. To mitigate it, they usually recommend using a lot of template DNA and running as few cycles as needed to get enough product. In this case it's probably not a huge concern because most wells will only have a few different species max and often we don't care about the exact ratios, but it could be worth adding a short note to the docs for people who don't do much NGS.

brucejwittmann commented 2 years ago

My understanding is that amplification bias is primarily of concern when amplifying very different templates within the same PCR. A mixed population in evSeq will differ at only a few bases (after all, they're all mapped to the same reference sequence), so I don't think it is really a concern in this case.

Just so that we've got it all in one place, here is our to-do list from this thread:

  1. Add explicit uninstall instructions.
  2. Fix pip install problem.
  3. In the output docs, remove the note "Also note that, because the position information is not given..."
  4. Improve docs on decoupled vs coupled. Use the description in my previous comment and add some examples from Patrick's comment.

Let us know if you have other questions/comments!

rchurt commented 2 years ago

Ok sounds great. Those would be great, and the last thing is that it might be nice if at the end of the run, the evSeq script printed the directory of the output files, just for convenience. But that's all I can think of right now, and thanks for all the help!

palmhjell commented 2 years ago

We can probably add this with the "run finished successfully" statement and may do so in an update, but in the meantime, the output directory is already specified as a user input (defaults to the folder arg +/evSeqOutput, I think), with the specific run folder just being the date-time of the run (YYYYMMDD-HHMMSS format). So, you can either use the time package to print the date and time right before the run (which should be very close to or exactly the output folder name, as it's made on program init), or just use the glob package, sort the list of evSeq output files, and pull the last one.

rchurt commented 2 years ago

Ok makes sense, and fortunately Kadina added this to the Colab notebook I've been using. Just thought it could make life easier for non-coders, but totally up to you guys.

palmhjell commented 2 years ago

Oh sorry, I misinterpreted, I thought you wanted a programmatic way of accessing/checking the output!

It's always going to be the most recent folder made within the evSeqOutput folder (at least until the 100th century, haha), and the GUI, command line, and docs specify what the default output location is (cwd for CLI, folder for GUI since there is no great cwd). Do you find that you have a hard time locating the output? And is it the evSeqOutput folder, or the specific run folder? What's least obvious about where it puts the output? (And is it Colab-specific? We've done very little with Colab due to some previous issues.)

Like I said, printing it out at the end is okay, but I personally would prefer to make defaults more explicit than printing out various values to standard out every time the program is run (as a general programming philosophy; also, changes to docs are quicker to implement than small changes to code and updating versions). Regardless I'll look into making this value more obvious!

rchurt commented 2 years ago

Yeah the confusing part initially was that when I ran git clone I specified a directory for the download (rather than using the cwd), and I was expecting the output to go into the cloned evSeq directory by default (wasn't even paying attention to my cwd when doing either). I found it by digging around, but I thought it might make it easier on users to either change the default output directory to the evSeq directory, or explicitly print the output path, or both. The other reason I had for wanting to print the output path was that in Kadina's Colab, the next cell takes it as input, so it would streamline the process for users and not require them to dig through the directory tree. But that was just my experience of using it, which may not be the average!

rchurt commented 2 years ago

Hey guys, I just noticed that the first link here is broken. Do you have a spreadsheet with the sequences of the outer primers? I found this, but it only has one plate's worth of sequences.

palmhjell commented 2 years ago

Good catch, we'll fix that link. The spreadsheet you found is correct. The reverse barcodes are on the second sheet.

rchurt commented 2 years ago

Ok I see, but aren't there 8 plates worth of outer primers?

palmhjell commented 2 years ago

You only need to order two plate of outer primers. There are 8 plates of outer primer combinations. Please read the first paragraph of the docs section from which you sent us the broken link.

rchurt commented 2 years ago

Oops, ok got it.