hnisonoff closed this 5 months ago
Hi Hunter,
I looked into this case and the first measurement (126885) is probably wrong.
To give some background, these 4 sequences are all different DNA sequences and measured in different libraries. The first was measured in "lib2", the second and fourth in "lib3", and the third in "lib4". You can go back and look at the original raw data for all of these sequences in the "Raw_NGS_count_tables" download. Unfortunately the sequence names are actually different between files (some substitutions of : for | ), but you can match things on the DNA sequence.
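For anyone repeating this lookup, here is a minimal sketch of joining the two files on DNA sequence rather than on name. It assumes each file has been read into a list of dicts (for example with `csv.DictReader`) and that the DNA sequence lives in a column called `dna_seq`. That column name and both helper functions are hypothetical, so adjust them to the actual headers.

```python
def index_by_dna(rows, dna_col="dna_seq"):
    """Map each DNA sequence (upper-cased) to the list of rows that carry it."""
    index = {}
    for row in rows:
        key = row[dna_col].strip().upper()
        index.setdefault(key, []).append(row)
    return index


def match_on_dna(processed_rows, raw_rows, dna_col="dna_seq"):
    """Pair processed rows with raw-count rows that share a DNA sequence.

    Names are ignored on purpose, because they differ between files
    (":" vs "|"). "dna_seq" is an assumed column name.
    """
    raw_index = index_by_dna(raw_rows, dna_col)
    pairs = []
    for row in processed_rows:
        key = row[dna_col].strip().upper()
        for raw in raw_index.get(key, []):
            pairs.append((row, raw))
    return pairs
```

Matching on the sequence itself sidesteps the naming mismatch entirely, at the cost of treating any two rows with identical DNA as the same construct.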
If you look at the raw data, you can see that 126885 is undersampled in the NGS counts: the "T01", "T13", "C01", and "C13" columns (starting counts before any protease selection) are all very low compared to other sequences in the file (in fact it's the third lowest single mutant). This low number of counts does actually get propagated into our analysis: the deltaG_95CI is 0.6, which we consider high, and this specific mutant is actually marked by a black slash in Fig2. Of course, in this case the 95CI value doesn't reflect the true error, which is probably much larger.
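One way to check for undersampling like this is to rank sequences by their starting counts. The sketch below sums the "T01", "T13", "C01", and "C13" columns named above. Summing them is an assumption made for illustration, not necessarily how the original analysis weighs them, and the `name` column and both function names are hypothetical.

```python
START_COLS = ("T01", "T13", "C01", "C13")  # starting-count columns named in the thread


def starting_counts(row, cols=START_COLS):
    """Total pre-selection reads for one row (values may arrive as strings)."""
    return sum(int(float(row[c])) for c in cols)


def rank_by_starting_counts(rows, name_col="name"):
    """Return (name, total) tuples sorted from fewest to most starting reads."""
    totals = [(row[name_col], starting_counts(row)) for row in rows]
    return sorted(totals, key=lambda item: item[1])
```

Sequences at the top of the returned list are the ones whose deltaG estimates deserve the most suspicion.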
In general we will have duplicates of specific AA sequences because we re-measured the sequence in a different library (or even the same library) for a different purpose with a different stochastically generated DNA sequence. Hopefully most of the time these measurements agree, but sometimes they won't.
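To see how often duplicate measurements disagree, one option is to group rows by amino acid sequence and report the spread in deltaG for each group. This is only a sketch: the column names `aa_seq` and `deltaG` are assumptions about the processed file's headers, and `duplicate_spread` is a hypothetical helper.

```python
from collections import defaultdict


def duplicate_spread(rows, aa_col="aa_seq", dg_col="deltaG"):
    """For each AA sequence measured more than once, return max - min deltaG.

    Column names are assumed; rows are dicts such as csv.DictReader yields.
    """
    groups = defaultdict(list)
    for row in rows:
        groups[row[aa_col]].append(float(row[dg_col]))
    return {aa: max(vals) - min(vals) for aa, vals in groups.items() if len(vals) > 1}
```

Groups with a large spread are candidates for checking against the raw counts before trusting any single value.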
In this case the other three measurements of this sequence are not in close agreement, but they do indicate that this sequence is very unstable.
Thanks, Gabe
Thank you for the explanation! Sorry if I missed this in your response, but should mutants 2-4 still be called double mutants? The explanation of the measurement error between different libraries/replicates makes sense!
They are all one mutation away from the 3L1X WT sequence, but they are members of sets of sequences where we investigated double mutants of 3L1X.
Ah sorry I misunderstood the meaning of that column then. Thanks! Closing.
Thank you for releasing this great dataset. I was looking at some of the data and noticed some odd instances where the same amino acid sequence had very different predicted delta G values in the processed file Tsuboyama2023_Dataset2_Dataset3_20230416.csv. One example stood out as a possible processing issue. It appears some sequences are being called double mutants when in reality they are duplicates of a single mutant. However, for whatever reason their dGs are wildly different, suggesting perhaps that they are indeed double mutants but that their identity is incorrectly labeled.