Rocklin-Lab / cdna-display-proteolysis-pipeline

Possible data processing bug #8

Closed hnisonoff closed 3 months ago

hnisonoff commented 4 months ago

Thank you for releasing this great dataset. I was looking at some of the data and I noticed some odd instances where the same amino acid sequence had very different predicted delta G values in the processed file Tsuboyama2023_Dataset2_Dataset3_20230416.csv.

One example stood out as a possible processing issue. It appears that some sequences are being called double mutants when in reality they are duplicates of a single mutant. However, their dGs are wildly different, which suggests that they may indeed be double mutants whose identity is incorrectly labeled.

                                     name   mut_type     dG_ML
126885                      3L1X.pdb_D17A       D17A  4.150432
709075        3L1X.pdb_hnet2_3x_D17A:R29R  D17A:R29R  0.870831
709446  3L1X.pdb_dmutv5_17D:29R_D17A:R29R  D17A:R29R -0.547673
709838        3L1X.pdb_hnet2_3x_D17A:K61K  D17A:K61K -0.132003

grocklin commented 3 months ago

Hi Hunter,

I looked into this case and the first measurement (126885) is probably wrong.

To give some background, these 4 entries are all different DNA sequences and were measured in different libraries. The first was measured in "lib2", the second and fourth in "lib3", and the third in "lib4". You can go back and look at the original raw data for all of these sequences in the "Raw_NGS_count_tables" download. Unfortunately the sequence names differ between files (some substitutions of : for | ), but you can match entries on the DNA sequence.
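A minimal sketch of matching on DNA sequence instead of on name, using only the Python standard library. The column name `dna_seq`, the helper names, and the toy rows are assumptions for illustration; check the headers in the actual files before using it.

```python
from collections import defaultdict

def index_by_dna(rows, dna_key="dna_seq"):
    """Map each DNA sequence (upper-cased) to the rows that carry it."""
    index = defaultdict(list)
    for row in rows:
        index[row[dna_key].strip().upper()].append(row)
    return index

def match_on_dna(processed_rows, raw_rows, dna_key="dna_seq"):
    """Pair each processed row with the raw-count rows sharing its DNA sequence."""
    raw_index = index_by_dna(raw_rows, dna_key)
    return [(row, raw_index.get(row[dna_key].strip().upper(), []))
            for row in processed_rows]

# Toy rows: the names differ only in ':' vs '|', and the DNA strings are
# placeholders, not sequences from the dataset.
processed = [{"name": "example_A:B", "dna_seq": "ATGGCT"}]
raw = [{"name": "example_A|B", "dna_seq": "atggct"},
       {"name": "other", "dna_seq": "ATGAAA"}]

pairs = match_on_dna(processed, raw)
for proc_row, raw_hits in pairs:
    print(proc_row["name"], "->", [r["name"] for r in raw_hits])
```

For real files, `csv.DictReader` yields rows as dictionaries that can be passed straight to `match_on_dna`.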

If you look at the raw data, you can see that 126885 is undersampled in the NGS counts: the "T01", "T13", "C01", and "C13" columns (starting counts before any protease selection) are all very low compared to other sequences in the file (in fact it's the third lowest single mutant). This low number of counts does get propagated into our analysis: the deltaG_95CI is 0.6, which we consider high, and this specific mutant is marked by a black slash in Fig2. Of course, in this case the 95CI value doesn't reflect the true error, which is probably much larger.
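One way to spot undersampled sequences like this is to sum the starting counts and rank sequences from lowest to highest. The sketch below assumes each row is a dictionary with a `name` key plus the four count columns named above; the helper names and the toy counts are invented purely to show the ranking.

```python
START_COLUMNS = ("T01", "T13", "C01", "C13")  # column names quoted in this thread

def starting_counts(row, columns=START_COLUMNS):
    """Sum the pre-selection NGS counts for one sequence."""
    return sum(int(row[c]) for c in columns)

def rank_by_starting_counts(rows, columns=START_COLUMNS):
    """Return (name, total) pairs sorted from lowest to highest starting counts."""
    totals = [(row["name"], starting_counts(row, columns)) for row in rows]
    return sorted(totals, key=lambda pair: pair[1])

# Toy counts, not real data.
rows = [
    {"name": "seq_a", "T01": "120", "T13": "110", "C01": "130", "C13": "125"},
    {"name": "seq_b", "T01": "4", "T13": "6", "C01": "3", "C13": "5"},
    {"name": "seq_c", "T01": "80", "T13": "95", "C01": "70", "C13": "88"},
]
ranked = rank_by_starting_counts(rows)
print(ranked[0])  # the most undersampled sequence in the toy rows
```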

In general we will have duplicates of specific AA sequences because we re-measured the sequence in a different library (or even the same library) for a different purpose with a different stochastically generated DNA sequence. Hopefully most of the time these measurements agree, but sometimes they won't.
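To see how well repeated measurements of the same AA sequence agree, one option is to group rows by amino acid sequence and report the spread in dG. This is a rough sketch: the `aa_seq` column name and the helper name are assumptions, the `dG_ML` values are the four from the table in this issue, and the shared amino acid sequence is replaced by a placeholder string.

```python
from collections import defaultdict

def dg_spread_by_sequence(rows, seq_key="aa_seq", dg_key="dG_ML"):
    """For each AA sequence measured more than once, return max(dG) - min(dG)."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[seq_key]].append(float(row[dg_key]))
    return {seq: max(vals) - min(vals)
            for seq, vals in groups.items() if len(vals) > 1}

# "AA_SEQ_PLACEHOLDER" stands in for the shared amino acid sequence.
rows = [
    {"name": "3L1X.pdb_D17A", "aa_seq": "AA_SEQ_PLACEHOLDER", "dG_ML": 4.150432},
    {"name": "3L1X.pdb_hnet2_3x_D17A:R29R", "aa_seq": "AA_SEQ_PLACEHOLDER", "dG_ML": 0.870831},
    {"name": "3L1X.pdb_dmutv5_17D:29R_D17A:R29R", "aa_seq": "AA_SEQ_PLACEHOLDER", "dG_ML": -0.547673},
    {"name": "3L1X.pdb_hnet2_3x_D17A:K61K", "aa_seq": "AA_SEQ_PLACEHOLDER", "dG_ML": -0.132003},
]
spreads = dg_spread_by_sequence(rows)
for seq, spread in spreads.items():
    print(seq, round(spread, 3))
```

Sequences with a large spread are the ones worth checking against the raw counts.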

In this case the other three measurements of this sequence are not in close agreement, but they do indicate that this sequence is very unstable.

Thanks, Gabe

hnisonoff commented 3 months ago

Thank you for the explanation! Sorry if I missed this in your response, but should mutants 2-4 still be called double mutants? The explanation of the measurement error between different libraries/replicates makes sense!

grocklin commented 3 months ago

They are all one mutation away from the 3L1X WT sequence, but they are members of sets of sequences where we investigated double mutants of 3L1X.
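To make this distinction explicit when reading the `mut_type` column, the identity substitutions (such as R29R) can be dropped from each label. The sketch below assumes labels are colon-separated tokens of the form wild-type letter, position, new letter, as in the table above; `effective_mutations` is a hypothetical helper, and labels in any other format raise a `ValueError`.

```python
import re

# One substitution token, e.g. "D17A": wild-type residue, position, new residue.
_SUBSTITUTION = re.compile(r"^([A-Z])(\d+)([A-Z])$")

def effective_mutations(mut_type):
    """Return the substitutions in a mut_type label that actually change a residue."""
    changes = []
    for token in mut_type.split(":"):
        match = _SUBSTITUTION.match(token)
        if match is None:
            raise ValueError(f"unrecognised substitution token: {token!r}")
        wild_type, _position, new_residue = match.groups()
        if wild_type != new_residue:  # skip identity entries such as R29R
            changes.append(token)
    return changes

# mut_type labels from the table in this issue.
for label in ["D17A", "D17A:R29R", "D17A:K61K"]:
    print(label, "->", effective_mutations(label))
```

All three labels reduce to the single change D17A, which matches the description above: one mutation away from WT, but measured as part of a double-mutant set.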

hnisonoff commented 3 months ago

Ah sorry I misunderstood the meaning of that column then. Thanks! Closing.