duplicated wt data in Processed_K50_dG_datasets/K50_dG_Dataset1_Dataset2.csv

JinyuanSun commented 1 year ago

In Processed_K50_dG_datasets/K50_dG_Dataset1_Dataset2.csv, the wild type appears multiple times with different measured deltaG values, and some of the names carry suffixes like _wtm, _wte, ... What do these suffixes mean? Are they just different runs of the same experiment? For example:

| aa_seq | name | deltaG_t | deltaG_c |
| --- | --- | --- | --- |
| WIARINAAVRAYGLNYSTFINGLKKAGIELDRKILADMAVRDPQAFEQVVNKVKEALQV | 1GYZ.pdb | 4.229818214592168 | 4.039980035014657 |
| WIARINAAVRAYGLNYSTFINGLKKAGIELDRKILADMAVRDPQAFEQVVNKVKEALQV | 1GYZ.pdb | 3.967958144185877 | 4.059610892216793 |
| WIARINAAVRAYGLNYSTFINGLKKAGIELDRKILADMAVRDPQAFEQVVNKVKEALQV | 1GYZ.pdb_wtm | 3.8272335034264953 | 3.783435490801191 |
| WIARINAAVRAYGLNYSTFINGLKKAGIELDRKILADMAVRDPQAFEQVVNKVKEALQV | 1GYZ.pdb_wte | 3.902831248901988 | 3.90408333226922 |
| WIARINAAVRAYGLNYSTFINGLKKAGIELDRKILADMAVRDPQAFEQVVNKVKEALQV | 1GYZ.pdb_wty | 3.825237814020692 | 3.8727065390422792 |
| WIARINAAVRAYGLNYSTFINGLKKAGIELDRKILADMAVRDPQAFEQVVNKVKEALQV | 1GYZ.pdb_wth | 3.838890387221149 | 3.8030566454963495 |
| QIARINAAVRAYGLNYSTFINGLKKAGIELDRKILADMAVRDPQAFEQVVNKVKEALQV | 1GYZ.pdb_W1Q | 4.216497262605463 | 4.149678061879862 |
| EIARINAAVRAYGLNYSTFINGLKKAGIELDRKILADMAVRDPQAFEQVVNKVKEALQV | 1GYZ.pdb_W1E | 4.658430852788203 | 4.330967807998413 |

Also, if I'd like to calculate ddG = dG_mut - dG_wild, can I take the average of different deltaG values of the same sequence?
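
To make the question concrete, here is a minimal sketch of the calculation I have in mind, using only the deltaG_t values from the rows quoted above. It assumes that averaging all wild-type rows is acceptable, which is exactly what I am unsure about. The helper names `is_wild_type` and `ddg_from_mean_wt` are hypothetical, and so is the rule that a wild-type name either ends in ".pdb" or carries a "_wt" suffix.

```python
from statistics import mean

# (name, deltaG_t) pairs copied from the rows quoted above.
rows = [
    ("1GYZ.pdb", 4.229818214592168),
    ("1GYZ.pdb", 3.967958144185877),
    ("1GYZ.pdb_wtm", 3.8272335034264953),
    ("1GYZ.pdb_wte", 3.902831248901988),
    ("1GYZ.pdb_wty", 3.825237814020692),
    ("1GYZ.pdb_wth", 3.838890387221149),
    ("1GYZ.pdb_W1Q", 4.216497262605463),
    ("1GYZ.pdb_W1E", 4.658430852788203),
]


def is_wild_type(name):
    # Assumption: wild-type rows have nothing after ".pdb" or a "_wt?" suffix.
    _, _, suffix = name.partition(".pdb")
    return suffix == "" or suffix.startswith("_wt")


def ddg_from_mean_wt(rows):
    # ddG = dG_mut - mean(dG_wild), with the mean taken over every wild-type row.
    wt_mean = mean(dg for name, dg in rows if is_wild_type(name))
    return {name: dg - wt_mean for name, dg in rows if not is_wild_type(name)}


ddg = ddg_from_mean_wt(rows)
for name, value in ddg.items():
    print(f"{name}: ddG = {value:+.3f}")
```

The same function could be run on deltaG_c, or restricted to the unsuffixed rows, if averaging across the suffixed entries turns out not to be appropriate.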

grocklin commented 1 year ago

The suffixes refer to different DNA sequences encoding the same protein sequence, all included in the same experiment. Not sure about the duplicates of the ".pdb" (no suffix) line - possibly different experiments. @KotaroTsuboyama can clarify and also point you to the DNA that goes with each suffix.

KotaroTsuboyama commented 1 year ago

"1GYZ.pdb" represents the WT amino acid sequence. In this case, that amino acid sequence was measured in multiple libraries, which is why the unsuffixed name appears more than once. The suffixes like _wtm and _wte represent the same amino acid sequence encoded by different DNA sequences, as Gabe pointed out. For reverse translation we use a codon table optimized for E. coli (because we use an E. coli-based translation system), but to get different DNA sequences for the same amino acids, we intentionally used different codon tables: m is mouse, h is human, y is yeast, and e is a different version of the E. coli table. I hope that makes sense. If you have further questions, please let us know!
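
As a rough illustration of the naming scheme described above, the sketch below maps a wild-type name to the codon table behind it. It is not part of the pipeline: the function name `describe_wt_name` and the parsing rule (split on ".pdb", then read the letter after "_wt") are assumptions based only on the example names in this thread, and treating the unsuffixed name as the default E. coli table is likewise inferred from the comment.

```python
# Suffix letter -> codon table, as listed in the comment above.
CODON_TABLE_BY_LETTER = {
    "m": "mouse",
    "h": "human",
    "y": "yeast",
    "e": "alternative E. coli table",
}


def describe_wt_name(name):
    """Return the codon table for a wild-type name, or None otherwise.

    None is returned for mutant names (e.g. "_W1Q") and for any "_wt"
    letter that is not listed in this thread.
    """
    _, sep, suffix = name.partition(".pdb")
    if not sep:
        return None  # name does not follow the "<id>.pdb[...]" pattern
    if suffix == "":
        # Assumption: no suffix means the default E. coli-optimized table.
        return "default E. coli-optimized table"
    if suffix.startswith("_wt") and len(suffix) == 4:
        return CODON_TABLE_BY_LETTER.get(suffix[3])
    return None


for example in ["1GYZ.pdb", "1GYZ.pdb_wtm", "1GYZ.pdb_wte", "1GYZ.pdb_W1Q"]:
    print(example, "->", describe_wt_name(example))
```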

Rocklin-Lab / cdna-display-proteolysis-pipeline
