Open loodvn opened 1 year ago
Actually, I see these numbers match the new Nature manuscript's Extended Data Table 1, so all good!
Could you perhaps update the README (under Zenodo/Processed_K50_dG_datasets.zip) to avoid future confusion about the dataset numbering? Thanks!
Hi there!
I'm trying to filter the data to get the Group 0 set, but I'm getting a slightly different number of sequences than reported in the paper.
Q1: I assume that Processed_K50_dG_datasets/Tsuboyama2023_Dataset2_Dataset3_20230416.csv is the same as the "Dataset 1 and Dataset 2" file referenced in the paper (and that this comes from K50_dG_Dataset1_Dataset2)? Tsuboyama2023_Dataset1_20230416.csv has 1,841,286 lines?
Q2: How do I filter this file to get the Group 0 variants? I'm trying to reproduce the number of sequences from Table S1 (586,938 total sequences: 434,556 singles and 152,382 doubles).
I tried using the Single list CSV file, filtering for DMS_group == G0, and filtering out low-confidence values from the ddG_ML_float column. But then I get:
Could you please let me know what I've missed? I assume there's another filtering step I haven't done.
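For concreteness, the filtering described above can be sketched roughly as follows. This is only an illustration of the two steps, not the authors' pipeline: the column names DMS_group and ddG_ML_float are the ones mentioned above, but the helper names are made up, and the sketch assumes that low-confidence values show up as blank or non-numeric entries, which may not match how the file actually encodes them.

```python
import csv
import math


def is_confident(value):
    """Return True if value parses as a finite float.

    Assumption: low-confidence ddG entries are stored as blanks or
    non-numeric placeholders; the real encoding may differ.
    """
    try:
        return math.isfinite(float(value))
    except (TypeError, ValueError):
        return False


def filter_group0(rows, group_col="DMS_group", group_value="G0",
                  ddg_col="ddG_ML_float"):
    """Keep rows in the requested group whose ddG value looks confident."""
    return [
        row for row in rows
        if row.get(group_col) == group_value
        and is_confident(row.get(ddg_col))
    ]


def load_rows(path):
    """Read a CSV file into a list of dicts keyed by column name."""
    with open(path, newline="") as handle:
        return list(csv.DictReader(handle))
```

Calling filter_group0(load_rows(path)) on the Single list CSV and taking len() of the result gives the count to compare against Table S1. If the count still differs, the definition of "low confidence" used here is the first thing to check.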