Filtering Dataset #1 (Group 0)

Hi there!

I'm trying to filter the data to get the Group 0 set, but I'm getting a slightly different number of sequences than reported in the paper.

Q1: I assume that Processed_K50_dG_datasets/Tsuboyama2023_Dataset2_Dataset3_20230416.csv is the same as the "Dataset 1 and Dataset 2" file referenced in the paper (and that this comes from K50_dG_Dataset1_Dataset2)?

I ask because this processed file has 776,299 lines, whereas Tsuboyama2023_Dataset1_20230416.csv has 1,841,286 lines.

Q2: How do I filter this file to get the Group 0 variants? I'm trying to reproduce the number of sequences from Table S1 (586,938 total sequences: 434,556 singles and 152,382 doubles).

I tried using the Single list CSV file, filtering for DMS_group == G0, and then filtering out low-confidence values from the ddG_ML_float column. But then I get:

607,839 total instead of 586,938 in Table S1
159,051 doubles instead of 152,382 in Table S1
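
A minimal sketch of the filtering described above, in case it helps pin down where the counts diverge. It uses only the two column names mentioned here (DMS_group and ddG_ML_float). Everything else is an assumption: that low-confidence values show up as non-numeric entries (such as "-") or NaN, and that a hypothetical mut_type column lists substitutions joined by ":" so that singles and doubles can be told apart. The demo rows are made up and are not real data.

```python
import csv
import io
import math


def count_group0(rows, group="G0"):
    """Count single and double mutants in `group` that have a usable ddG value."""
    counts = {"single": 0, "double": 0}
    for row in rows:
        if row["DMS_group"] != group:
            continue
        # Assumption: low-confidence entries are non-numeric (e.g. "-") or NaN.
        try:
            ddg = float(row["ddG_ML_float"])
        except (TypeError, ValueError):
            continue
        if math.isnan(ddg):
            continue
        # Assumption: a hypothetical "mut_type" column joins substitutions with ":".
        # Any row without ":" counts as a single here, so non-substitution rows
        # (if the file contains any) would need separate handling.
        n_subs = len(row["mut_type"].split(":"))
        if n_subs == 1:
            counts["single"] += 1
        elif n_subs == 2:
            counts["double"] += 1
    counts["total"] = counts["single"] + counts["double"]
    return counts


# Tiny in-memory stand-in for the real CSV (made-up rows, not real data).
demo_csv = io.StringIO(
    "mut_type,DMS_group,ddG_ML_float\n"
    "A1C,G0,0.52\n"
    "A1C:D2E,G0,-1.10\n"
    "G5W,G0,-\n"
    "K7R,G1,0.30\n"
)
demo_counts = count_group0(csv.DictReader(demo_csv))
print(demo_counts)
```

For the real file, the same function could be fed `csv.DictReader(open(path, newline=""))`. From the numbers above, the gap is 20,901 extra sequences in total, of which 6,669 are doubles, which leaves 14,232 extra singles (448,788 instead of 434,556 in Table S1).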

Could you please let me know what I've missed? I assume there's another filtering step that I haven't applied.

Rocklin-Lab / cdna-display-proteolysis-pipeline, issue #6