Rocklin-Lab / cdna-display-proteolysis-pipeline


Filtering Dataset #1 (Group 0) #6

Open loodvn opened 1 year ago

loodvn commented 1 year ago

Hi there!

I'm trying to filter the data to get the Group 0 set, but I'm getting a slightly different number of sequences than the paper reports.

Q1: I assume that Processed_K50_dG_datasets/Tsuboyama2023_Dataset2_Dataset3_20230416.csv is the same as the "Dataset 1 and Dataset 2" file referenced in the paper (and that this comes from K50_dG_Dataset1_Dataset2)?

Q2: How do I filter this file to get the Group 0 variants? I'm trying to reproduce the sequence counts from Table S1 (586,938 total sequences: 434,556 singles and 152,382 doubles).

I tried using the Single list CSV file, filtering for DMS_group == G0, and then filtering out low-confidence values from the ddG_ML_float column. But then I get:

Could you please let me know what I've missed? I assume there's another step of filtering I haven't done.
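
For reference, the filtering attempt described above could be sketched roughly as follows. This is a minimal sketch, not the pipeline's own code. The DMS_group and ddG_ML_float column names come from this post; the mut_type column, the ":" separator for double mutants, and treating empty, non-numeric, or NaN ddG_ML_float entries as low-confidence are assumptions made for illustration. The rows are toy data, not values from the real file.

```python
import csv
import io
import math


def count_group0(rows, group_col="DMS_group", ddg_col="ddG_ML_float",
                 mut_col="mut_type"):
    """Count Group 0 rows that have a usable ddG value, split by mutation count.

    mut_col and the ':' separator are assumed conventions, not confirmed ones.
    """
    counts = {"total": 0, "singles": 0, "doubles": 0}
    for row in rows:
        # Step 1: keep only the Group 0 rows.
        if row[group_col] != "G0":
            continue
        # Step 2: drop rows whose ddG entry is not a finite number
        # (assumed here to be how low-confidence values are marked).
        try:
            ddg = float(row[ddg_col])
        except ValueError:
            continue
        if math.isnan(ddg):
            continue
        # Step 3: classify by number of substitutions (assumed "A1G:C3D" style).
        n_mut = row[mut_col].count(":") + 1
        counts["total"] += 1
        if n_mut == 1:
            counts["singles"] += 1
        elif n_mut == 2:
            counts["doubles"] += 1
    return counts


# Toy rows standing in for the real CSV; none of these values are real data.
toy_csv = "\n".join([
    "DMS_group,ddG_ML_float,mut_type",
    "G0,1.2,A1G",
    "G0,-,A2C",
    "G0,0.5,A1G:C3D",
    "G1,0.3,A1G",
    "G0,,L4P",
])
counts = count_group0(csv.DictReader(io.StringIO(toy_csv)))
print(counts)
```

To run this on the real file, the toy rows would be replaced by a csv.DictReader over the opened CSV, with the column names adjusted to match its header.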

loodvn commented 1 year ago

Actually, I see these numbers match the new Nature manuscript's Extended Data Table 1, so all good!

Could you perhaps update the README (under Zenodo/Processed_K50_dG_datasets.zip) to avoid future confusion about the dataset numbering? Thanks!