fwelsh closed this 6 months ago
@fwelsh, your data frames for the filtered sites actually have all the sites. Can you fix this?
@jbloom sorry about that! Mixed up the df names, should be fixed now.
@fwelsh, the perth09 file is empty (has no data). Can you fix that?
@jbloom done now! I forgot that the Perth09 data currently lists sites as `str`, not `int`. Double-checked the output .csv and everything should be fixed.
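For anyone hitting the same thing, here is a minimal sketch of how a `str` vs `int` mismatch can silently produce an empty filtered file. The toy values and the `selected_sites` list are made up for illustration, and the cast assumes every site label parses as an integer, which may not hold for the real Perth09 numbering.

```python
import pandas as pd

# Hypothetical toy data: 'site' stored as strings, as described for Perth09.
perth09 = pd.DataFrame({
    "site": ["144", "145", "189"],
    "escape": [0.8, 0.1, 1.2],
})
selected_sites = [145, 189]  # integers; hypothetical values, not the real site list

# Comparing string labels against integers matches nothing,
# so the filtered frame comes out empty without raising an error.
empty = perth09[perth09["site"].isin(selected_sites)]

# Casting to int first (assumes all labels are integer-like) gives the intended rows.
perth09["site"] = perth09["site"].astype(int)
filtered = perth09[perth09["site"].isin(selected_sites)]
```

Checking `len(filtered)` before writing the .csv would catch this kind of mismatch early.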
Our previous PCA-UMAP analysis worked for the current HK/19 DMS data, but it fails to capture trends in the Perth09 data because those experiments were much noisier, so the groups are heavily skewed by irrelevant background noise. We could either adjust our normalization to flatten out this noise, or filter the data to the relevant antigenic sites before analyzing.
Filtering the data is probably the easier and more effective approach. I've generated dfs of escape data for both HK/19 and Perth09, with columns ['site', 'wildtype', 'mutant', 'escape', 'serum', 'cohort', 'site_escape_sum', 'site_escape_mean']. Note that 'escape' holds the beta values (not the IC90s), averaged between libA and libB. There's a df for the full protein and one for the 24 selected sites that I've used for other summary plots (labeled as '_filt_sites.csv'). These .csv files, along with the notebook used to generate them, are all in scratch_notebooks/figure_drafts/umap_analysis/.
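Roughly, the filtering step looks like the sketch below. This is not the notebook code: the toy rows, the serum and cohort names, and `selected_sites` are placeholders, and computing the site-level columns per serum with a groupby is my assumption about how 'site_escape_sum' and 'site_escape_mean' are defined.

```python
import pandas as pd

# Hypothetical toy escape data with the same column names as the real dfs.
escape_df = pd.DataFrame({
    "site": [145, 145, 189, 189, 200, 200],
    "wildtype": ["N", "N", "K", "K", "G", "G"],
    "mutant": ["K", "S", "E", "R", "A", "D"],
    "escape": [1.0, 0.5, 2.0, 0.0, 0.1, 0.3],
    "serum": ["serum_1"] * 6,
    "cohort": ["cohort_a"] * 6,
})

# Assumed definition: site-level summaries are taken per serum and site.
per_site = escape_df.groupby(["serum", "site"])["escape"]
escape_df["site_escape_sum"] = per_site.transform("sum")
escape_df["site_escape_mean"] = per_site.transform("mean")

# Keep only the selected sites (placeholder list, not the real 24 sites).
selected_sites = [145, 189]
filt_df = escape_df[escape_df["site"].isin(selected_sites)].reset_index(drop=True)

# Sanity checks: the filtered frame holds only selected sites and is smaller
# than the full frame (guards against writing out the unfiltered df by mistake).
assert set(filt_df["site"]) <= set(selected_sites)
assert len(filt_df) < len(escape_df)
```

The two assertions at the end are there to catch both problems from earlier in this thread: a "filtered" df that still has every site, and a filter that matches nothing.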
The current goal is to test running MDS on its own, using either mutation-level or site-level escape, and see how the results look when we use the filtered sites. @jbloom let me know if you have any questions about the data!
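As a starting point, here is a rough sketch of what the MDS test could look like, assuming scikit-learn's `MDS` with default Euclidean dissimilarities on a serum-by-feature matrix. The data are random placeholders, `run_mds` is a hypothetical helper, and summing escape per site for the site-level matrix is an assumption (the existing 'site_escape_sum' column could be used instead).

```python
import numpy as np
import pandas as pd
from sklearn.manifold import MDS

# Random placeholder data in the same long format as the escape dfs.
rng = np.random.default_rng(0)
sera = [f"serum_{i}" for i in range(6)]
sites = [145, 189, 193]
mutants = ["A", "D", "K"]
rows = [
    {"serum": s, "site": site, "mutant": m, "escape": rng.random()}
    for s in sera for site in sites for m in mutants
]
df = pd.DataFrame(rows)

# Site-level matrix: one row per serum, one column per site (summed escape).
site_matrix = df.pivot_table(
    index="serum", columns="site", values="escape", aggfunc="sum", fill_value=0
)

# Mutation-level matrix: one row per serum, one column per (site, mutant) pair.
mut_matrix = df.pivot_table(
    index="serum", columns=["site", "mutant"], values="escape", fill_value=0
)

def run_mds(matrix, seed=0):
    """Embed the rows of a serum-by-feature matrix in 2D with MDS (hypothetical helper)."""
    mds = MDS(n_components=2, random_state=seed)
    coords = mds.fit_transform(matrix.to_numpy())
    return pd.DataFrame(coords, index=matrix.index, columns=["mds1", "mds2"])

site_coords = run_mds(site_matrix)
mut_coords = run_mds(mut_matrix)
```

Each output row is one serum, so the coordinates can be merged back with 'cohort' for plotting. Fixing `random_state` keeps the layout reproducible between runs.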