Include sample sizes in lancet figure

sjspielman commented 2 years ago

The S2 figure panel showing lancet WXS/WGS is updated to reflect sample sizes with a new plot subtitle: 98 samples from 13 patients. This information can then be reproducibly included in the manuscript.
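
As a rough illustration of building the subtitle from computed counts rather than hard-coded text, here is a minimal sketch. The `lancet_toy` data frame, its values, and the `vaf_plot` and `subtitle_text` names are hypothetical stand-ins and are not taken from the actual figure script. Only `ggplot2::labs(subtitle = ...)` is assumed to match how the real panel is labeled.

```r
library(ggplot2)

# Hypothetical toy stand-in for the Lancet table (one row per variant call).
lancet_toy <- data.frame(
  Tumor_Sample_Barcode = c("BS_1", "BS_1", "BS_2", "BS_3"),
  Kids_First_Participant_ID = c("PT_A", "PT_A", "PT_A", "PT_B"),
  experimental_strategy = c("WGS", "WGS", "WXS", "WGS"),
  VAF = c(0.40, 0.10, 0.35, 0.50)
)

# Count distinct samples and participants from the data that is plotted.
n_samples <- length(unique(lancet_toy$Tumor_Sample_Barcode))
n_participants <- length(unique(lancet_toy$Kids_First_Participant_ID))

# Build the subtitle text from the counts so it tracks the data.
subtitle_text <- paste(n_samples, "samples from", n_participants, "patients")

# Attach the subtitle to a placeholder plot (not rendered here).
vaf_plot <- ggplot(lancet_toy, aes(x = experimental_strategy, y = VAF)) +
  geom_boxplot() +
  labs(subtitle = subtitle_text)
```

Because the counts come from the same data frame that feeds the plot, the subtitle changes automatically if the underlying data are updated.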

Is the analysis in a mature enough form that the resulting figure(s) and/or table(s) are ready for review?

Yes

Questions for reviewers

Is the phrasing OK (patient vs. participant)?
Should we leave this out of the plot and just print it from the script? Since this isn't a notebook for quick viewing of results, I thought adding it directly to the plot might be better, especially since this is for the wild west of SI.

Results

What types of results are included (e.g., table, figure)?

Figure with added subtitle

jharenza commented 2 years ago

Just noting to double check the Ns here, per this comment

sjspielman commented 2 years ago

Looking more into the N situation, I think I found the discrepancy (code/figure updated in last commit).

@jharenza calculated the counts as in this comment:

> matched_participants <- v21 %>%
+   filter(experimental_strategy != "RNA-Seq") %>%
+   group_by(Kids_First_Participant_ID) %>%
+   summarize(strategies = paste0(experimental_strategy, collapse = ",")) %>%
+   filter(grepl("WXS", strategies) & grepl("WGS", strategies)) %>%
+   pull(Kids_First_Participant_ID) 

> # Get the biospecimen IDS for these participants.
> bs <- v21 %>%
+   filter(Kids_First_Participant_ID %in% matched_participants) %>%
+   filter(experimental_strategy != "RNA-Seq" & !is.na(pathology_diagnosis)) %>%
+   select(Kids_First_Participant_ID, Kids_First_Biospecimen_ID, tumor_descriptor, experimental_strategy) %>%
+   unique() %>%
+   mutate(pt_desc = paste(Kids_First_Participant_ID, tumor_descriptor, sep = "_")) %>%
+   group_by(pt_desc, experimental_strategy) %>%
+   tally()

> sum(bs$n)
[1] 52

> as.data.frame(table(bs$pt_desc))
                                          Var1 Freq
1                PT_0MXPTTM3_Initial CNS Tumor    2
2                PT_1E3E6GMF_Initial CNS Tumor    2
3                PT_9GKVQ9QS_Initial CNS Tumor    2
4                PT_HGM20MW7_Initial CNS Tumor    2
5                PT_KBFM551M_Initial CNS Tumor    2
6  PT_KBFM551M_Progressive Disease Post-Mortem    1
7                PT_KTRJ8TFY_Initial CNS Tumor    2
8  PT_KTRJ8TFY_Progressive Disease Post-Mortem    1
9                PT_KZ56XHJT_Initial CNS Tumor    2
10                     PT_KZ56XHJT_Progressive    2
11 PT_KZ56XHJT_Progressive Disease Post-Mortem    1
12               PT_M23Q0DC3_Initial CNS Tumor    2
13               PT_M9XXJ4GR_Initial CNS Tumor    2
14               PT_NK8A49X5_Initial CNS Tumor    2
15                     PT_NK8A49X5_Progressive    2
16               PT_QA9WJ679_Initial CNS Tumor    2
17               PT_VPEMAQBN_Initial CNS Tumor    2
18               PT_WGVEF96B_Initial CNS Tumor    2

In the figures/scripts/supp-snv-callers-panels.R script, the following code is used, directly adapted from the original notebook:

    # Retrieve all the participant IDs for participants that have both WGS and WXS data.
    matched_participants <- metadata %>%
      filter(experimental_strategy != "RNA-Seq") %>%
      group_by(Kids_First_Participant_ID) %>%
      summarize(strategies = paste0(experimental_strategy, collapse = ",")) %>%
      filter(grepl("WXS", strategies) & grepl("WGS", strategies)) %>%
      pull(Kids_First_Participant_ID) 

    # Get the biospecimen IDS for these participants.
    biospecimens <- metadata %>%
      filter(Kids_First_Participant_ID %in% matched_participants) %>%
      pull(Kids_First_Biospecimen_ID)

    # below is now gone from code but showing here to compare
    #n_participants <- length(matched_participants) # 13
    #n_samples <- length(biospecimens) # 98 ## BUT THIS DISAGREES - MY CALC IS IN THE WRONG PLACE!

    # Set up the Lancet data from the SQL database and only keep the biospecimens we identified.
    lancet <- tbl(con, "lancet") %>%
      select(
        join_cols, "VAF" #matches `cols_to_keep` in original notebook
      ) %>%
      inner_join(
        select(
          tbl(con, "samples"),
          Tumor_Sample_Barcode = Kids_First_Biospecimen_ID,
          experimental_strategy, 
          short_histology,
          Kids_First_Participant_ID
          )
      ) %>%
      filter(Tumor_Sample_Barcode %in% biospecimens) %>%
      as.data.frame()

    ## THIS SHOULD BE WHERE I CALC!! Now the code has this!!
    n_participants <- length(unique(lancet$Kids_First_Participant_ID)) # 13
    n_samples <- length(unique(lancet$Tumor_Sample_Barcode)) # 52!! matches!

Conclusion: 52 samples from 13 patients. Updating this
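
To make the discrepancy concrete, here is a small sketch with made-up data. The `metadata` and `lancet` data frames below are hypothetical toys, not the real OpenPBTA tables. The sketch assumes the extra biospecimens in the metadata-based count are ones with no rows in the Lancet table, such as RNA-Seq samples, which the `biospecimens` pull above does not filter out.

```r
library(dplyr)

# Hypothetical metadata: PT_A has both WGS and WXS, PT_B has WGS only.
metadata <- data.frame(
  Kids_First_Participant_ID = c("PT_A", "PT_A", "PT_A", "PT_A", "PT_B"),
  Kids_First_Biospecimen_ID = c("BS_1", "BS_2", "BS_3", "BS_4", "BS_5"),
  experimental_strategy = c("WGS", "WXS", "RNA-Seq", "WGS", "WGS")
)

# Hypothetical Lancet calls: BS_3 (RNA-Seq) and BS_4 have no rows here.
lancet <- data.frame(
  Tumor_Sample_Barcode = c("BS_1", "BS_1", "BS_2", "BS_5"),
  Kids_First_Participant_ID = c("PT_A", "PT_A", "PT_A", "PT_B"),
  VAF = c(0.40, 0.10, 0.35, 0.50)
)

# Same participant-matching logic as in the script.
matched_participants <- metadata %>%
  filter(experimental_strategy != "RNA-Seq") %>%
  group_by(Kids_First_Participant_ID) %>%
  summarize(strategies = paste0(experimental_strategy, collapse = ",")) %>%
  filter(grepl("WXS", strategies) & grepl("WGS", strategies)) %>%
  pull(Kids_First_Participant_ID)

# All biospecimens for matched participants, with no strategy filter.
biospecimens <- metadata %>%
  filter(Kids_First_Participant_ID %in% matched_participants) %>%
  pull(Kids_First_Biospecimen_ID)

# Counting here overstates the sample size shown in the plot.
n_from_metadata <- length(biospecimens)

# Counting after restricting the Lancet data reflects what is plotted.
lancet_matched <- lancet %>%
  filter(Tumor_Sample_Barcode %in% biospecimens)
n_samples <- n_distinct(lancet_matched$Tumor_Sample_Barcode)
n_participants <- n_distinct(lancet_matched$Kids_First_Participant_ID)
```

In this toy case the metadata-based count is 4 while the count taken from the filtered Lancet table is 2, which mirrors the 98 vs. 52 difference in kind (though not necessarily in cause).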

sjspielman commented 2 years ago

Whoops, @jharenza, I just realized I forgot to actually request the review here :)

AlexsLemonade / OpenPBTA-analysis