AlexsLemonade / OpenPBTA-analysis

The analysis repository for the Open Pediatric Brain Tumor Atlas Project

Include sample sizes in lancet figure #1296

Closed sjspielman closed 2 years ago

sjspielman commented 2 years ago

The S2 figure panel showing lancet WXS/WGS is updated to reflect sample sizes with a new plot subtitle: 98 samples from 13 patients. This information can then be reproducibly included in the manuscript.
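As a minimal sketch of how such a subtitle can be derived from the data rather than typed by hand, the snippet below builds the string from computed counts. The ID vectors are hypothetical placeholders, and passing the result to something like `ggplot2::labs(subtitle = ...)` is an assumption about how the panel is drawn, not a description of the actual script.

```r
# Hypothetical IDs for illustration only; not taken from the OpenPBTA data.
sample_ids      <- c("BS_1", "BS_2", "BS_3")
participant_ids <- c("PT_A", "PT_A", "PT_B")

# Count distinct samples and participants.
n_samples      <- length(unique(sample_ids))
n_participants <- length(unique(participant_ids))

# Build the subtitle text from the counts so it always matches the plotted data.
plot_subtitle <- sprintf("%d samples from %d patients", n_samples, n_participants)
print(plot_subtitle)
```

Because the string is generated from the same objects that feed the plot, the same counts can be reused wherever the manuscript needs them.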

Is the analysis in a mature enough form that the resulting figure(s) and/or table(s) are ready for review?

Yes

Questions for reviewers

Results

What types of results are included (e.g., table, figure)?

Figure with added subtitle

jharenza commented 2 years ago

Just noting to double check the Ns here, per this comment

sjspielman commented 2 years ago

Looking more into the N situation, I think I found the discrepancy (code/figure updated in last commit).

@jharenza calculated the counts as in this comment:

    > matched_participants <- v21 %>%
    +   filter(experimental_strategy != "RNA-Seq") %>%
    +   group_by(Kids_First_Participant_ID) %>%
    +   summarize(strategies = paste0(experimental_strategy, collapse = ",")) %>%
    +   filter(grepl("WXS", strategies) & grepl("WGS", strategies)) %>%
    +   pull(Kids_First_Participant_ID)

    > # Get the biospecimen IDS for these participants.
    > bs <- v21 %>%
    +   filter(Kids_First_Participant_ID %in% matched_participants) %>%
    +   filter(experimental_strategy != "RNA-Seq" & !is.na(pathology_diagnosis)) %>%
    +   select(Kids_First_Participant_ID, Kids_First_Biospecimen_ID, tumor_descriptor, experimental_strategy) %>%
    +   unique() %>%
    +   mutate(pt_desc = paste(Kids_First_Participant_ID, tumor_descriptor, sep = "_")) %>%
    +   group_by(pt_desc, experimental_strategy) %>%
    +   tally()

    > sum(bs$n)
    [1] 52

    > as.data.frame(table(bs$pt_desc))
                                              Var1 Freq
    1                PT_0MXPTTM3_Initial CNS Tumor    2
    2                PT_1E3E6GMF_Initial CNS Tumor    2
    3                PT_9GKVQ9QS_Initial CNS Tumor    2
    4                PT_HGM20MW7_Initial CNS Tumor    2
    5                PT_KBFM551M_Initial CNS Tumor    2
    6  PT_KBFM551M_Progressive Disease Post-Mortem    1
    7                PT_KTRJ8TFY_Initial CNS Tumor    2
    8  PT_KTRJ8TFY_Progressive Disease Post-Mortem    1
    9                PT_KZ56XHJT_Initial CNS Tumor    2
    10                     PT_KZ56XHJT_Progressive    2
    11 PT_KZ56XHJT_Progressive Disease Post-Mortem    1
    12               PT_M23Q0DC3_Initial CNS Tumor    2
    13               PT_M9XXJ4GR_Initial CNS Tumor    2
    14               PT_NK8A49X5_Initial CNS Tumor    2
    15                     PT_NK8A49X5_Progressive    2
    16               PT_QA9WJ679_Initial CNS Tumor    2
    17               PT_VPEMAQBN_Initial CNS Tumor    2
    18               PT_WGVEF96B_Initial CNS Tumor    2

In the figures/scripts/supp-snv-callers-panels.R script, the following code is used, directly adapted from the original notebook.

    # Retrieve all the participant IDs for participants that have both WGS and WXS data.
    matched_participants <- metadata %>%
      filter(experimental_strategy != "RNA-Seq") %>%
      group_by(Kids_First_Participant_ID) %>%
      summarize(strategies = paste0(experimental_strategy, collapse = ",")) %>%
      filter(grepl("WXS", strategies) & grepl("WGS", strategies)) %>%
      pull(Kids_First_Participant_ID) 

    # Get the biospecimen IDS for these participants.
    biospecimens <- metadata %>%
      filter(Kids_First_Participant_ID %in% matched_participants) %>%
      pull(Kids_First_Biospecimen_ID)

    # below is now gone from code but showing here to compare
    #n_participants <- length(matched_participants) # 13
    #n_samples <- length(biospecimens) # 98 ## BUT THIS DISAGREES - MY CALC IS IN THE WRONG PLACE!

    # Set up the Lancet data from the SQL database and only keep the biospecimens we identified.
    lancet <- tbl(con, "lancet") %>%
      select(
        join_cols, "VAF" #matches `cols_to_keep` in original notebook
      ) %>%
      inner_join(
        select(
          tbl(con, "samples"),
          Tumor_Sample_Barcode = Kids_First_Biospecimen_ID,
          experimental_strategy, 
          short_histology,
          Kids_First_Participant_ID
          )
      ) %>%
      filter(Tumor_Sample_Barcode %in% biospecimens) %>%
      as.data.frame()

    ## THIS SHOULD BE WHERE I CALC!! Now the code has this!!
    n_participants <- length(unique(lancet$Kids_First_Participant_ID)) # 13
    n_samples <- length(unique(lancet$Tumor_Sample_Barcode)) # 52!! matches!

Conclusion: 52 samples from 13 patients. I am updating the figure subtitle to match.
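To illustrate why counting before and after the join can disagree, here is a minimal base-R sketch on made-up data. The toy `metadata` and `lancet` data frames and all IDs in them are hypothetical; only the column names are borrowed from the script. It assumes, as one plausible explanation rather than a confirmed diagnosis, that the metadata for matched participants also lists biospecimens (for example RNA-Seq) that never appear in the Lancet table.

```r
# Toy metadata: two participants, each with WGS, WXS, and one RNA-Seq specimen.
# All IDs are invented for illustration.
metadata <- data.frame(
  Kids_First_Participant_ID = c("PT_A", "PT_A", "PT_A", "PT_B", "PT_B", "PT_B"),
  Kids_First_Biospecimen_ID = c("BS_1", "BS_2", "BS_3", "BS_4", "BS_5", "BS_6"),
  experimental_strategy     = c("WGS", "WXS", "RNA-Seq", "WGS", "WXS", "RNA-Seq"),
  stringsAsFactors = FALSE
)

# Toy Lancet calls: DNA specimens only, with several variant rows per specimen.
lancet <- data.frame(
  Tumor_Sample_Barcode = c("BS_1", "BS_1", "BS_2", "BS_4", "BS_5", "BS_5"),
  VAF                  = c(0.41, 0.12, 0.39, 0.50, 0.47, 0.08),
  stringsAsFactors = FALSE
)

# Counting straight from the metadata includes specimens Lancet never saw.
biospecimens    <- metadata$Kids_First_Biospecimen_ID
n_from_metadata <- length(biospecimens)

# Inner join (merge keeps only matching rows), then count what survives.
lancet_joined <- merge(
  lancet, metadata,
  by.x = "Tumor_Sample_Barcode", by.y = "Kids_First_Biospecimen_ID"
)
n_from_lancet  <- length(unique(lancet_joined$Tumor_Sample_Barcode))
n_participants <- length(unique(lancet_joined$Kids_First_Participant_ID))

print(c(metadata = n_from_metadata, lancet = n_from_lancet, participants = n_participants))
```

In this toy case the metadata count is 6 while only 4 distinct samples remain after the join, and the participant count stays at 2. That is the same pattern as in the thread: the sample count changes while the participant count does not.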

sjspielman commented 2 years ago

Whoops, @jharenza, I just realized I forgot to actually request the review here :)