Fig S7 UMAP panel with both polyA and stranded

sjspielman commented 1 year ago

Closes #1695 This PR adds a script to generate a UMAP panel for Figure S7 that shows polyA and stranded libraries. I've colored by cancer group and "shaped" by library strategy. This is being opened as a draft, since I probably should have but did not stack on #1698, because conflicts are expected once #1698 is merged. Still forthcoming are updates to generate-figures.sh that incorporate this new script. But in the meantime, here's the new panel!

I'm going to also note that I suspect we may want to reduce the number of cancer group legends in the overall figure S7, since this is going to be duplicated at least once. I plan to handle those potential code tweaks in the PR that compiles the full S7 figure.

sjspielman commented 1 year ago

Now ready for review!

sjspielman commented 1 year ago

I wanted to look further into the stranded samples that group with polyA...


#### Investigate the stranded samples that are clustering with polyA

# get their biospecimen ids - THERE ARE ONLY 6!
biospecimen_ids <- umap_df %>% 
  # y > 10 is region of UMAP where polyA's are
  filter(UMAP2 > 10, 
         RNA_library == "stranded") %>% 
  pull(Kids_First_Biospecimen_ID) 

# get their sample ids
sample_ids <- metadata_df %>% 
  filter(Kids_First_Biospecimen_ID %in% stranded_ids) %>% 
  pull(sample_id)

# what are all the RNASeq's associated with that sample?
metadata_df %>%
  filter(sample_id %in% sample_ids, 
         experimental_strategy == "RNA-Seq") %>% 
  select(sample_id, 
         Kids_First_Biospecimen_ID, 
         RNA_library) %>%
  mutate(one_of_those_ids = Kids_First_Biospecimen_ID %in% biospecimen_ids) %>%
  arrange(sample_id)
## A tibble: 13 x 4
#   sample_id Kids_First_Biospecimen_ID RNA_library one_of_those_ids
#   <chr>     <chr>                     <chr>       <lgl>           
# 1 7316-255  BS_FN07P04C               stranded    TRUE            
# 2 7316-255  BS_W4H1D4Y6               poly-A      FALSE           
# 3 7316-4998 BS_FXJY0MNH               stranded    TRUE            
# 4 7316-536  BS_8QB4S4VA               stranded    TRUE            
# 5 7316-536  BS_QKT3TJVK               poly-A      FALSE           
# 6 7316-85   BS_59ZJWJTF               stranded    FALSE           
# 7 7316-85   BS_BYCX6VK1               poly-A      FALSE           
# 8 7316-85   BS_QYPHA40N               stranded    FALSE           
# 9 7316-85   BS_SB12W1XT               stranded    TRUE            
# 10 A16915   BS_7WM3MNZ0               poly-A      FALSE           
# 11 A16915   BS_KABQQA0T               stranded    TRUE            
# 12 A18777   BS_68KX6A42               poly-A      FALSE           
# 13 A18777   BS_D7XRFE0R               stranded    TRUE  

# How common is it in the cohort to have polyA and stranded for same sample_id?
# Not _terribly_ common
metadata_df %>%
  filter(experimental_strategy == "RNA-Seq") %>%
  count(sample_id) %>%
  filter(n > 1) %>%
  tally()
#  A tibble: 1 x 1
#       n
#   <int>
#1    23

All of those stranded samples also have polyA's, which does sometimes happen across the cohort but for more than just those 6. Mislabeling or sample prep problems would not surprise me.

sjspielman commented 1 year ago

I'll also note that this plot was made with slightly different wrangling code, such that the RNA library strategies were annotated based on their presence in the expression data tables, not as recorded in the metadata file. The resulting plot was the same, so it is what it is!

The toggled in wrangling code:

umap_df <- data.frame(combined_umap$layout) %>%
  rename_all(.funs = gsub, pattern = "^X", replacement = "UMAP") %>%
  rownames_to_column(var = "Kids_First_Biospecimen_ID") %>%
  # Create `RNA_library` column based on presence in FPKM tables:
  mutate(
    RNA_library = case_when(
      Kids_First_Biospecimen_ID %in% colnames(polya) ~ "poly-A",
      Kids_First_Biospecimen_ID %in% colnames(stranded) ~ "stranded",
      TRUE ~ "an error has occurred"
    )
  ) %>%
  # join with cancer group strategy
  left_join(
    select(metadata_df,
           Kids_First_Biospecimen_ID,
           cancer_group),
    by = "Kids_First_Biospecimen_ID"
  ) %>%
  # join with cancer_group_display
  inner_join(
    select(palette_df,
           cancer_group,
           cancer_group_display),
    by = "cancer_group"
  ) %>%
  # remove NA cancer group displays
  drop_na(cancer_group_display)

sjspielman commented 1 year ago

This code does not run through CI, so we can merge before checks are completed.

AlexsLemonade / OpenPBTA-analysis

Fig S7 UMAP panel with both polyA and stranded #1699