Closed sjspielman closed 1 year ago
Now ready for review!
I wanted to look further into the stranded samples that group with polyA...
#### Investigate the stranded samples that are clustering with polyA
# get their biospecimen ids - THERE ARE ONLY 6!
biospecimen_ids <- umap_df %>%
# y > 10 is region of UMAP where polyA's are
filter(UMAP2 > 10,
RNA_library == "stranded") %>%
pull(Kids_First_Biospecimen_ID)
# get their sample ids
sample_ids <- metadata_df %>%
filter(Kids_First_Biospecimen_ID %in% stranded_ids) %>%
pull(sample_id)
# what are all the RNASeq's associated with that sample?
metadata_df %>%
filter(sample_id %in% sample_ids,
experimental_strategy == "RNA-Seq") %>%
select(sample_id,
Kids_First_Biospecimen_ID,
RNA_library) %>%
mutate(one_of_those_ids = Kids_First_Biospecimen_ID %in% biospecimen_ids) %>%
arrange(sample_id)
## A tibble: 13 x 4
# sample_id Kids_First_Biospecimen_ID RNA_library one_of_those_ids
# <chr> <chr> <chr> <lgl>
# 1 7316-255 BS_FN07P04C stranded TRUE
# 2 7316-255 BS_W4H1D4Y6 poly-A FALSE
# 3 7316-4998 BS_FXJY0MNH stranded TRUE
# 4 7316-536 BS_8QB4S4VA stranded TRUE
# 5 7316-536 BS_QKT3TJVK poly-A FALSE
# 6 7316-85 BS_59ZJWJTF stranded FALSE
# 7 7316-85 BS_BYCX6VK1 poly-A FALSE
# 8 7316-85 BS_QYPHA40N stranded FALSE
# 9 7316-85 BS_SB12W1XT stranded TRUE
# 10 A16915 BS_7WM3MNZ0 poly-A FALSE
# 11 A16915 BS_KABQQA0T stranded TRUE
# 12 A18777 BS_68KX6A42 poly-A FALSE
# 13 A18777 BS_D7XRFE0R stranded TRUE
# How common is it in the cohort to have polyA and stranded for same sample_id?
# Not _terribly_ common
metadata_df %>%
filter(experimental_strategy == "RNA-Seq") %>%
count(sample_id) %>%
filter(n > 1) %>%
tally()
# A tibble: 1 x 1
# n
# <int>
#1 23
All of those stranded samples also have polyA's, which does sometimes happen across the cohort but for more than just those 6. Mislabeling or sample prep problems would not surprise me.
I'll also note that this plot was made with slightly different wrangling code, such that the RNA library strategies were annotated based on their presence in the expression data tables, not as recorded in the metadata file. The resulting plot was the same, so it is what it is!
The toggled in wrangling code:
umap_df <- data.frame(combined_umap$layout) %>%
rename_all(.funs = gsub, pattern = "^X", replacement = "UMAP") %>%
rownames_to_column(var = "Kids_First_Biospecimen_ID") %>%
# Create `RNA_library` column based on presence in FPKM tables:
mutate(
RNA_library = case_when(
Kids_First_Biospecimen_ID %in% colnames(polya) ~ "poly-A",
Kids_First_Biospecimen_ID %in% colnames(stranded) ~ "stranded",
TRUE ~ "an error has occurred"
)
) %>%
# join with cancer group strategy
left_join(
select(metadata_df,
Kids_First_Biospecimen_ID,
cancer_group),
by = "Kids_First_Biospecimen_ID"
) %>%
# join with cancer_group_display
inner_join(
select(palette_df,
cancer_group,
cancer_group_display),
by = "cancer_group"
) %>%
# remove NA cancer group displays
drop_na(cancer_group_display)
This code does not run through CI, so we can merge before checks are completed.
Closes #1695 This PR adds a script to generate a UMAP panel for Figure S7 that shows polyA and stranded libraries. I've colored by cancer group and "shaped" by library strategy. This is being opened as a draft, since I probably should have but did not stack on #1698, because conflicts are expected once #1698 is merged. Still forthcoming are updates to
generate-figures.sh
that incorporate this new script. But in the meantime, here's the new panel!I'm going to also note that I suspect we may want to reduce the number of cancer group legends in the overall figure S7, since this is going to be duplicated at least once. I plan to handle those potential code tweaks in the PR that compiles the full S7 figure.