Open ArtemSokolov opened 7 years ago
Hello,
I checked why there is this problem and there are several different archives. mdanderson.org_BRCA.MDA_RPPA_Core.Level_3.2 mdanderson.org_BRCA.MDA_RPPA_Core.Level_3.3 Which I believe they are different revision.
The screen below is one example were the same results is in two archives
I'll need to change the code to take this archive case into consideration.
Thanks, Tiago. Let me know when the new version of your code is available.
So I was not able to understand the indexes (https://wiki.nci.nih.gov/display/TCGA/Protein+Array+Data+Format+Specification) . But so far each index has only one revision, but samples might be in several indexes. The solution to download and read the data I found is below (I'll downloading and preparing by indexes and returning a list of data frames for each index)
library(TCGAbiolinks)
protein <- GDCquery( project = "TCGA-BRCA",
data.type = "Protein expression quantification",
legacy = TRUE,
data.category = "Protein expression",
platform = "MDA_RPPA_Core",
sample.type = "Primary solid Tumor")
results <- getResults(protein)
# get available indexes
idx <- unique(stringr::str_extract(results$archive_file_name, "Level_[0-9]\\.[0-9]"))
protein_list <- lapply(idx, function (index) {
query.aux <- protein
query.aux$results[[1]] <- dplyr::filter(results,grepl(index,results$archive_file_name))
message("Index:", index," - Revision: ",unique(dplyr::filter(results,grepl(index,results$archive_file_name))$archive_revision))
GDCdownload(query.aux,chunks.per.download = 20)
GDCprepare(query.aux,summarizedExperiment = F)
})
names(protein_list) <- idx
# There are repeated cases over the indexes.
table(colnames(protein_list$Level_3.1) %in% colnames(protein_list$Level_3.3))
I added the data here https://drive.google.com/file/d/0B0-8N2fjttG-bnBnZnlVSVJGMDg/view?usp=sharing
I'll update the code, I just need to test the if I did not broke anything.
Thanks, Tiago. I have a follow up question about your code segment: results
data frame doesn't appear to have an archive_file_name
column. Am I missing something?
I'm going to check, also there is the hg19 data in firehose. They might have curated the data. For protein expression, it might be more useful. http://firebrowse.org/?cohort=BRCA&download_dialog=true
It should have if the last version is installed
devtools::install_github("BioinformaticsFMRP/TCGAbiolinks")
I was checking both objects (from DGC and firehose). For duplicated samples in different archives ,firehose uses the last version.
The code below will give almost the same table, however the last archive as samples with different values. I'm not sure if firehose has the last revision.
library(TCGAbiolinks)
protein <- GDCquery( project = "TCGA-BRCA",
data.type = "Protein expression quantification",
legacy = TRUE,
data.category = "Protein expression",
platform = "MDA_RPPA_Core",
sample.type = "Primary solid Tumor")
results <- getResults(protein)
# get available indexes
idx <- unique(stringr::str_extract(results$archive_file_name, "Level_[0-9]\\.[0-9]"))
protein_list <- lapply(idx, function (index) {
query.aux <- protein
query.aux$results[[1]] <- dplyr::filter(results,grepl(index,results$archive_file_name))
message("Index:", index," - Revision: ",unique(dplyr::filter(results,grepl(index,results$archive_file_name))$archive_revision))
# GDCdownload(query.aux,chunks.per.download = 20)
y <- GDCprepare(query.aux,summarizedExperiment = F)
y$`Composite Element REF` <- substr(y$`Composite Element REF`,1,nchar(y$`Composite Element REF`)-4)
y
})
names(protein_list) <- idx
brca.protein <- protein_list %>% Reduce(function(x,y) {
full_join(x,y,by="Composite Element REF")}, .,right = T)
# We will keep the last archive version of a sample (remove .x keep .y)
brca.protein <- brca.protein[,grep("\\.x",colnames(brca.protein),invert = TRUE)]
colnames(brca.protein) <- gsub("\\.y","",colnames(brca.protein))
# Get firehose file
firehose.file <- "http://gdac.broadinstitute.org/runs/stddata__2016_01_28/data/BRCA/20160128/gdac.broadinstitute.org_BRCA.Merge_protein_exp__mda_rppa_core__mdanderson_org__Level_3__protein_normalization__data.Level_3.2016012800.0.0.tar.gz"
downloader::download(firehose.file,basename(firehose.file))
untar(basename(firehose.file))
suppressWarnings({
firehose.protein <- readr::read_tsv(dir(".",pattern = "BRCA.protein_exp__mda",full.names = T,recursive = T),col_names = T,skip = 1)
colnames(firehose.protein) <- colnames(readr::read_tsv("~/Downloads/gdac.broadinstitute.org_BRCA.Merge_protein_exp__mda_rppa_core__mdanderson_org__Level_3__protein_normalization__data.Level_3.2016012800.0.0/BRCA.protein_exp__mda_rppa_core__mdanderson_org__Level_3__protein_normalization__data.data.txt",col_names = T,n_max = 1))
})
# compare both
round(firehose.protein[match(brca.protein$`Composite Element REF`,firehose.protein$`Sample REF`),colnames(brca.protein)[2:10]], digits = 3) == round(brca.protein[,2:10],digits = 3)
Thanks for all your pointers, Tiago. I really appreciate it.
Dear developers,
I am trying to retrieve some RPPA data for the TCGA BRCA project, and I'm running into some issues with duplicate filename entries. My query is as follows:
When the query is composed, I get the following warning:
Warning: There are more than one file for the same case. Please verify query results. You can use the command View(getResults(query)) in rstudio
. However, the duplicated entries differ only in columns related to the file properties:This makes it impossible to modify the query to sub-filter the data further and prevent duplication. If I ignore the warning and download the data, the subsequent call to
GDCprepare
terminates with the following error message:Error in if (any(duplicated(query$results[[1]]$cases)) & query$data.type != : missing value where TRUE/FALSE needed
.How does one go about preparing the downloaded RPPA data, when duplicated entries are present?