BioinformaticsFMRP / TCGAbiolinks

TCGAbiolinks
http://bioconductor.org/packages/devel/bioc/vignettes/TCGAbiolinks/inst/doc/index.html
298 stars 112 forks source link

Duplicate entries in RPPA data prevent loading #128

Open ArtemSokolov opened 7 years ago

ArtemSokolov commented 7 years ago

Dear developers,

I am trying to retrieve some RPPA data for the TCGA BRCA project, and I'm running into some issues with duplicate filename entries. My query is as follows:

GDCquery( project = "TCGA-BRCA",
         legacy = TRUE,
         data.category = "Protein expression",
         platform = "MDA_RPPA_Core",
         sample.type = "Primary solid Tumor" )

When the query is composed, I get the following warning: Warning: There are more than one file for the same case. Please verify query results. You can use the command View(getResults(query)) in rstudio. However, the duplicated entries differ only in columns related to the file properties:

getResults(query) %>% group_by( file_name ) %>% summarize_all( funs(n_distinct) ) %>%
    select_if( function(x) any(x > 1) ) %>% colnames
# [1] "file_name"        "updated_datetime" "file_id"          "file_size"
# [5] "id"               "created_datetime" "md5sum"      

This makes it impossible to modify the query to sub-filter the data further and prevent duplication. If I ignore the warning and download the data, the subsequent call to GDCprepare terminates with the following error message: Error in if (any(duplicated(query$results[[1]]$cases)) & query$data.type != : missing value where TRUE/FALSE needed.

How does one go about preparing the downloaded RPPA data, when duplicated entries are present?

tiagochst commented 7 years ago

Hello,

I checked why there is this problem and there are several different archives. mdanderson.org_BRCA.MDA_RPPA_Core.Level_3.2 mdanderson.org_BRCA.MDA_RPPA_Core.Level_3.3 Which I believe they are different revision.

The screen below is one example were the same results is in two archives

screen shot 2017-08-04 at 9 49 44 am

I'll need to change the code to take this archive case into consideration.

ArtemSokolov commented 7 years ago

Thanks, Tiago. Let me know when the new version of your code is available.

tiagochst commented 7 years ago

So I was not able to understand the indexes (https://wiki.nci.nih.gov/display/TCGA/Protein+Array+Data+Format+Specification) . But so far each index has only one revision, but samples might be in several indexes. The solution to download and read the data I found is below (I'll downloading and preparing by indexes and returning a list of data frames for each index)

library(TCGAbiolinks)
protein <- GDCquery( project = "TCGA-BRCA",
                     data.type =  "Protein expression quantification",
                     legacy = TRUE,
                     data.category = "Protein expression",
                     platform = "MDA_RPPA_Core",
                     sample.type = "Primary solid Tumor")

results <- getResults(protein)
# get available indexes
idx <- unique(stringr::str_extract(results$archive_file_name, "Level_[0-9]\\.[0-9]"))

protein_list <- lapply(idx, function (index) {
    query.aux <- protein
    query.aux$results[[1]] <- dplyr::filter(results,grepl(index,results$archive_file_name))
    message("Index:", index," - Revision: ",unique(dplyr::filter(results,grepl(index,results$archive_file_name))$archive_revision))
    GDCdownload(query.aux,chunks.per.download = 20)
    GDCprepare(query.aux,summarizedExperiment = F)
})
names(protein_list) <- idx

# There are repeated cases over the indexes.
table(colnames(protein_list$Level_3.1) %in% colnames(protein_list$Level_3.3))

I added the data here https://drive.google.com/file/d/0B0-8N2fjttG-bnBnZnlVSVJGMDg/view?usp=sharing

I'll update the code, I just need to test the if I did not broke anything.

ArtemSokolov commented 7 years ago

Thanks, Tiago. I have a follow up question about your code segment: results data frame doesn't appear to have an archive_file_name column. Am I missing something?

tiagochst commented 7 years ago

I'm going to check, also there is the hg19 data in firehose. They might have curated the data. For protein expression, it might be more useful. http://firebrowse.org/?cohort=BRCA&download_dialog=true

tiagochst commented 7 years ago

It should have if the last version is installed devtools::install_github("BioinformaticsFMRP/TCGAbiolinks")

screen shot 2017-08-11 at 2 20 21 pm
tiagochst commented 7 years ago

I was checking both objects (from DGC and firehose). For duplicated samples in different archives ,firehose uses the last version.

The code below will give almost the same table, however the last archive as samples with different values. I'm not sure if firehose has the last revision.

library(TCGAbiolinks)
protein <- GDCquery( project = "TCGA-BRCA",
                     data.type =  "Protein expression quantification",
                     legacy = TRUE,
                     data.category = "Protein expression",
                     platform = "MDA_RPPA_Core",
                     sample.type = "Primary solid Tumor")

results <- getResults(protein)
# get available indexes
idx <- unique(stringr::str_extract(results$archive_file_name, "Level_[0-9]\\.[0-9]"))

protein_list <- lapply(idx, function (index) {
    query.aux <- protein
    query.aux$results[[1]] <- dplyr::filter(results,grepl(index,results$archive_file_name))
    message("Index:", index," - Revision: ",unique(dplyr::filter(results,grepl(index,results$archive_file_name))$archive_revision))
    # GDCdownload(query.aux,chunks.per.download = 20)
    y <- GDCprepare(query.aux,summarizedExperiment = F)
    y$`Composite Element REF` <- substr(y$`Composite Element REF`,1,nchar(y$`Composite Element REF`)-4)
    y
})
names(protein_list) <- idx

brca.protein <- protein_list %>% Reduce(function(x,y) {
    full_join(x,y,by="Composite Element REF")}, .,right = T)

# We will keep the last archive version of a sample (remove .x keep .y)
brca.protein <- brca.protein[,grep("\\.x",colnames(brca.protein),invert = TRUE)]
colnames(brca.protein) <- gsub("\\.y","",colnames(brca.protein))

# Get firehose file
firehose.file <- "http://gdac.broadinstitute.org/runs/stddata__2016_01_28/data/BRCA/20160128/gdac.broadinstitute.org_BRCA.Merge_protein_exp__mda_rppa_core__mdanderson_org__Level_3__protein_normalization__data.Level_3.2016012800.0.0.tar.gz"
downloader::download(firehose.file,basename(firehose.file))
untar(basename(firehose.file))
suppressWarnings({
    firehose.protein <- readr::read_tsv(dir(".",pattern = "BRCA.protein_exp__mda",full.names = T,recursive = T),col_names = T,skip = 1)
    colnames(firehose.protein) <- colnames(readr::read_tsv("~/Downloads/gdac.broadinstitute.org_BRCA.Merge_protein_exp__mda_rppa_core__mdanderson_org__Level_3__protein_normalization__data.Level_3.2016012800.0.0/BRCA.protein_exp__mda_rppa_core__mdanderson_org__Level_3__protein_normalization__data.data.txt",col_names = T,n_max = 1))
})

# compare both
round(firehose.protein[match(brca.protein$`Composite Element REF`,firehose.protein$`Sample REF`),colnames(brca.protein)[2:10]], digits = 3) == round(brca.protein[,2:10],digits = 3)
screen shot 2017-08-11 at 4 20 05 pm
ArtemSokolov commented 7 years ago

Thanks for all your pointers, Tiago. I really appreciate it.