BioinformaticsFMRP / TCGAbiolinks

TCGAbiolinks
http://bioconductor.org/packages/devel/bioc/vignettes/TCGAbiolinks/inst/doc/index.html
290 stars 111 forks source link

Fix duplicates issue when querying expression data #631

Open ShixiangWang opened 3 months ago

ShixiangWang commented 3 months ago

I was trying to obtain STAR gene expression data from multiple datasets including TCGA, TARGET, and CPTAC. Some errors happened when preparing some projects, and they are mainly related to duplicated records, especially in the CPTAC-3. I tested the code and made a workaround. This PR also amended some data checks to make sure the data preparation process went smoothly.

o Preparing output
-------------------
Downloading data for project CPTAC-3
Of the 2340 files for download 2340 already exist.
All samples have been already downloaded
Removing duplicated cases (with older updated time)
 => 41 records removed

Code to query the data.

library(TCGAbiolinks)
#devtools::load_all("/Volumes/EPan/Repos/TCGAbiolinks/")
#install.packages("/Volumes/EPan/Repos/TCGAbiolinks/", repos = NULL, type = "source")

prjs = TCGAbiolinks::getGDCprojects()
tcga_prjs = prjs$project_id[startsWith(prjs$project_id, "TCGA")]
targ_prjs = prjs$project_id[startsWith(prjs$project_id, "TARGET")]
cpta_prjs = prjs$project_id[startsWith(prjs$project_id, "CPTAC")]

# mRNA pipeline: https://gdc-docs.nci.nih.gov/Data/Bioinformatics_Pipelines/Expression_mRNA_Pipeline/

dir.create("data")
dir.create("data/count")
dir.create("data/tpm")

# GDC data download and prepare ------------------------------------------------

for (i in c(tcga_prjs, targ_prjs, cpta_prjs)) {
  message("Handling ", i)
  # if (file.exists(glue::glue("data/tpm/{i}.exp.tpm.rds"))) {
  #   message("handled already, skipping...")
  #   next()
  # }
  query.exp.hg38 <- GDCquery(
    project = i,
    data.category = "Transcriptome Profiling",
    data.type = "Gene Expression Quantification",
    workflow.type = "STAR - Counts"
  )

  GDCdownload(query.exp.hg38)
  #debug(GDCprepare)
  #debug(readTranscriptomeProfiling)
  #debug(colDataPrepare)
  expdat <- GDCprepare(
    query = query.exp.hg38
  )
  gc()

  message("saving...")
  saveRDS(SummarizedExperiment::assays(expdat)[["unstranded"]], file = glue::glue("data/count/{i}.exp.count.rds"))
  saveRDS(SummarizedExperiment::assays(expdat)[["tpm_unstrand"]], file = glue::glue("data/tpm/{i}.exp.tpm.rds"))
  message("done")
  gc()
}