BioinformaticsFMRP / TCGAbiolinks

TCGAbiolinks
http://bioconductor.org/packages/devel/bioc/vignettes/TCGAbiolinks/inst/doc/index.html

Stuck in the "getclinical" loop #193

Open ycl6 opened 6 years ago

ycl6 commented 6 years ago

Hi

I am using the code below to retrieve clinical information with TCGAbiolinks. However, for project TCGA-LAML the process enters an endless loop when no information is found.

library(TCGAbiolinks)

getclinical <- function(proj){
        message(proj)
        while(1){
                result = tryCatch({
                        query <- GDCquery(project = proj, data.category = "Clinical")
                        GDCdownload(query)
                        clinical <- GDCprepare_clinic(query, clinical.info = "patient")
                        for(i in c("admin","radiation","follow_up","drug","new_tumor_event")){
                                message(i)
                                aux <- GDCprepare_clinic(query, clinical.info = i)
                                if(is.null(aux)) next
                                # add suffix manually if it already exists
                                replicated <- which(grep("bcr_patient_barcode",colnames(aux), value = T,invert = T) %in% colnames(clinical))
                                colnames(aux)[replicated] <- paste0(colnames(aux)[replicated],".",i)
                                if(!is.null(aux)) clinical <- merge(clinical,aux,by = "bcr_patient_barcode", all = TRUE)
                        }
                        readr::write_csv(clinical,path = paste0(proj,"_clinical_from_XML.csv")) # Save the clinical data into a csv file
                        return(clinical)
                }, error = function(e) {
                        message(paste0("Error Clinical: ", proj))
                })
        }
}

clinical <- TCGAbiolinks:::getGDCprojects()$project_id %>% regexPipes::grep("TCGA",value=T) %>% sort %>%
plyr::alply(1,getclinical, .progress = "text") %>% rbindlist(fill = TRUE) %>% setDF %>% subset(!duplicated(clinical))

Output message:

--------------------------------------
o GDCquery: Searching in GDC database
--------------------------------------
Genome of reference: hg38
--------------------------------------------
oo Accessing GDC. This might take a while...
--------------------------------------------
ooo Project: TCGA-LAML
--------------------
oo Filtering results
--------------------
----------------
oo Checking data
----------------
ooo Check if there are duplicated cases
ooo Check if there results for the query
-------------------
o Preparing output
-------------------
Downloading data for project TCGA-LAML
Of the 200 files for download 200 already exist.
All samples have been already downloaded
  |====================================================================================================================================================================================| 100%
To get the following information please change the clinical.info argument
=> new_tumor_events: new_tumor_event
=> drugs: drug
=> follow_ups: follow_up
=> radiations: radiation
admin
  |====================================================================================================================================================================================| 100%
radiation
  |                                                                                                                                                                                    |   0%
No information found
follow_up
Error clinical: TCGA-LAML
... (the same output repeats in a loop) ...
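The endless repetition follows from the structure of the function above: the `error` handler only prints a message, so `while(1)` immediately re-runs the same call, and an error that recurs on every attempt never lets the loop exit. A minimal sketch of a bounded retry is shown here. The names `retry_bounded`, `max_tries`, and `counting_fail` are hypothetical and are not part of TCGAbiolinks; the failing function is only a stand-in for the download and prepare steps.

```r
# Hypothetical helper (not part of TCGAbiolinks): call `fetch` at most
# `max_tries` times instead of looping forever on a persistent error.
retry_bounded <- function(fetch, max_tries = 3) {
  for (attempt in seq_len(max_tries)) {
    result <- tryCatch(
      list(ok = TRUE, value = fetch()),
      error = function(e) {
        message("Attempt ", attempt, " failed: ", conditionMessage(e))
        list(ok = FALSE, value = NULL)
      }
    )
    # Leave the loop as soon as one attempt succeeds
    if (result$ok) return(result$value)
  }
  warning("Giving up after ", max_tries, " attempts")
  NULL
}

# Stand-in for a step that fails on every call (assumption for illustration)
calls <- 0
counting_fail <- function() {
  calls <<- calls + 1
  stop("no information found")
}

out <- suppressWarnings(retry_bounded(counting_fail, max_tries = 3))
ok_value <- retry_bounded(function() 42)
```

With a cap like this, a transient network error is still retried a few times, while a deterministic failure ends with `NULL` and a warning rather than an endless loop.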
torongs82 commented 6 years ago

Hi @ycl6, nice code, and thank you for using our tool. It seems you were requesting clinical data and radiation information for LAML (Acute Myeloid Leukemia). To my knowledge, there is no radiation therapy for this liquid tumor, unlike the other 32 solid tumors, for which you did find that information.

@tiagochst, when you have time, could you handle this exception? Thanks.

ycl6 commented 6 years ago

@torongs82

The code that I used was taken from the vignettes :)

I investigated a little. In most cases, when there is no auxiliary information, nothing is returned, i.e. is.null(aux) == TRUE. For TCGA-LAML, however, the call does return something: aux is an empty data.frame, so the is.null check does not catch it.

So I changed

if(is.null(aux)) next

To

if(is.null(aux) || (is.data.frame(aux) && nrow(aux)==0)) next

This solved the loop problem.
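The check above can be wrapped in a small predicate so the two "nothing to merge" cases are handled in one place. This is only a sketch: `is_empty_result` is a hypothetical name, not a TCGAbiolinks function, and the barcode in the last call is a made-up placeholder value.

```r
# Hypothetical helper: TRUE when `aux` holds nothing worth merging,
# i.e. it is NULL or a data.frame with zero rows.
is_empty_result <- function(aux) {
  # `||` and `&&` short-circuit and return a single logical, which `if` expects
  is.null(aux) || (is.data.frame(aux) && nrow(aux) == 0)
}

empty_null <- is_empty_result(NULL)          # TRUE
empty_df   <- is_empty_result(data.frame())  # TRUE
filled_df  <- is_empty_result(
  data.frame(bcr_patient_barcode = "TCGA-XX-0001")  # placeholder barcode
)                                            # FALSE
```

Inside the loop of the original function this would read `if (is_empty_result(aux)) next`.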