BioinformaticsFMRP / TCGAbiolinks

TCGAbiolinks
http://bioconductor.org/packages/devel/bioc/vignettes/TCGAbiolinks/inst/doc/index.html

No clinical data added to TARGET-NBL #616

Open AndriesDeKoker opened 5 months ago

AndriesDeKoker commented 5 months ago

Performing this:

    query_methyl <- GDCquery(
        project = "TARGET-NBL",
        data.category = 'DNA Methylation',
        platform = 'Illumina Human Methylation 450',
        access = 'open',
        data.type = 'Methylation Beta Value'
    )
    GDCdownload(query_methyl)
    dna.meth <- GDCprepare(query_methyl, summarizedExperiment = TRUE)

All clinical data is gone: every value is NA. Is this a bug? Any workaround suggestions?
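One quick way to confirm the symptom is to measure the share of NA values in each column of the sample annotation. This is a minimal base R sketch, assuming the annotation has first been converted to a plain data frame (for a SummarizedExperiment, something like `as.data.frame(SummarizedExperiment::colData(dna.meth))`). The helper `na_fraction` and the toy table are hypothetical and only for illustration:

```r
# Hypothetical helper: fraction of NA values in each column of a data frame.
na_fraction <- function(df) {
  vapply(df, function(col) mean(is.na(col)), numeric(1))
}

# Toy stand-in for the sample annotation (made-up values, not real TARGET data).
toy_coldata <- data.frame(
  barcode = c("S1", "S2", "S3"),
  gender = c(NA, NA, NA),
  vital_status = c(NA, NA, NA),
  stringsAsFactors = FALSE
)

frac <- na_fraction(toy_coldata)
print(frac)

# Names of the columns that are NA for every sample.
all_na_cols <- names(frac)[frac == 1]
print(all_na_cols)
```

If every clinical column shows up in `all_na_cols` while the barcode column does not, the join between samples and clinical records is the likely place to look.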

AndriesDeKoker commented 5 months ago

Extra comment: when multiple TARGET projects are loaded together (e.g. with OS), the NBL one also causes the clinical data of OS to fail to be added. When TARGET-OS is loaded separately, there is no issue.

AndriesDeKoker commented 5 months ago

Not sure, but I was looking at the prepare.R file. Could it be related to the lines in this part?

`colDataPrepare <- function(barcode){

# For the moment this will work only for TCGA Data

# We should search what TARGET data means
message("Starting to add information to samples")
ret <- NULL

if(all(grepl("TARGET",barcode))) ret <- colDataPrepareTARGET(barcode)
if(all(grepl("TCGA",barcode))) ret <- colDataPrepareTCGA(barcode)
if(all(grepl("MMRF",barcode))) ret <- colDataPrepareMMRF(barcode)

# How to deal with mixed samples "C3N-02003-01;C3N-02003-021" ?
# Check if this breaks the package
if(any(grepl("C3N-|C3L-",barcode))) {
    ret <- data.frame(
        sample = map(barcode,.f = function(x) stringr::str_split(x,";") %>% unlist) %>% unlist()
    )
}

if(is.null(ret)) {
    ret <- data.frame(
        sample = barcode %>% unique,
        stringsAsFactors = FALSE
    )
}

message(" => Add clinical information to samples")
# There is a limitation on the size of the string, so this step will be splited in cases of 100
patient.info <- NULL

patient.info <- splitAPICall(
    FUN = getBarcodeInfo,
    step = 10,
    items = ret$sample
)

if(!is.null(patient.info)) {
    ret$sample_submitter_id <- ret$sample %>% as.character()
    ret <- left_join(ret %>% as.data.frame, patient.info %>% unique, by = "sample_submitter_id")
}
ret$bcr_patient_barcode <- ret$sample %>% as.character()
ret$sample_submitter_id <- ret$sample %>% as.character()
if(!"project_id" %in% colnames(ret)) {
    if("disease_type" %in% colnames(ret)){
        aux <- getGDCprojects()[,c(5,7)]
        aux <- aux[aux$disease_type == unique(ret$disease_type),2]
        ret$project_id <- as.character(aux)
    }
}
# There is no subtype info for target, return as it is
if(any(grepl("TCGA",barcode))) {
    ret <- addSubtypeInfo(ret)
}

# na.omit should not be here, exceptional case
if(is.null(ret)) {
    return(
        data.frame(
            row.names = barcode,
            barcode,
            stringsAsFactors = FALSE
        )
    )
}

# Add purity information from http://www.nature.com/articles/ncomms9971
# purity  <- getPurityinfo()
# ret <- merge(ret, purity, by = "sample", all.x = TRUE, sort = FALSE)

# Put data in the right order
ret <- ret[!duplicated(ret$bcr_patient_barcode),]

# This part might not work with multiple projects
idx <- sapply(
    X = substr(barcode,1,min(stringr::str_length(ret$bcr_patient_barcode))),
    FUN =  function(x) {
        grep(x,ret$bcr_patient_barcode)
    }
)
# the code above does not work, since the projects have different string lengths
if(all(na.omit(ret$project_id) %in% c("TARGET-ALL-P3","TARGET-AML"))) {
    idx <- sapply(gsub("-[[:alnum:]]{3}$","",barcode), function(x) {
        grep(x,ret$bcr_patient_barcode)
    })
}

if(any(ret$project_id == "CPTAC-3",na.rm = T)) {

    # only merge mixed samples
    mixed_samples <- grep(";",barcode,value = T)
    if(length(mixed_samples) > 0){
        mixed_samples <- mixed_samples %>% str_split(";") %>% unlist %>% unique

        ret_mixed_samples <- ret %>% dplyr::filter(sample_submitter_id %in% mixed_samples) %>%
            dplyr::group_by(submitter_id) %>%
            dplyr::summarise_all(~trimws(paste(unique(.), collapse = ';'))) %>%
            as.data.frame()
        ret <- rbind(ret_mixed_samples,ret)
    }
    idx <- match(barcode,ret$bcr_patient_barcode)

    #idx <- sapply(gsub("-[[:alnum:]]{3}$","",barcode), function(x) {
    #    if(grepl(";",x = x)) x <- stringr::str_split(x[1],";")[[1]][1] # mixed samples
    #    grep(x,ret$bcr_patient_barcode)
    #})

}

if(any(ret$project_id %in% c("CMI-MBC","TARGET-NBL"),na.rm = T)) {
    idx <- match(barcode,ret$bcr_patient_barcode)
}

if(is.list(idx)){
    stop(
        "Prepare will not be possible.
        \nIf you are trying to prepare more than
         one different project at a time, please do it separately"
    )
}

ret <- ret[idx,]

if("barcode" %in% colnames(ret)) ret$barcode <- barcode

rownames(ret) <- barcode
return(ret)

} `

AndriesDeKoker commented 5 months ago

Running `trace('colDataPrepare', edit = T)` and removing 'TARGET-NBL' from the line `if (any(ret$project_id %in% c("CMI-MBC", "TARGET-NBL"), na.rm = T)) {` does the trick.
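A toy example can illustrate why removing 'TARGET-NBL' from that condition might help. `match()` requires exact string equality, so if the barcodes passed in are longer than the IDs stored in `ret$bcr_patient_barcode`, every lookup returns NA and `ret[idx, ]` becomes a table of NA rows. That this is the actual cause for TARGET-NBL is an assumption, and the barcodes below are made up:

```r
# Made-up aliquot-level barcodes (assumption: longer than the stored IDs).
barcode <- c("TARGET-30-AAAAAA-01A-01D", "TARGET-30-BBBBBB-01A-01D")

# Toy annotation table keyed by sample-level ID, in a different order.
ret <- data.frame(
  bcr_patient_barcode = c("TARGET-30-BBBBBB-01A", "TARGET-30-AAAAAA-01A"),
  gender = c("female", "male"),
  stringsAsFactors = FALSE
)

# Exact matching, as in the CMI-MBC / TARGET-NBL branch: nothing matches.
idx_exact <- match(barcode, ret$bcr_patient_barcode)
print(idx_exact)

# Prefix matching, in the spirit of the default branch: truncate each barcode
# to the shortest ID length, then search for it (fixed = TRUE is used here to
# avoid regex interpretation; the quoted function calls grep() without it).
prefix <- substr(barcode, 1, min(nchar(ret$bcr_patient_barcode)))
idx_prefix <- vapply(
  prefix,
  function(x) grep(x, ret$bcr_patient_barcode, fixed = TRUE)[1],
  integer(1)
)
print(unname(idx_prefix))

# Indexing with NA yields NA rows; indexing with real positions does not.
print(ret[idx_exact, "gender"])
print(ret[idx_prefix, "gender"])
```

With the exact-match branch skipped, the earlier prefix-based `idx` is kept, which would explain why the edited function fills in the clinical columns again.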

tiagochst commented 5 months ago

Sorry, I don't have a lot of time to give support anymore. Yes, indeed that is where the issue happens. I just added a small fix.

I still need to test all TARGET-NBL data before the final solution. That if statement was added for a reason, but I am not sure yet which case required it.

