BioinformaticsFMRP / TCGAbiolinks

TCGAbiolinks
http://bioconductor.org/packages/devel/bioc/vignettes/TCGAbiolinks/inst/doc/index.html

subsetting SummarizedExperiment and add new colData #211

Open alsmnn opened 6 years ago

alsmnn commented 6 years ago

Hey TCGAbiolinks team, I don't know whether this is a SummarizedExperiment problem or a TCGAbiolinks problem, but I hope someone can help. I am having trouble subsetting a SummarizedExperiment and adding new colData. I want to create a column called surv_times that holds days_to_death for all deceased patients and days_to_last_follow_up for all patients who are still alive.

Here is my code:

library(TCGAbiolinks)
library(SummarizedExperiment)

query <- GDCquery(project = "TCGA-BLCA",
                  data.category = "Transcriptome Profiling",
                  data.type = "Gene Expression Quantification",
                  workflow.type = "HTSeq - Counts") 

GDCdownload(query,
            method = "api",
            files.per.chunk = 10,
            directory = "GDCdata")

TCGA_BLCA <- GDCprepare(query,
                        summarizedExperiment = TRUE)

notDead <- is.na(TCGA_BLCA$days_to_death)
Dead <- !is.na(TCGA_BLCA$days_to_death)

if (any(notDead == TRUE)) {
        colData(TCGA_BLCA[, notDead])$surv_times <- colData(TCGA_BLCA[, notDead])$days_to_last_follow_up

        colData(TCGA_BLCA[, Dead])$surv_times <- colData(TCGA_BLCA[, Dead])$days_to_death
}

I also tried:

if (any(notDead == TRUE)) {
        TCGA_BLCA[, notDead]$surv_times <- TCGA_BLCA[, notDead]$days_to_last_follow_up

        TCGA_BLCA[, Dead]$surv_times <- TCGA_BLCA[, Dead]$days_to_death
}

because

> identical(colData(TCGA_BLCA)$days_to_death, TCGA_BLCA$days_to_death)
[1] TRUE

Am I subsetting the object incorrectly, or am I missing something essential? The SummarizedExperiment documentation wasn't really helpful on that topic.

Thanks in advance and best regards from Hamburg, Germany

tiagochst commented 6 years ago

Sorry, but when you run identical(colData(TCGA_BLCA)$days_to_death, TCGA_BLCA$days_to_death) you are looking at the same column both times. Is that what you wanted to check?

tiagochst commented 6 years ago

You can also do it like this. Your code seems to be right.

notDead <- is.na(TCGA_BLCA$days_to_death)
Dead <- !is.na(TCGA_BLCA$days_to_death)
if (any(notDead == TRUE)) {
    TCGA_BLCA$surv_times <- NA
    TCGA_BLCA$surv_times[notDead] <- TCGA_BLCA$days_to_last_follow_up[notDead]
    TCGA_BLCA$surv_times[Dead] <- TCGA_BLCA$days_to_death[Dead]
}
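
The two masked assignments above can also be collapsed into a single vectorized ifelse() call. The following is only a minimal sketch on a toy data.frame: the name clin and all values are made up for illustration and do not come from TCGAbiolinks. Because `$` on a SummarizedExperiment reads and writes colData columns, the same expression should carry over to TCGA_BLCA$days_to_death and TCGA_BLCA$days_to_last_follow_up, but that has not been verified on the real object here.

```r
# Toy clinical table standing in for colData(TCGA_BLCA); values are made up.
clin <- data.frame(
  days_to_death          = c(120, NA, NA, 450),
  days_to_last_follow_up = c(NA, 300, NA, NA)
)

# Use days_to_death where it is recorded, otherwise days_to_last_follow_up.
# Rows where both values are missing stay NA.
clin$surv_times <- ifelse(is.na(clin$days_to_death),
                          clin$days_to_last_follow_up,
                          clin$days_to_death)

print(clin)
```

The third toy row has neither value, so its surv_times remains NA, which mirrors the cases with incomplete clinical data.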

tiagochst commented 6 years ago

Also, there are some cases that do not contain all clinical data, such as the one below:

[Screenshot, 2018-04-20: image not preserved]

alsmnn commented 6 years ago

Hey @tiagochst, with identical(colData(TCGA_BLCA)$days_to_death, TCGA_BLCA$days_to_death) I wanted to check whether it is okay to omit colData(...). Your example works flawlessly, thanks, even without TCGA_BLCA$surv_times <- NA. Thanks also for the hint about the missing clinical data for that one patient.

Best regards,

alsmnn commented 6 years ago

@tiagochst thanks for your help, but now I have another problem. I have a data frame with some ENSG identifiers and I want to subset my SummarizedExperiment with it. My gene list looks something like this:

>head(sig.all)
# A tibble: 6 x 2
  EntrezGeneID    HGCSymbol
  <chr>           <chr>    
1 ENSG00000115415 STAT1    
2 ENSG00000117228 GBP1     
3 ENSG00000154451 GBP5     
4 ENSG00000168811 IL12A    
5 ENSG00000138755 CXCL9    
6 ENSG00000225492 GBP1P1

My SummarizedExperiment is the same as above.

>TCGA_BLCA[sig.all$EntrezGeneID, ]
Error in .SummarizedExperiment.charbound(i, rownames(x), fmt) : 
  <DESeqDataSet>[i,] index out of bounds: ENSG00000109471 ENSG00000226025 ... ENSG00000211803 ENSG00000273024

or with:

>subsetByOverlaps(TCGA_BLCA, sig.all$EntrezGeneID)
Error in (function (classes, fdef, mtable)  : 
  unable to find an inherited method for function ‘subsetByOverlaps’ for signature ‘"DESeqDataSet", "character"’

Thanks in advance

tiagochst commented 6 years ago

You should be able to do it the way you are doing it. I'm not sure why your object is missing those genes.

[Two screenshots, 2018-04-23: images not preserved]

alsmnn commented 6 years ago

There were some genes missing, so I had to make a logical vector first and then subset the SE with the vector:

sig_TCGA_BLCA <- TCGA_BLCA[rownames(TCGA_BLCA) %in% sig.all$EntrezGeneID, ]
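
To see which identifiers caused the "index out of bounds" error before subsetting, the gene list can be compared against the row names. This is a rough sketch on a toy base R matrix; the names counts and wanted are hypothetical stand-ins for the real object and for sig.all$EntrezGeneID, and the counts are made up.

```r
# Toy count matrix standing in for the SummarizedExperiment; rows are Ensembl IDs.
counts <- matrix(1:6, nrow = 3,
                 dimnames = list(c("ENSG00000115415",
                                   "ENSG00000117228",
                                   "ENSG00000154451"),
                                 c("sample1", "sample2")))

# Hypothetical gene list; the last ID is deliberately absent from the matrix.
wanted <- c("ENSG00000115415", "ENSG00000154451", "ENSG00000225492")

# IDs that are not row names and would fail as a character index.
missing_ids <- setdiff(wanted, rownames(counts))

# Keep only the IDs that are present, in the order of the gene list.
present_ids <- intersect(wanted, rownames(counts))
subset_counts <- counts[present_ids, , drop = FALSE]

print(missing_ids)
print(dim(subset_counts))
```

Note the difference in row order: the %in% approach keeps the rows in the order of the object, while indexing by intersect(wanted, rownames(...)) returns them in the order of the gene list.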