BioinformaticsFMRP / TCGAbiolinks

TCGAbiolinks
http://bioconductor.org/packages/devel/bioc/vignettes/TCGAbiolinks/inst/doc/index.html
296 stars 112 forks source link

TCGAanalyze_SurvivalKM error: "solve.default(vv, temp2) : Lapack routine dgesv: system is exactly singular: U[1,1] = 0" #171

Closed rjg2186 closed 6 years ago

rjg2186 commented 6 years ago

Updated the current version using devtools::install_github(repo = "BioinformaticsFMRP/TCGAbiolinks")

The following error occurs with TCGAanalyze_SurvivalKM

solve.default(vv, temp2) :
 Lapack routine dgesv: system is exactly singular: U[1,1] = 0

The commands used are similar to the one in issue #170 , except the last part.

coad <- GDCquery(project = "TCGA-COAD",
                  data.category = "Transcriptome Profiling",
                  data.type = "Gene Expression Quantification",
                  workflow.type = "HTSeq - Counts",
                  experimental.strategy = "RNA-Seq")
GDCdownload(coad, method = "api")
clinical <- GDCquery_clinic(project = "TCGA-COAD", type = "clinical")
coad_data <- GDCprepare(coad)
coad_data<-coad_data[,  #!coad_data$is_ffpe]
coad_tp <- coad_data[, TCGAquery_SampleTypes(colnames(coad_data), typesample="TP")]
coad_nt <- coad_data[, TCGAquery_SampleTypes(colnames(coad_data), typesample="NT")]
coad_data_normalized <- TCGAanalyze_Normalization(coad_data[,c(colnames(coad_tp), colnames(coad_nt))],TCGAbiolinks::geneInfoHT)
coad_dataFilt <- TCGAanalyze_Filtering(tabDF = coad_data_normalized,
                                                     method = "quantile", 
                                                     qnt.cut =  0.25)
group1 <- TCGAquery_SampleTypes(colnames(coad_dataFilt), typesample = c("NT"))
group2 <- TCGAquery_SampleTypes(colnames(coad_dataFilt), typesample = c("TP"))

tokenStop<- 1

tabSurvKMcomplete <- NULL

for( i in 1: round(nrow(coad_dataFilt)/100)){
    message( paste( i, "of ", round(nrow(coad_dataFilt)/100)))
    tokenStart <- tokenStop
    tokenStop <-100*i
    tabSurvKM<-TCGAanalyze_SurvivalKM(clinical,
                                      coad_dataFilt,
                                      Genelist = rownames(coad_dataFilt)[tokenStart:tokenStop],
                                      Survresult = F,
                                      ThreshTop=0.67,
                                      ThreshDown=0.33,
                                      group1 = group1,
                                      group2 = group2)

    tabSurvKMcomplete <- rbind(tabSurvKMcomplete,tabSurvKM)
}
tiagochst commented 6 years ago

You should be able to run the last part as:

tabSurvKM<-TCGAanalyze_SurvivalKM(clinical,
                                  coad_dataFilt,
                                  Genelist = rownames(coad_dataFilt),
                                  Survresult = F,
                                  ThreshTop=0.67,
                                  ThreshDown=0.33,
                                  group1 = group1,
                                  group2 = group2)

where do you get the bug, as soon as you call the function ?

rjg2186 commented 6 years ago

@tiagochst The process starts with some log in the console as shown below

17425.17424.17423.17422.17421.17420.17419.17418.17417.17416.17415.17414.17413.17412.17411.17410.17409.17408.17407.17406.17405.17404.17403.17402.17401.17400.17399.17398.17397.17396.17395.17394.17393.17392.17391.17390.17389.17388.17387.17386.17385.17384.17383.17382.17381.17380.17379.17378.17377.17376.17375.17374.17373.17372.17371.17370.17369.17368.17367.17366.17365.17364.17363.17362.17361.17360.

It gets terminated in the middle with below error after some ~ 10 minutes

7386.7385.7384.7383.7382.7381.7380.7379.7378.7377.7376.7375.7374.Error in solve.default(vv, temp2) :
  Lapack routine dgesv: system is exactly singular: U[1,1] = 0
Jasonmbg commented 6 years ago

Dear Tiago, i also tried the above chunk code to test for my survival analysis in a similar way for seven selected genes:

hg38.coad <- GDCquery(project = "TCGA-COAD", 
data.category = "Transcriptome Profiling", 
data.type = "Gene Expression Quantification", 
workflow.type = "HTSeq - Counts",
experimental.strategy = "RNA-Seq")

GDCdownload(hg38.coad,files.per.chunk = 50)

clinical <- GDCquery_clinic(project = "TCGA-COAD", type = "clinical")

coad_data <- GDCprepare(query = hg38.coad,
save=TRUE,save.filename = "hg38.coad.updated.htseqcounts.rda")

coad_data<-coad_data[,  !coad_data$is_ffpe] 

coad_tp <- coad_data[, TCGAquery_SampleTypes(colnames(coad_data), typesample="TP")]

coad_nt <- coad_data[, TCGAquery_SampleTypes(colnames(coad_data), typesample="NT")]

coad_data_normalized <- TCGAanalyze_Normalization(coad_data[,c(colnames(coad_tp), colnames(coad_nt))],TCGAbiolinks::geneInfoHT)

library(clusterProfiler)

eg = as.data.frame(bitr(rownames(coad_data_normalized),
                        fromType="ENSEMBL",
                        toType="SYMBOL",
                        OrgDb="org.Hs.eg.db")) #drop=TRUE
eg <- eg[!duplicated(eg$SYMBOL),]

eg.ensembl <- as.character(eg$ENSEMBL)
dd <- match(eg.ensembl,rownames(coad_data_normalized))

dat.filt <-coad_data_normalized[dd,]
rownames(dat.filt) <- as.character(eg$SYMBOL)

dat.sel <- dat.filt[sel.genes,] # a vector of selected gene symbols for downstream survival analysis

group1 <- TCGAquery_SampleTypes(colnames(coad_dataFilt), typesample = c("NT"))
group2 <- TCGAquery_SampleTypes(colnames(coad_dataFilt), typesample = c("TP"))

tokenStop<- 1

tabSurvKMcomplete <- NULL

for( i in 1: round(nrow(dat.filt)/100)){ # dat,filt is a matrix with gene symbols in the rows 
    message( paste( i, "of ",round(nrow(dat.filt)/100)))
    tokenStart <- tokenStop
    tokenStop <-100*i
   tabSurvKM<-TCGAanalyze_SurvivalKM(clinical,
   dat.filt,Genelist = rownames(dat.filt),
    Survresult = F,ThreshTop=0.67,
    ThreshDown=0.33,
    group1 = group1,
    group2 = group2)

tabSurvKMcomplete <- rbind(tabSurvKMcomplete,tabSurvKM)
}

1 of 234 23365.23364.23363.23362.23361.23360.23359.23358.23357.23356.23355.23354.23353.23352.23351.23350.23349.23348.23347.23346.23345.23344.23343.23342.23341.23340.23339.23338.23337.23336.23335..............................16392. Show Traceback Error in solve.default(vv, temp2) : Lapack routine dgesv: system is exactly singular: U[1,1] = 0

Also the sessionInfo()

sessionInfo()
R version 3.4.2 (2017-09-28)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 7 x64 (build 7600)

Matrix products: default

locale:
[1] LC_COLLATE=Greek_Greece.1253  LC_CTYPE=Greek_Greece.1253   
[3] LC_MONETARY=Greek_Greece.1253 LC_NUMERIC=C                 
[5] LC_TIME=Greek_Greece.1253    

attached base packages:
[1] parallel  stats4    stats     graphics  grDevices utils    
[7] datasets  methods   base     

other attached packages:
[1] org.Hs.eg.db_3.5.0    AnnotationDbi_1.40.0 
[3] IRanges_2.12.0        S4Vectors_0.16.0     
[5] Biobase_2.38.0        BiocGenerics_0.24.0  
[7] clusterProfiler_3.6.0 DOSE_3.4.0           
[9] TCGAbiolinks_2.7.4   

Plus when i hit the traceback option in the error:

Error in solve.default(vv, temp2) : Lapack routine dgesv: system is exactly singular: U[1,1] = 0
4.
solve.default(vv, temp2)
3.
solve(vv, temp2)
2.
survdiff(ttime ~ c(rep("top", nrow(cfu_onlyTOP)), rep("down", nrow(cfu_onlyDOWN))))
1.
TCGAanalyze_SurvivalKM(clinical, dat.filt, Genelist = rownames(dat.filt), Survresult = F, ThreshTop = 0.67, ThreshDown = 0.33, group1 = group1, group2 = group2)
tiagochst commented 6 years ago

Could you test it again with the last version, I just commited the changes. devtools::install_github('BioinformaticsFMRP/TCGAbiolinks')

Jasonmbg commented 6 years ago

Dear Tiago, i installed the updated version, but i did not download again the samples, but used the saved rda object of the raw counts. My only changed part of the above code chunk from yesterday is a small update to the final dataset included in the survival analysis:

dat.sel <- dat.filt[sel.genes,] # a vector of selected gene symbols for downstream survival analysis

group1 <- TCGAquery_SampleTypes(colnames(dat.sel), typesample = c("NT"))
group2 <- TCGAquery_SampleTypes(colnames(dat.sel), typesample = c("TP"))

tokenStop<- 1

tabSurvKMcomplete <- NULL

for( i in 1: round(nrow(dat.sel)/100)){ 
    message( paste( i, "of ",round(nrow(dat.sel)/100)))
    tokenStart <- tokenStop
    tokenStop <-100*i
   tabSurvKM<-TCGAanalyze_SurvivalKM(clinical,
   dat.sel,Genelist = rownames(dat.sel), #7 genes included
    Survresult = F,ThreshTop=0.67,
    ThreshDown=0.33,
    group1 = group1,
    group2 = group2)

tabSurvKMcomplete <- rbind(tabSurvKMcomplete,tabSurvKM)
}

1 of  0
6.5.4.3.2.1.0.0 of  0
6.5.4.3.2.1.0.

But i also hit:
tabSurvKMcomplete # did not return anything
tabSurvKM
[1] pvalue                  Group2 Deaths          
[3] Group2 Deaths with Top  Group2 Deaths with Down
[5] Mean Group2 Top         Mean Group2 Down       
[7] Mean Group1            
<0 rows> (or 0-length row.names)

Thus, there is another problem here ? or any of the selected 7 genes did not show significant results regarding the survival analysis ?

Best,

Efstathios

Jasonmbg commented 6 years ago

Dear Tiago, an important update from my previous approach:

this time, i used a bigger vector of 94 selected gene symbols, and used the normalized dataset for survival analysis:

dim(dat.filt)
[1] 23366   506

group1 <- TCGAquery_SampleTypes(colnames(dat.filt), typesample = c("NT"))
group2 <- TCGAquery_SampleTypes(colnames(dat.filt), typesample = c("TP"))

sig.genes # a character vector of 94 gene symbols

tokenStop<- 1

tabSurvKMcomplete <- NULL

for( i in 1: round(nrow(dat.filt)/100)){
    message( paste( i, "of ",round(nrow(dat.filt)/100)))
    tokenStart <- tokenStop
    tokenStop <-100*i
    tabSurvKM<-TCGAanalyze_SurvivalKM(clinical,
                                      dat.filt,Genelist = sig.genes,
                                      Survresult = F,ThreshTop=0.67,
                                      ThreshDown=0.33,
                                      group1 = group1,
                                      group2 = group2)

    tabSurvKMcomplete <- rbind(tabSurvKMcomplete,tabSurvKM)
}
After running:

dim(tabSurvKMcomplete)
[1] 1170    7
> head(tabSurvKMcomplete)
             pvalue Group2 Deaths Group2 Deaths with Top
0        0.00000000             0                      0
EMG1     0.01371046            78                     28
KARS     0.01471880            73                     46
PPARGC1A 0.02269670            63                     43
ATR      0.04177329            62                     38
01       0.00000000             0                      0
         Group2 Deaths with Down Mean Group2 Top
0                              0        0.000000
EMG1                          50        3.583333
KARS                          27      136.487013
PPARGC1A                      20     4956.649351
ATR                           24      143.967105
01                             0        0.000000
         Mean Group2 Down  Mean Group1
0              0.00000000    0.0000000
EMG1           0.02797203    0.3902439
KARS           3.96815287   57.4634146
PPARGC1A     344.16883117 1737.1951220
ATR            4.06832298   13.9756098
01             0.00000000    0.0000000

tail(tabSurvKMcomplete)
                pvalue Group2 Deaths Group2 Deaths with Top
ATR232      0.04177329            62                     38
0233        0.00000000             0                      0
EMG1233     0.01371046            78                     28
KARS233     0.01471880            73                     46
PPARGC1A233 0.02269670            63                     43
ATR233      0.04177329            62                     38
            Group2 Deaths with Down Mean Group2 Top
ATR232                           24      143.967105
0233                              0        0.000000
EMG1233                          50        3.583333
KARS233                          27      136.487013
PPARGC1A233                      20     4956.649351
ATR233                           24      143.967105
            Mean Group2 Down  Mean Group1
ATR232            4.06832298   13.9756098
0233              0.00000000    0.0000000
EMG1233           0.02797203    0.3902439
KARS233           3.96815287   57.4634146
PPARGC1A233     344.16883117 1737.1951220
ATR233            4.06832298   13.9756098

but how can i interpret this ? For instance, what is the meaning of ATR233 ?

rjg2186 commented 6 years ago

@tiagochst Thanks for the update. Its working fine now.

rjg2186 commented 6 years ago

@Jasonmbg ATR233 and other identifiers in that column might be the gene symbol based on your input.

Jasonmbg commented 6 years ago

@rjg2186 thank you for your comment: i have a gene identifier ATR among the other included, but not ATR233-so that is why im confused !! as you can see with head(tabSurvKMcomplete), the ATR gene is also depicted, but without a number-also there is a zero-perhaps i misunderstood something here ?

rjg2186 commented 6 years ago

@Jasonmbg The numbers 233 and 232 seems to concatenated with most of the gene identifiers in tail(tabSurvKMcomplete). Probably, you check the input data again. Thanks

Jasonmbg commented 6 years ago

@rjg2186 Thank you for your suggestion. I have checked many times my input 94 genes vector and does not seems to have any problem, as i have used it in the past for other analyses. Also it seems weird that the above number might have anything to do with this, in the console while the function was running:

https://www.dropbox.com/s/f5z9anvyavlsil2/console.printing.png?dl=0

Also, the output of the output tabSurvKM from the function TCGAanalyze_SurvivalKM inside the loop:

tabSurvKM
             pvalue Group2 Deaths Group2 Deaths with Top
0        0.00000000             0                      0
EMG1     0.01371046            78                     28
KARS     0.01471880            73                     46
PPARGC1A 0.02269670            63                     43
ATR      0.04177329            62                     38
         Group2 Deaths with Down Mean Group2 Top
0                              0        0.000000
EMG1                          50        3.583333
KARS                          27      136.487013
PPARGC1A                      20     4956.649351
ATR                           24      143.967105
         Mean Group2 Down  Mean Group1
0              0.00000000    0.0000000
EMG1           0.02797203    0.3902439
KARS           3.96815287   57.4634146
PPARGC1A     344.16883117 1737.1951220
ATR            4.06832298   13.9756098

@tiagochst any ideas about this weird output ?

Best,

Efstathios-Iason

tiagochst commented 6 years ago

I'm checking the 0 line, but the output is subsetted to have only those with pvalue < 0.05, if you get 0 rows in the output that's because they have pvalue > 0.05. Also if you want all the results just set the pvalue cut-off to 1.

tiagochst commented 6 years ago

@Jasonmbg Please, could you send me your sig.genes object.

Jasonmbg commented 6 years ago

@tiagochst i have just forward them:

so, if i understood well, i can use alternatively just the part:

tabSurvKM<-TCGAanalyze_SurvivalKM(clinical,
dat.filt,Genelist = sig.genes,Survresult = F,ThreshTop=0.67,ThreshDown=0.33,group1 = group1,
group2 = group2)

and inspect the 4 genes which are included above, as significant for overall survival analysis, regarding the 2 groups of high vs low groups of cancer samples, based on the quantile thresholds ? and even this discrepancy consider them valid ?

Jasonmbg commented 6 years ago

@tiagochst you mean to first perform a simple filter with

coad_dataFilt <- TCGAanalyze_Filtering(tabDF = coad_data_normalized,method = "quantile",qnt.cut = 0.25) ?

and then intersect these genes that passed the filter with the 94 genes ? and continue with the common ? in order to avoid the inclusion of the above row with 0 in the tabSurvKM output above ?

tiagochst commented 6 years ago

It is not required to use TCGAanalyze_Filtering function, this would just reduce the number of genes you would evaluate in the TCGAanalyze_SurvivalKM function. I was just wondering how you had a gene with values 0 for all samples.

Jasonmbg commented 6 years ago

Yes, you are right-it is indeed weird-you think that it might have to do anything with the normalization process ? or something else ?-i mentioned filtering because after the above function implementation, 4 of the 94 genes are excluded-

Nevertheless, you think at least with the above procedure, the results for the 4 significant genes are still valid ?

Finally, in order not to disturb you further-is there a way to visualize the Kaplan Meier with these 4 genes, to inspect which are "favorable" or "unfavorable" consering the survival probability ? with the scope of better explanation except the output of the above function ?

Best,

Efstathios

tiagochst commented 6 years ago

If you set Survresult = T it makes an R plot.

TCGAanalyze_SurvivalKM(clinical,
                       dat.filt,
                       Genelist = rownames(tabSurvKM),
                       Survresult = T,
                       ThreshTop  = 0.67,
                       ThreshDown = 0.33,
                       group1 = group1,
                       group2 = group2)
Jasonmbg commented 6 years ago

@tiagochst thank you for your notification-actually i forward another screenshot from my Rstudio session, as the images are not shown usually in the figure slot, but below the code chunk (and also i can't save them)-nevertheless, my final crusial question to close this discussion, is the following:

https://www.dropbox.com/s/uwivrddl8v1byyz/example.survivalplot.Rstudio.jpg?dl=0

Thus, in the above image shown, essentially with TCGAanalyze_SurvivalKM function the overall survival is illustrated, correct ? as also the high and low expressed groups of samples, are the cancer samples which were defined from the cutoffs, but in the total samples, right ?

Best, Efstathios