Closed rjg2186 closed 6 years ago
You should be able to run the last part as:
tabSurvKM<-TCGAanalyze_SurvivalKM(clinical,
coad_dataFilt,
Genelist = rownames(coad_dataFilt),
Survresult = F,
ThreshTop=0.67,
ThreshDown=0.33,
group1 = group1,
group2 = group2)
where do you get the bug, as soon as you call the function ?
@tiagochst The process starts with some log in the console as shown below
17425.17424.17423.17422.17421.17420.17419.17418.17417.17416.17415.17414.17413.17412.17411.17410.17409.17408.17407.17406.17405.17404.17403.17402.17401.17400.17399.17398.17397.17396.17395.17394.17393.17392.17391.17390.17389.17388.17387.17386.17385.17384.17383.17382.17381.17380.17379.17378.17377.17376.17375.17374.17373.17372.17371.17370.17369.17368.17367.17366.17365.17364.17363.17362.17361.17360
.
It gets terminated in the middle with below error after some ~ 10 minutes
7386.7385.7384.7383.7382.7381.7380.7379.7378.7377.7376.7375.7374.Error in solve.default(vv, temp2) :
Lapack routine dgesv: system is exactly singular: U[1,1] = 0
Dear Tiago, i also tried the above chunk code to test for my survival analysis in a similar way for seven selected genes:
hg38.coad <- GDCquery(project = "TCGA-COAD",
data.category = "Transcriptome Profiling",
data.type = "Gene Expression Quantification",
workflow.type = "HTSeq - Counts",
experimental.strategy = "RNA-Seq")
GDCdownload(hg38.coad,files.per.chunk = 50)
clinical <- GDCquery_clinic(project = "TCGA-COAD", type = "clinical")
coad_data <- GDCprepare(query = hg38.coad,
save=TRUE,save.filename = "hg38.coad.updated.htseqcounts.rda")
coad_data<-coad_data[, !coad_data$is_ffpe]
coad_tp <- coad_data[, TCGAquery_SampleTypes(colnames(coad_data), typesample="TP")]
coad_nt <- coad_data[, TCGAquery_SampleTypes(colnames(coad_data), typesample="NT")]
coad_data_normalized <- TCGAanalyze_Normalization(coad_data[,c(colnames(coad_tp), colnames(coad_nt))],TCGAbiolinks::geneInfoHT)
library(clusterProfiler)
eg = as.data.frame(bitr(rownames(coad_data_normalized),
fromType="ENSEMBL",
toType="SYMBOL",
OrgDb="org.Hs.eg.db")) #drop=TRUE
eg <- eg[!duplicated(eg$SYMBOL),]
eg.ensembl <- as.character(eg$ENSEMBL)
dd <- match(eg.ensembl,rownames(coad_data_normalized))
dat.filt <-coad_data_normalized[dd,]
rownames(dat.filt) <- as.character(eg$SYMBOL)
dat.sel <- dat.filt[sel.genes,] # a vector of selected gene symbols for downstream survival analysis
group1 <- TCGAquery_SampleTypes(colnames(coad_dataFilt), typesample = c("NT"))
group2 <- TCGAquery_SampleTypes(colnames(coad_dataFilt), typesample = c("TP"))
tokenStop<- 1
tabSurvKMcomplete <- NULL
for( i in 1: round(nrow(dat.filt)/100)){ # dat,filt is a matrix with gene symbols in the rows
message( paste( i, "of ",round(nrow(dat.filt)/100)))
tokenStart <- tokenStop
tokenStop <-100*i
tabSurvKM<-TCGAanalyze_SurvivalKM(clinical,
dat.filt,Genelist = rownames(dat.filt),
Survresult = F,ThreshTop=0.67,
ThreshDown=0.33,
group1 = group1,
group2 = group2)
tabSurvKMcomplete <- rbind(tabSurvKMcomplete,tabSurvKM)
}
1 of 234 23365.23364.23363.23362.23361.23360.23359.23358.23357.23356.23355.23354.23353.23352.23351.23350.23349.23348.23347.23346.23345.23344.23343.23342.23341.23340.23339.23338.23337.23336.23335..............................16392. Show Traceback Error in solve.default(vv, temp2) : Lapack routine dgesv: system is exactly singular: U[1,1] = 0
Also the sessionInfo()
sessionInfo()
R version 3.4.2 (2017-09-28)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 7 x64 (build 7600)
Matrix products: default
locale:
[1] LC_COLLATE=Greek_Greece.1253 LC_CTYPE=Greek_Greece.1253
[3] LC_MONETARY=Greek_Greece.1253 LC_NUMERIC=C
[5] LC_TIME=Greek_Greece.1253
attached base packages:
[1] parallel stats4 stats graphics grDevices utils
[7] datasets methods base
other attached packages:
[1] org.Hs.eg.db_3.5.0 AnnotationDbi_1.40.0
[3] IRanges_2.12.0 S4Vectors_0.16.0
[5] Biobase_2.38.0 BiocGenerics_0.24.0
[7] clusterProfiler_3.6.0 DOSE_3.4.0
[9] TCGAbiolinks_2.7.4
Plus when i hit the traceback option in the error:
Error in solve.default(vv, temp2) : Lapack routine dgesv: system is exactly singular: U[1,1] = 0
4.
solve.default(vv, temp2)
3.
solve(vv, temp2)
2.
survdiff(ttime ~ c(rep("top", nrow(cfu_onlyTOP)), rep("down", nrow(cfu_onlyDOWN))))
1.
TCGAanalyze_SurvivalKM(clinical, dat.filt, Genelist = rownames(dat.filt), Survresult = F, ThreshTop = 0.67, ThreshDown = 0.33, group1 = group1, group2 = group2)
Could you test it again with the last version, I just commited the changes.
devtools::install_github('BioinformaticsFMRP/TCGAbiolinks')
Dear Tiago, i installed the updated version, but i did not download again the samples, but used the saved rda object of the raw counts. My only changed part of the above code chunk from yesterday is a small update to the final dataset included in the survival analysis:
dat.sel <- dat.filt[sel.genes,] # a vector of selected gene symbols for downstream survival analysis
group1 <- TCGAquery_SampleTypes(colnames(dat.sel), typesample = c("NT"))
group2 <- TCGAquery_SampleTypes(colnames(dat.sel), typesample = c("TP"))
tokenStop<- 1
tabSurvKMcomplete <- NULL
for( i in 1: round(nrow(dat.sel)/100)){
message( paste( i, "of ",round(nrow(dat.sel)/100)))
tokenStart <- tokenStop
tokenStop <-100*i
tabSurvKM<-TCGAanalyze_SurvivalKM(clinical,
dat.sel,Genelist = rownames(dat.sel), #7 genes included
Survresult = F,ThreshTop=0.67,
ThreshDown=0.33,
group1 = group1,
group2 = group2)
tabSurvKMcomplete <- rbind(tabSurvKMcomplete,tabSurvKM)
}
1 of 0
6.5.4.3.2.1.0.0 of 0
6.5.4.3.2.1.0.
But i also hit:
tabSurvKMcomplete # did not return anything
tabSurvKM
[1] pvalue Group2 Deaths
[3] Group2 Deaths with Top Group2 Deaths with Down
[5] Mean Group2 Top Mean Group2 Down
[7] Mean Group1
<0 rows> (or 0-length row.names)
Thus, there is another problem here ? or any of the selected 7 genes did not show significant results regarding the survival analysis ?
Best,
Efstathios
Dear Tiago, an important update from my previous approach:
this time, i used a bigger vector of 94 selected gene symbols, and used the normalized dataset for survival analysis:
dim(dat.filt)
[1] 23366 506
group1 <- TCGAquery_SampleTypes(colnames(dat.filt), typesample = c("NT"))
group2 <- TCGAquery_SampleTypes(colnames(dat.filt), typesample = c("TP"))
sig.genes # a character vector of 94 gene symbols
tokenStop<- 1
tabSurvKMcomplete <- NULL
for( i in 1: round(nrow(dat.filt)/100)){
message( paste( i, "of ",round(nrow(dat.filt)/100)))
tokenStart <- tokenStop
tokenStop <-100*i
tabSurvKM<-TCGAanalyze_SurvivalKM(clinical,
dat.filt,Genelist = sig.genes,
Survresult = F,ThreshTop=0.67,
ThreshDown=0.33,
group1 = group1,
group2 = group2)
tabSurvKMcomplete <- rbind(tabSurvKMcomplete,tabSurvKM)
}
After running:
dim(tabSurvKMcomplete)
[1] 1170 7
> head(tabSurvKMcomplete)
pvalue Group2 Deaths Group2 Deaths with Top
0 0.00000000 0 0
EMG1 0.01371046 78 28
KARS 0.01471880 73 46
PPARGC1A 0.02269670 63 43
ATR 0.04177329 62 38
01 0.00000000 0 0
Group2 Deaths with Down Mean Group2 Top
0 0 0.000000
EMG1 50 3.583333
KARS 27 136.487013
PPARGC1A 20 4956.649351
ATR 24 143.967105
01 0 0.000000
Mean Group2 Down Mean Group1
0 0.00000000 0.0000000
EMG1 0.02797203 0.3902439
KARS 3.96815287 57.4634146
PPARGC1A 344.16883117 1737.1951220
ATR 4.06832298 13.9756098
01 0.00000000 0.0000000
tail(tabSurvKMcomplete)
pvalue Group2 Deaths Group2 Deaths with Top
ATR232 0.04177329 62 38
0233 0.00000000 0 0
EMG1233 0.01371046 78 28
KARS233 0.01471880 73 46
PPARGC1A233 0.02269670 63 43
ATR233 0.04177329 62 38
Group2 Deaths with Down Mean Group2 Top
ATR232 24 143.967105
0233 0 0.000000
EMG1233 50 3.583333
KARS233 27 136.487013
PPARGC1A233 20 4956.649351
ATR233 24 143.967105
Mean Group2 Down Mean Group1
ATR232 4.06832298 13.9756098
0233 0.00000000 0.0000000
EMG1233 0.02797203 0.3902439
KARS233 3.96815287 57.4634146
PPARGC1A233 344.16883117 1737.1951220
ATR233 4.06832298 13.9756098
but how can i interpret this ? For instance, what is the meaning of ATR233 ?
@tiagochst Thanks for the update. Its working fine now.
@Jasonmbg ATR233 and other identifiers in that column might be the gene symbol based on your input.
@rjg2186 thank you for your comment: i have a gene identifier ATR among the other included, but not ATR233-so that is why im confused !! as you can see with head(tabSurvKMcomplete), the ATR gene is also depicted, but without a number-also there is a zero-perhaps i misunderstood something here ?
@Jasonmbg The numbers 233 and 232 seems to concatenated with most of the gene identifiers in tail(tabSurvKMcomplete). Probably, you check the input data again. Thanks
@rjg2186 Thank you for your suggestion. I have checked many times my input 94 genes vector and does not seems to have any problem, as i have used it in the past for other analyses. Also it seems weird that the above number might have anything to do with this, in the console while the function was running:
https://www.dropbox.com/s/f5z9anvyavlsil2/console.printing.png?dl=0
Also, the output of the output tabSurvKM from the function TCGAanalyze_SurvivalKM inside the loop:
tabSurvKM
pvalue Group2 Deaths Group2 Deaths with Top
0 0.00000000 0 0
EMG1 0.01371046 78 28
KARS 0.01471880 73 46
PPARGC1A 0.02269670 63 43
ATR 0.04177329 62 38
Group2 Deaths with Down Mean Group2 Top
0 0 0.000000
EMG1 50 3.583333
KARS 27 136.487013
PPARGC1A 20 4956.649351
ATR 24 143.967105
Mean Group2 Down Mean Group1
0 0.00000000 0.0000000
EMG1 0.02797203 0.3902439
KARS 3.96815287 57.4634146
PPARGC1A 344.16883117 1737.1951220
ATR 4.06832298 13.9756098
@tiagochst any ideas about this weird output ?
Best,
Efstathios-Iason
I'm checking the 0 line, but the output is subsetted to have only those with pvalue < 0.05, if you get 0 rows in the output that's because they have pvalue > 0.05. Also if you want all the results just set the pvalue cut-off to 1.
@Jasonmbg Please, could you send me your sig.genes object.
@tiagochst i have just forward them:
so, if i understood well, i can use alternatively just the part:
tabSurvKM<-TCGAanalyze_SurvivalKM(clinical,
dat.filt,Genelist = sig.genes,Survresult = F,ThreshTop=0.67,ThreshDown=0.33,group1 = group1,
group2 = group2)
and inspect the 4 genes which are included above, as significant for overall survival analysis, regarding the 2 groups of high vs low groups of cancer samples, based on the quantile thresholds ? and even this discrepancy consider them valid ?
@tiagochst you mean to first perform a simple filter with
coad_dataFilt <- TCGAanalyze_Filtering(tabDF = coad_data_normalized,method = "quantile",qnt.cut = 0.25) ?
and then intersect these genes that passed the filter with the 94 genes ? and continue with the common ? in order to avoid the inclusion of the above row with 0 in the tabSurvKM output above ?
It is not required to use TCGAanalyze_Filtering function, this would just reduce the number of genes you would evaluate in the TCGAanalyze_SurvivalKM function. I was just wondering how you had a gene with values 0 for all samples.
Yes, you are right-it is indeed weird-you think that it might have to do anything with the normalization process ? or something else ?-i mentioned filtering because after the above function implementation, 4 of the 94 genes are excluded-
Nevertheless, you think at least with the above procedure, the results for the 4 significant genes are still valid ?
Finally, in order not to disturb you further-is there a way to visualize the Kaplan Meier with these 4 genes, to inspect which are "favorable" or "unfavorable" consering the survival probability ? with the scope of better explanation except the output of the above function ?
Best,
Efstathios
If you set Survresult = T
it makes an R plot.
TCGAanalyze_SurvivalKM(clinical,
dat.filt,
Genelist = rownames(tabSurvKM),
Survresult = T,
ThreshTop = 0.67,
ThreshDown = 0.33,
group1 = group1,
group2 = group2)
@tiagochst thank you for your notification-actually i forward another screenshot from my Rstudio session, as the images are not shown usually in the figure slot, but below the code chunk (and also i can't save them)-nevertheless, my final crusial question to close this discussion, is the following:
https://www.dropbox.com/s/uwivrddl8v1byyz/example.survivalplot.Rstudio.jpg?dl=0
Thus, in the above image shown, essentially with TCGAanalyze_SurvivalKM function the overall survival is illustrated, correct ? as also the high and low expressed groups of samples, are the cancer samples which were defined from the cutoffs, but in the total samples, right ?
Best, Efstathios
Updated the current version using
devtools::install_github(repo = "BioinformaticsFMRP/TCGAbiolinks")
The following error occurs with TCGAanalyze_SurvivalKM
The commands used are similar to the one in issue #170 , except the last part.