BioinformaticsFMRP / TCGAbiolinks

TCGAbiolinks
http://bioconductor.org/packages/devel/bioc/vignettes/TCGAbiolinks/inst/doc/index.html

TCGAanalyze_SurvivalKM function double counting #528

Open animesh opened 2 years ago

animesh commented 2 years ago

I am using MMRF-COMMPASS data downloaded with https://github.com/marziasettino/MMRFBiolinks to run a survival analysis for ENSG00000196449, and it looks like subjects are being double counted.

The number of entries drops between the sample vectors and the cfu data frames:

> print(c(length(samples_top_mRNA_selected),length(samples_down_mRNA_selected)))
[1] 206 284
> print(c(nrow(cfu_onlyTOP),nrow(cfu_onlyDOWN)))
[1] 196 261

This is probably because some subjects are repeated in the data column. However, when the log-rank test is computed:

> survival::survdiff(ttime ~ c(rep("top",nrow(cfu_onlyTOP)), rep("down", nrow(cfu_onlyDOWN))))
Call:
survival::survdiff(formula = ttime ~ c(rep("top", nrow(cfu_onlyTOP)), 
    rep("down", nrow(cfu_onlyDOWN))))

                                                                         N
c(rep("top", nrow(cfu_onlyTOP)), rep("down", nrow(cfu_onlyDOWN)))=down 261
c(rep("top", nrow(cfu_onlyTOP)), rep("down", nrow(cfu_onlyDOWN)))=top  196
                                                                       Observed
c(rep("top", nrow(cfu_onlyTOP)), rep("down", nrow(cfu_onlyDOWN)))=down       47
c(rep("top", nrow(cfu_onlyTOP)), rep("down", nrow(cfu_onlyDOWN)))=top        54
                                                                       Expected
c(rep("top", nrow(cfu_onlyTOP)), rep("down", nrow(cfu_onlyDOWN)))=down     60.6
c(rep("top", nrow(cfu_onlyTOP)), rep("down", nrow(cfu_onlyDOWN)))=top      40.4
                                                                       (O-E)^2/E
c(rep("top", nrow(cfu_onlyTOP)), rep("down", nrow(cfu_onlyDOWN)))=down      3.06
c(rep("top", nrow(cfu_onlyTOP)), rep("down", nrow(cfu_onlyDOWN)))=top       4.59
                                                                       (O-E)^2/V
c(rep("top", nrow(cfu_onlyTOP)), rep("down", nrow(cfu_onlyDOWN)))=down      7.68
c(rep("top", nrow(cfu_onlyTOP)), rep("down", nrow(cfu_onlyDOWN)))=top       7.68

 Chisq= 7.7  on 1 degrees of freedom, p= 0.006 

The expected values seem to be derived from the original sample size, 206 + 284 = 490, with 47 + 54 = 101 observed events in total: (101/490) * 196 ≈ 40.4 and (101/490) * (490 - 196) ≈ 60.6. Why is this double counting allowed here? In general, what happens when a subject occurs multiple times, especially when it occurs in both the UP and DOWN groups?
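The arithmetic above can be reproduced with a small base R sketch. The counts are copied from the output in this post. `naive_expected` and `find_repeated_subjects` are hypothetical helpers written for illustration, not part of TCGAbiolinks or MMRFBiolinks, and the subject IDs are toy values. Note that `survdiff` accumulates expected counts over the risk set at each event time, so this proportional split is only a rough check and the agreement with 490 could be coincidental.

```r
# Counts copied from the console output in this post.
n_selected <- c(top = 206, down = 284)  # lengths of the two sample vectors
n_used     <- c(top = 196, down = 261)  # rows of cfu_onlyTOP / cfu_onlyDOWN
events     <- 47 + 54                   # total observed events (down + top)

# Hypothetical helper: split the events in proportion to group size.
# This is a rough approximation, not the log-rank computation itself.
naive_expected <- function(n_group, n_total, events) {
  events * n_group / n_total
}

# Denominator 490 (original sample vectors), as suspected in the question.
exp_490 <- round(c(
  top  = naive_expected(196, sum(n_selected), events),
  down = naive_expected(sum(n_selected) - 196, sum(n_selected), events)
), 1)

# Denominator 457 (rows actually passed to survdiff), for comparison.
exp_457 <- round(naive_expected(n_used, sum(n_used), events), 1)

print(exp_490)  # top 40.4, down 60.6
print(exp_457)  # top 43.3, down 57.7

# Hypothetical helper: list subjects repeated within a group or present in both.
find_repeated_subjects <- function(top_ids, down_ids) {
  list(
    dup_in_top  = unique(top_ids[duplicated(top_ids)]),
    dup_in_down = unique(down_ids[duplicated(down_ids)]),
    in_both     = intersect(top_ids, down_ids)
  )
}

# Toy subject IDs (assumed for illustration, not real MMRF-COMMPASS barcodes).
rep_check <- find_repeated_subjects(
  top_ids  = c("S1", "S2", "S2", "S3"),
  down_ids = c("S3", "S4", "S4")
)
print(rep_check)
```

On the real data, the same check could be run on the subject identifiers behind `samples_top_mRNA_selected` and `samples_down_mRNA_selected`, assuming a subject ID can be derived from each sample barcode.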