JuliaStats / Survival.jl

Survival analysis in Julia

Why/how are events dropped which occur at the same time? #56

Closed: itsdfish closed this issue 10 months ago

itsdfish commented 10 months ago

Hello,

I noticed that Julia and R differ on an example from the MICE documentation. Initially I thought this was related to #50, but upon closer inspection I noticed that Julia removes duplicate times (R and Julia agree when the duplicates are removed). It might be a good idea to explain this in the documentation, or perhaps to not remove duplicates and return a warning instead. I'm not sure what the best approach is, or whether there is a single best one, but I think documenting the difference from R would be helpful to users.

Example

R Code

require(MASS)  # provides the leuk dataset
require(mice)  # provides nelsonaalen()

leuk$status <- 1  ## no censoring occurs in leuk data (MASS)
ch <- nelsonaalen(leuk, time, status)
plot(x = leuk$time, y = ch, ylab = "Cumulative hazard", xlab = "Time")
leuk$rchaz <- ch
write.csv(leuk, "leuk.csv")

Julia Code

using CSV
using DataFrames
using Plots
using Survival

# Data exported from the R code above
df = CSV.read("leuk.csv", DataFrame)

# Nelson-Aalen estimate of the cumulative hazard
R1 = fit(NelsonAalen, df.time, df.status)

# Julia reports one estimate per unique event time,
# while R's column has one value per observation
scatter(R1.events.time, R1.chaz, markersize = 2, label = "Julia")
scatter!(df.time, df.rchaz, markersize = 2, label = "R")

Version Info

  [336ed68f] CSV v0.10.11
  [a93c6f00] DataFrames v1.6.1
  [91a5bcdd] Plots v1.39.0
  [8a913413] Survival v0.3.0
ararslan commented 10 months ago

This package takes the same approach as R's survival package: it gives you estimates at the unique event times. It appears that mice undoes that by mapping the estimates back onto the observed times, which reinstates the duplicate entries: https://github.com/amices/mice/blob/1561578c68ccf2118c9a8423f4c8ebe336ae05b6/R/nelsonaalen.R#L48-L50
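
For illustration, here is a minimal Julia sketch of that mapping (my own, not from either package), assuming the df and R1 objects from the code above. It evaluates the unique-time step function at each observed time, so tied observations receive the same value:

# Evaluate the unique-time cumulative-hazard step function at every
# observed time. Every leuk observation is an event, so each observed
# time appears in R1.events.time and the index is always valid.
idx = [searchsortedlast(R1.events.time, t) for t in df.time]
chaz_per_obs = R1.chaz[idx]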

itsdfish commented 10 months ago

Got it. Thanks for clarifying. Given that using unique times is the norm, I will close this issue.

ararslan commented 10 months ago

Not related to the issue as stated, but it's interesting that in your figure Julia has a lower estimate of the hazard than R. I'm not sure why that is.

itsdfish commented 10 months ago

Yeah, that was part of my concern. I don't know how the algorithm works, but my naive hypothesis is that increasing the number of observations occurring at the same time leads to an increase in the cumulative hazard. This is consistent with the observation that Julia and R agree when the duplicates are removed. I think it could be tested further by increasing the number of duplicates, along the lines of the sketch below.
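
One way to run that test, as a toy sketch (made-up data, not the leuk set): fit the estimator on tied event times and again with the tie broken by a tiny jitter, then compare the final cumulative hazards.

using Survival

# Two events tied at t = 2 versus the same events with the tie broken;
# all observations are events (no censoring)
tied     = [1.0, 2.0, 2.0, 3.0]
jittered = [1.0, 2.0, 2.0 + 1e-6, 3.0]
status   = trues(4)

na_tied     = fit(NelsonAalen, tied, status)
na_jittered = fit(NelsonAalen, jittered, status)

# With the tie, the increment at t = 2 is d/n = 2/3; with it broken,
# it is 1/3 + 1/2, so the tied fit ends up with the lower estimate
last(na_tied.chaz), last(na_jittered.chaz)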

ararslan commented 10 months ago

AFAICT, this is because the mice function nelsonaalen doesn't actually compute the Nelson-Aalen estimate of the cumulative hazard. Instead, it fits a Cox proportional hazards model with the default options and extracts the baseline hazard of the result. Notably, coxph defaults to the Efron method of handling ties. This package's estimates agree with the baseline cumulative hazard from coxph when using the exact partial likelihood or Breslow's method.
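
To make that concrete, here is a hand-rolled sketch (my own illustration, not code from either package) of the baseline cumulative-hazard increment at a single event time with d tied events and n subjects at risk, assuming no covariates:

# Breslow's increment equals the Nelson-Aalen increment d/n;
# Efron's treats the tied events as if they occurred sequentially,
# so its increment is the sum of 1/(n - j) for j = 0, ..., d - 1
breslow_increment(d, n) = d / n
efron_increment(d, n) = sum(1 / (n - j) for j in 0:(d - 1))

breslow_increment(2, 3)  # 2/3 ≈ 0.667
efron_increment(2, 3)    # 1/3 + 1/2 ≈ 0.833

Whenever d > 1, Efron's increment exceeds Breslow's, which would account for the mice/coxph curve sitting above this package's estimate in the figure.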