To do list: porting AcceleratedFailure

piever commented 6 years ago

Now that (hopefully) the Cox PR is nearing merging, and following on #3, I wanted to make a checklist of features to port from AcceleratedFailure (the other survival package, I've renamed it to avoid name clashing).

[x] Cox regression
[x] Simplified Newton Raphson algorithm with backtracking
[x] Nelson-Aalen estimator (estimate cumulative hazard from censored data)
[ ] Nelson-Aalen for Cox model (estimate baseline cumulative hazard from a fit Cox model)
[ ] Tools to compute gradient and hessian of cumulative functions accurately and efficient (culumative functions are often involved in likelihood computations in models with censored data)
[ ] Accelerated failure time models
[ ] Parsing of EventTimes (though I didn't quite figure out in my package what exactly one needs to overload to get CSV to write and read DataFrames when one column is made of EventTimes - maybe it's just easier to have a "time" and a "censored" columns, I'm not sure)

1 and 2 are almost done, 3 and 4 will be next. They will probably require some refactoring as the code is extremely similar for Nelson-Aalen, Kaplan-Meier and cumulative incidence. @ararslan : if you prefer I can wait that you refactor the code for cumulative incidence before adding Nelson-Aalen(in which case I would start by porting the tools for differentiation of cumulatives), otherwise I can try refactoring myself.

ararslan commented 6 years ago

Excellent list!

I'm not sure when I'll have time to do the refactoring to avoid duplication between Kaplan-Meier, cumulative incidence, and Nelson-Aalen, so a PR for that would be welcome if you're up for it. I took a stab at it a couple weeks ago but was pretty unsatisfied with the result, so I never pushed the branch. The general idea was to have a mutable state object for each estimator that gets updated in the loop, then all that's required for each estimator is defining methods for an updating function. That may still be a viable approach; my implementation of it just left a lot to be desired.

It's also unclear to me how best to read and write EventTimes. How does R handle it with write.csv (or whatever they call it)? It could possibly be as simple as defining show(::IO, ::EventTime) and parse(::Type{EventTime}, ::String) to get text-based I/O to work, but I'm not sure exactly.

Gradients and hessians could potentially be handled by a dependency on a package like ForwardDiff, and optimization with Optim and LineSearches. I'd be fine with dependencies on those packages.

piever commented 6 years ago

I'm unsure that my implementation will be better than yours... I guess one could have an update_function(::Type{T}, k, n) where T that would be overloaded for every estimator (KaplanMeier, NelsonAalen...). I'll give it a shot and see how it goes.

EDIT: actually it's much easier to refactor to add Nelson Aalen without code duplication, as exactly the same quantities are needed, the cumulative incidence case seems harder

As for the gradient and hessian tools, IIRC when I developed my stuff it was not possible to use automatic differentiation as most of the distributions pdfs and cdfs were computed with some external R library. I'm not sure if that's still the case. If not I'd be curious to compare the solution in AcceleratedFailure with say ForwardDiff.

nalimilan commented 6 years ago

It's also unclear to me how best to read and write EventTimes. How does R handle it with write.csv (or whatever they call it)? It could possibly be as simple as defining show(::IO, ::EventTime) and parse(::Type{EventTime}, ::String) to get text-based I/O to work, but I'm not sure exactly.

I don't think R really supports writing and loading this kind of data to/from CSV. Writing a Surv object to CSV gives two columns (time and status), with one event per row (Surv is a matrix in R, not a single even). If it's stored as a data frame column, it gets transformed into two separate columns (!).

piever commented 6 years ago

I think we would need to first overload parse for EventTime. Something like:

Base.parse(T::Type{EventTime{S}}, str::String) where {S} =
    endswith(str, '+') ? EventTime(parse(S, chop(str)), false) : EventTime(parse(S, str), true)

And then the following should work:

CSV.read(filename, types = Dict("event" => EventTime{Int64}))

I'm not sure if there's a way to save this information in the file so that the user doesn't have to provide it every time.

JuliaStats / Survival.jl

To do list: porting AcceleratedFailure #5