CamDavidsonPilon / lifelines

Survival analysis in Python
lifelines.readthedocs.org
MIT License

Is there a way to use AalenJohansenFitter to plot incidence (hazard)? #1129

Open sokol11 opened 4 years ago

sokol11 commented 4 years ago

Hi. I am trying to visualize the incidence (hazard) function in a competing risk framework.

By default, the AalenJohansenFitter only plots the cumulative density, when .plot() is called.

Is there a way to plot the hazard function with calls to the AalenJohansenFitter API? Or should I be using a different estimator here?

Alternatively, is there a way to transform the cumulative density mathematically to get the hazard function? I am new to survival analysis.

Thanks!

CamDavidsonPilon commented 4 years ago

Hi @sokol11,

hm, not at the moment. The hazard is not directly estimated, so you would need to take the diff of the cumulative hazard and apply a smoothing function (similar to what we do in NelsonAalenFitter).

This can be added to lifelines in the future though.
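A minimal sketch of the "diff, then smooth" idea described above, in plain Python. lifelines already does kernel smoothing for Nelson-Aalen through `NelsonAalenFitter.smoothed_hazard_(bandwidth)`. The helper names below (`epanechnikov`, `smoothed_hazard`) are hypothetical and are not lifelines' implementation. The inputs are assumed to be event times and the cumulative hazard at those times, however you obtained them; the numbers are made up for illustration.

```python
def epanechnikov(u):
    """Epanechnikov kernel: 0.75 * (1 - u^2) on [-1, 1], zero elsewhere."""
    return 0.75 * (1.0 - u * u) if abs(u) <= 1.0 else 0.0


def smoothed_hazard(times, cum_hazard, grid, bandwidth):
    """Kernel-smooth the jumps of a step-function cumulative hazard.

    times      -- increasing event times
    cum_hazard -- cumulative hazard at each of those times
    grid       -- time points at which to evaluate the smoothed hazard
    bandwidth  -- kernel half-width, in the same units as times
    """
    # "Take the diff": the jump at each event time is the hazard increment.
    increments = [cum_hazard[0]] + [b - a for a, b in zip(cum_hazard, cum_hazard[1:])]
    # Each grid point gets a kernel-weighted sum of the nearby increments.
    return [
        sum(epanechnikov((t - ti) / bandwidth) * dh for ti, dh in zip(times, increments))
        / bandwidth
        for t in grid
    ]


# Illustrative (made-up) cumulative hazard values at four event times.
times = [1.0, 2.0, 4.0, 5.0]
cum_hazard = [0.125, 0.268, 0.518, 1.018]
grid = [1.0, 2.0, 3.0, 4.0, 5.0]
smooth = smoothed_hazard(times, cum_hazard, grid, bandwidth=2.0)
for t, h in zip(grid, smooth):
    print(f"t={t:.0f}  smoothed hazard={h:.3f}")
```

The bandwidth controls the bias/variance trade-off: a wider kernel gives a smoother but flatter curve. Estimates within one bandwidth of the start or end of follow-up are biased downward, because part of the kernel falls outside the observed range.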

sokol11 commented 4 years ago

Hey. Thank you so much. Understood.

So right now AalenJohansenFitter does not have the functionality to compute the smoothed hazard like in NelsonAalenFitter but I can try to do that manually?

Also, do you by any chance know if NelsonAalenFitter is appropriate to use in a competing risk framework?

Thanks again!

pzivich commented 4 years ago

Hey @sokol11, you can easily calculate the cause-specific hazard as the number of events of interest at time t divided by the number of individuals who are still event-free and uncensored at time t. The cumulative hazard is then just the cumulative sum of those discrete cause-specific hazards. Like @CamDavidsonPilon said, you can also apply a smoothing function. So, you can do this directly from the tabled data, treating all competing events as censored observations.
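The calculation described above can be sketched in plain Python as follows. The function name, the event coding (0 = censored, 1 = event of interest, 2 = competing event), and the toy data are all assumptions made for illustration. Competing events are treated like censoring: they remove a subject from the risk set without counting as an event.

```python
from collections import Counter


def cause_specific_hazard(durations, events, event_of_interest):
    """Discrete cause-specific hazard and its running sum.

    events uses 0 for censoring; any code other than event_of_interest
    (competing events included) only removes the subject from the risk set.
    Returns rows of (time, at_risk, n_events, hazard, cumulative_hazard).
    """
    d_j = Counter(t for t, e in zip(durations, events) if e == event_of_interest)
    leaving = Counter(durations)  # everyone leaves the risk set at their own time
    at_risk = len(durations)
    cum_hazard = 0.0
    rows = []
    for t in sorted(leaving):
        if d_j[t] > 0:
            hazard = d_j[t] / at_risk  # same time point, not lagged
            cum_hazard += hazard
            rows.append((t, at_risk, d_j[t], hazard, cum_hazard))
        at_risk -= leaving[t]
    return rows


# Toy data: 0 = censored, 1 = event of interest, 2 = competing event.
durations = [1, 2, 2, 3, 4, 4, 5, 6]
events = [1, 2, 1, 0, 1, 2, 1, 0]
rows = cause_specific_hazard(durations, events, event_of_interest=1)
for t, n, d, h, H in rows:
    print(f"t={t}  at_risk={n}  events={d}  hazard={h:.3f}  cum_hazard={H:.3f}")
```

The running sum is the Nelson-Aalen estimate of the cause-specific cumulative hazard, which is why fitting `NelsonAalenFitter` with competing events recoded as censored should give the same quantity.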

Nelson-Aalen is generally not appropriate for competing risks, but I think it is fine for the scenario you describe (estimating cause-specific hazards and cause-specific cumulative hazards), if you treat the competing events as censored. The problems would start if you tried to convert those hazards into survival or incidence density. Aalen-Johansen is a different estimator that does allow for estimation of the incidence density with competing events.

sokol11 commented 4 years ago

@pzivich Hey thanks so much. Apologies, because I am a little fuzzy on the terminology. I am trying to estimate a point-in-time probability of an event as a function of time. In other words, I am trying to answer the question of what is the probability of event X happening at time t. I thought that such a probability is given by the hazard function. And that incidence and hazard are basically synonyms in this context. Thus, I am a little confused by "incidence density". Could you please elaborate?

In regard to the first couple of sentences, you are basically saying that the point-in-time hazard can be estimated as a simple fraction of events-of-interest at time t to the total number of individuals surviving at time t - 1? Correct?

I should clarify that when I say point-in-time, I mean a time period of defined duration, e.g., a month. So, the exact question I am trying to answer is: What is the probability of an event happening in month 1, month 2, and so on? And what does the time graph of that probability look like?

pzivich commented 4 years ago

> I thought that such a probability is given by the hazard function. And that incidence and hazard are basically synonyms in this context. Thus, I am a little confused by "incidence density". Could you please elaborate?

Cameron has a good description of the measures here, but to briefly summarize:

The hazard (h(t)) is the instantaneous rate of the event. The hazard is the conditional probability of an event between two times, conditional on survival until the first, divided by the difference between the two times. The hazard is evaluated by taking the limit as the difference between the two times goes to zero. It is unwieldy in words, so I recommend looking at the notation in the docs. So the hazard is a rate, not a probability. Therefore, I don't think you would actually want the hazard. Incidence rates are a related concept: an incidence rate is an average of the hazard over a period of time. The key point is that incidence rates are similar to hazards. Incidence is not.

The cumulative hazard (H(t)) is the summation (integral) of the hazard up to a particular point in time. Similarly, I don't think you want this quantity.

The cumulative incidence function (F(t)), which is what I meant by "incidence density", is the probability of the event before time t. It is also written as the CDF. This is not a synonym for hazard.
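For reference, the three quantities above can be written out for a competing-risks setting as follows. The subscript j for the event type and the overall event-free survival S(t) are notation added here, not taken from the thread. These are the standard textbook definitions for a continuous event time T with event type J, where T ≤ t and T < t are equivalent.

```latex
% Cause-specific hazard: an instantaneous rate, not a probability
h_j(t) = \lim_{\delta \to 0^{+}}
         \frac{\Pr(t \le T < t + \delta,\; J = j \mid T \ge t)}{\delta}

% Cause-specific cumulative hazard
H_j(t) = \int_0^t h_j(u)\, du

% Cumulative incidence function for event type j: a probability
F_j(t) = \Pr(T \le t,\; J = j) = \int_0^t S(u)\, h_j(u)\, du,
\qquad
S(t) = \exp\!\Big(-\sum_k H_k(t)\Big)
```

Because S(t) depends on the hazards of every event type, F_j(t) cannot be recovered from h_j(t) alone.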

> In regard to the first couple of sentences, you are basically saying that the point-in-time hazard can be estimated as a simple fraction of events-of-interest at time t to the total number of individuals surviving at time t - 1? Correct?

The discrete (point-in-time) hazard can be estimated by dividing the number of events at t by the number still at risk at t, i.e., those who have neither had an event nor been censored before t. It should be the same moment in time, not lagged like t-1. However, you can use Nelson-Aalen for this as long as you only look at the cause-specific hazard / cumulative hazard for the event. The procedure is basically the same. The issue is that the conversion tricks in the above-referenced docs no longer apply in competing-risk analyses. So you can't go directly from the hazard to the cumulative incidence function.
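To illustrate why the usual conversion breaks down, here is a small sketch comparing two estimates at the end of follow-up. One is the naive "1 minus Kaplan-Meier" with competing events treated as censored. The other is an Aalen-Johansen style cumulative incidence. The function name, event coding (0 = censored), and toy data are assumptions for illustration. Ties are handled directly here, so this is not a reproduction of lifelines' `AalenJohansenFitter`.

```python
def naive_and_aalen_johansen(durations, events, event_of_interest):
    """Return (naive, aj), two estimates of cumulative incidence at the last time.

    naive -- 1 - Kaplan-Meier, with competing events treated as censored
    aj    -- Aalen-Johansen: sum over times of S(just before t) * d_j / n
    events uses 0 for censoring.
    """
    n_at_risk = len(durations)
    km_cause = 1.0  # "survival" from the event of interest only
    surv_all = 1.0  # probability of no event of any type so far
    aj = 0.0
    for t in sorted(set(durations)):
        at_t = [e for d, e in zip(durations, events) if d == t]
        d_j = sum(1 for e in at_t if e == event_of_interest)
        d_any = sum(1 for e in at_t if e != 0)
        aj += surv_all * d_j / n_at_risk  # uses survival just before t
        km_cause *= 1.0 - d_j / n_at_risk
        surv_all *= 1.0 - d_any / n_at_risk
        n_at_risk -= len(at_t)
    return 1.0 - km_cause, aj


# Toy data: 0 = censored, 1 = event of interest, 2 = competing event.
durations = [1, 2, 2, 3, 4, 4, 5, 6]
events = [1, 2, 1, 0, 1, 2, 1, 0]
naive, aj = naive_and_aalen_johansen(durations, events, event_of_interest=1)
print(f"naive 1 - KM   : {naive:.5f}")
print(f"Aalen-Johansen : {aj:.5f}")
```

On this toy data the naive estimate is larger. It implicitly assumes that subjects removed by the competing event could still go on to have the event of interest, which inflates the probability.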

> So, the exact question I am trying to answer is: What is the probability of an event happening in month 1, month 2, and so on? And what does the time graph of that probability look like?

Based on this, it sounds like you want the CDF. The CDF from the Aalen-Johansen is the cumulative probability of the specific event occurring before time t in the presence of competing risks. Notice that it may differ a little from what you want: the CDF for the Aalen-Johansen is F(t) = Pr(T < t, J=j), where J is the event. It sounds like you are interested in Pr(T < t, J=j | T>t-1), based on the phrasing. This conditions on survival up until the previous time point. This is generally harder to interpret (in my opinion), so I would recommend the standard F(t).
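The difference between the two quantities can be made concrete with a short sketch. It assumes you already have the cumulative incidence F(t) and the overall event-free survival S(t), each evaluated at the end of every month. The function name and the numbers are hypothetical. In practice F(t) could come from the Aalen-Johansen estimate and S(t) from a Kaplan-Meier fit that counts any event type as an event.

```python
def monthly_probabilities(cif, surv):
    """Per-month probabilities of the event of interest.

    cif[k]  -- F(k): cumulative incidence of the event by the end of month k
    surv[k] -- S(k): probability of no event of any type by the end of month k
    Index 0 is time zero, so cif[0] == 0.0 and surv[0] == 1.0.
    Returns rows of (month, unconditional, conditional).
    """
    rows = []
    for k in range(1, len(cif)):
        # Probability that the event happens during month k.
        unconditional = cif[k] - cif[k - 1]
        # Same, given that nothing has happened before month k starts.
        if surv[k - 1] > 0:
            conditional = unconditional / surv[k - 1]
        else:
            conditional = float("nan")
        rows.append((k, unconditional, conditional))
    return rows


# Hypothetical month-end values, for illustration only.
cif = [0.00, 0.10, 0.18, 0.24]
surv = [1.00, 0.85, 0.70, 0.60]
monthly = monthly_probabilities(cif, surv)
for month, p, p_cond in monthly:
    print(f"month {month}: Pr(event in month)={p:.3f}  given event-free at start={p_cond:.3f}")
```

The unconditional column sums back to F(t) at the last month. The conditional column does not, which is part of why it is harder to interpret.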

A good article to read, with some examples of what the plots would look like, is Edwards et al. 2016.

sokol11 commented 4 years ago

Hey man. This is really helpful. I might post here again if I have a follow-up question, but I think you have set me on the right path already. Again, thank you so much for providing such an exhaustive and clear answer.