CamDavidsonPilon / lifelines

Survival analysis in Python
lifelines.readthedocs.org
MIT License

CoxTimeVaryingFitter() raising `ZeroDivisionError: float division by 0` #768

Open indiana-nikel opened 5 years ago

indiana-nikel commented 5 years ago

Hi there,

I'm trying to fit a simple time-varying model with CoxTimeVaryingFitter(), using a cumulative sum of an event occurring as a covariate. However, every attempt at calling the .fit() method results in a ZeroDivisionError: float division by zero. I've attached a sample XLSX showing what a few of my observations look like: gh_issue.xlsx.

The code for fitting looks like this:

import numpy as np
import pandas as pd
from lifelines import CoxTimeVaryingFitter

df_fit = pd.read_excel("gh_issue.xlsx")
ctv = CoxTimeVaryingFitter()
ctv.fit(df_fit, id_col="ID", event_col="Event", start_col="Start", stop_col="Stop", show_progress=True)

The error appears to come from weighted_average = weight_count / tied_death_counts when self._get_gradients() is called. Is this intended behavior? If so, what is it signifying to the user?
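For anyone hitting the same error: as the follow-up comments suggest, the trigger turns out to be rows whose interval has zero length (Start == Stop). A minimal sketch of a pre-fit sanity check, using made-up data with the column names from the attached spreadsheet:

```python
import pandas as pd

# Made-up long-format data mimicking the attached spreadsheet; the middle
# row is a zero-length interval (Start == Stop), the problematic pattern.
df_fit = pd.DataFrame({
    "ID":    [1, 1, 1],
    "Start": [0, 2, 2],
    "Stop":  [2, 2, 5],
    "Event": [0, 0, 1],
    "cv":    [0, 1, 1],
})

# Flag degenerate intervals before calling ctv.fit().
zero_length = df_fit[df_fit["Start"] == df_fit["Stop"]]
```

If zero_length is non-empty, the dataset contains the pattern discussed below.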

Cheers, Indiana

CamDavidsonPilon commented 5 years ago

Thanks for the detailed report! This is defs not intended. I'll investigate this tomorrow.

indiana-nikel commented 5 years ago

Wonderful, thanks for the quick reply!

CamDavidsonPilon commented 5 years ago

So I think the problem is in the dataset¹. You have rows that happen instantaneously, for example:

[Screenshot (2019-07-08): dataset rows where a subject's interval starts and stops at the same time]

This should be represented, equivalently, as

[Screenshot (2019-07-08): the same information expressed as a single non-zero-length interval]

When I made these corrections to your dataset, the program ran fine. Ex: gh_issue copy 2.xlsx
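The manual correction described above can be sketched in pandas (made-up data): because add_covariate_to_timeline carries the updated covariate value onto the following interval, dropping the degenerate Start == Stop row is equivalent to merging it into its neighbour.

```python
import pandas as pd

# Made-up data: the middle row is instantaneous (Start == Stop == 2) and
# its covariate value (cv = 1) already appears on the next interval.
df = pd.DataFrame({
    "ID":    [1, 1, 1],
    "Start": [0, 2, 2],
    "Stop":  [2, 2, 5],
    "Event": [0, 0, 1],
    "cv":    [0, 1, 1],
})

# Drop zero-length rows so every remaining interval has positive duration.
df_fixed = df[df["Start"] != df["Stop"]].reset_index(drop=True)
```

This is only a sketch of the by-hand fix, not a lifelines API; it assumes the covariate change is already reflected on the adjacent row, as in the example above.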

I'm curious, did you use lifelines to generate this dataset? How could this have been more obvious to fix?

¹ However lifelines should have a check for this.

indiana-nikel commented 5 years ago

Hi Cam,

This dataset is a subset of what I'm working with, and the format was generated using both to_long_format and add_covariate_to_timeline. Looking at the fix you propose, I believe the problem lies within that step. A larger issue I ran into is that there are no "death" events, only a tapering off of activity. I "solved" this by imposing a time floor: a subject is considered to have "died" at their last activity time plus the floor. This is coded as an extra row in the dataset where nothing happens in that period except the "death" event.
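The artificial-death workaround described above can be sketched as follows (the FLOOR value, column names, and data are made up for illustration):

```python
import pandas as pd

# Assumed inactivity window: a subject with no activity for FLOOR time
# units after their last event is treated as having "died".
FLOOR = 30

activity = pd.DataFrame({
    "ID":            [1, 2],
    "last_activity": [100, 250],
})

# Synthetic death time = last observed activity + the floor.
activity["death_time"] = activity["last_activity"] + FLOOR
```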

I've taken the two datasets used in the formatting step of this and stripped them down similarly to the one I've shown above: df_base.xlsx df_cv.xlsx

Here is the code that I use to create the dataset fed into the CoxTimeVaryingFitter() above:

import pandas as pd
from lifelines.utils import to_long_format, add_covariate_to_timeline

df_base = pd.read_excel("df_base.xlsx")
df_cv = pd.read_excel("df_cv.xlsx")

df_ctv = df_base.pipe(to_long_format, duration_col="Duration")\
    .pipe(add_covariate_to_timeline, df_cv, duration_col="Time", id_col="ID", event_col="Event", cumulative_sum=True)
df_ctv.head()

Here we see those rows of instantaneous events pop up. I think this may also be due in part to my misunderstanding of what sorts of timelines are expected to be fed into the CoxTimeVaryingFitter().

Cheers, Indiana

CamDavidsonPilon commented 5 years ago

Yes, I see the problem. I'll need to change some internal code to handle this. Thanks for the follow up!

CamDavidsonPilon commented 5 years ago

I need to think more about the following question:

suppose we are recording a patient's measurements over time. One day, as we walk into the room, the patient dies (say at time t). A moment later, at time t+1, we look at the subject's measurements and record them.

i) Are the data recorded at time t+1 "allowed" in the inference? ii) What if we had looked at the measurements at time t instead? At t-1?

It seems silly to discard this row, but there are very valid cases where it makes sense to. Ex: if the measurement is heart rate, then at t+1 heart rate = 0, and our inference will strongly suggest that "0 heart rate => death", which is reverse causality.

What I'm trying to determine is if I should keep observations that land on exactly the time of death, or should they be discarded (implicitly or explicitly by the user).
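One of the options being weighed here, discarding measurements that land exactly on the death time, can be sketched with made-up data and column names:

```python
import pandas as pd

# Made-up observations: a heart-rate measurement taken exactly at the
# death time (t = 10) encodes the death itself (hr = 0), the reverse-
# causality case described above.
obs = pd.DataFrame({
    "ID":   [1, 1, 1],
    "time": [0, 5, 10],
    "hr":   [70, 68, 0],
})
death_time = 10

# One option: keep only measurements taken strictly before death.
obs_filtered = obs[obs["time"] < death_time]
```

This is just an illustration of the explicit-discard option; whether lifelines should do this implicitly is exactly the open question.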

indiana-nikel commented 5 years ago

Looking at events that occur exactly at the time of death, I think this would depend on what the event is. If it's perfectly correlated with "death" (heart rate = 0, clicking the 'Delete Account' button in a SaaS example), then we would have convergence problems (I ran into this in a similar but unrelated piece of work using the R package survival). Otherwise, that event might be a clear indicator of why a patient/customer would die/leave.

Does this question pertain to covariates changing over time or would it include categorical information to discard as well?

This also raises the question of instantaneous events at the beginning of a timeline. In my example, I've added 1 so that there are no instantaneous events at t=0; all starting events begin at t=1. Would this suggest that it makes sense to do the same for a death event?
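The +1 offset described above can be sketched as follows (made-up data; "Time" mirrors the column name passed to add_covariate_to_timeline earlier in the thread):

```python
import pandas as pd

# Made-up covariate events: the first one coincides with t = 0, which
# would create an instantaneous row at the start of the timeline.
df_cv = pd.DataFrame({
    "ID":    [1, 1],
    "Time":  [0, 3],
    "Event": [0, 0],
})

# Shift all covariate-change times forward by 1 so every event lands
# strictly after the timeline's origin.
df_cv["Time"] = df_cv["Time"] + 1
```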