CamDavidsonPilon / lifelines

Survival analysis in Python
lifelines.readthedocs.org

predict_survival_function uses too much memory #786

Open dieguer opened 5 years ago

dieguer commented 5 years ago

I am working with a data set of 12 million observations and around 30 columns. I fit a log-linear AFT model and then try to use predict_survival_function. I am operating on a VM with 425 GB of RAM. Somehow the procedure depletes the system memory and fails to produce the prediction.

CamDavidsonPilon commented 5 years ago

Yeah, I've always worried about this. Can I ask you a few questions:

  1. "log linear aft model " - do you mean log-normal AFT model, or log-logistic AFT model?

  2. How many unique end points do you have? Like, if you did df['T'].value_counts() - how big is the resulting series?

  3. Are you using the times argument in predict_survival_function?

dieguer commented 5 years ago

Thanks for your quick answer @CamDavidsonPilon.

1. I mean the log-normal AFT model.

2. I have 1093 unique end points.

3. I have tried both the default argument and times=np.linspace(0, 1200, 600); however, the system runs out of memory in both cases.

CamDavidsonPilon commented 5 years ago

Another question: what do you need the entire survival function for?

CamDavidsonPilon commented 5 years ago

I think there are two problems:

1) 12M survival curves is a lot. Naively, the resulting data structure contains 12M x 1093 ≈ 13B floats, which is a lot for Python. Is that too much for 425 GB of RAM? Probably not (back of the envelope, 13B float64 values take about 100 GB, but that's still roughly 25% of the system). To help with this, use the times argument with a smaller cardinality, e.g. np.linspace(0, 1200, 50); a chunked sketch of this appears after the code below.

2) In lifelines, after we create the matrix of survival functions, we pass it into a DataFrame. This might (I am not familiar enough with DataFrame construction internals) blow up, at worst duplicate, the size in memory. If this is the case, the following could work (I've removed the Pandas code):

    def predict_cumulative_hazard(self, df, times=None):
        """
        Predict the cumulative hazard for individuals, given their covariates.

        Parameters
        ----------

        df: DataFrame
            a (n,d) DataFrame. If a DataFrame, columns
            can be in any order. If a numpy array, columns must be in the
            same order as the training data.
        times: iterable, optional
            an iterable of increasing times to predict the cumulative hazard at. Default
            is the set of all durations in the training dataset (observed and unobserved).

        """
        times = coalesce(times, self.timeline, np.unique(self.durations))
        n = df.shape[0]
        # Build the design matrices for each fitted parameter from the covariates.
        Xs = self._create_Xs_dict(df)

        # Slice the flat fitted-parameter vector into one array per model parameter.
        params_dict = {
            parameter_name: self.params_.values[self._LOOKUP_SLICE[parameter_name]]
            for parameter_name in self._fitted_parameter_names
        }

        # Evaluate the cumulative hazard on a (len(times), n) grid and return the
        # raw ndarray instead of wrapping it in a DataFrame.
        return self._cumulative_hazard(params_dict, np.tile(times, (n, 1)).T, Xs)

(Note that this is predict_cumulative_hazard, but predict_survival_function calls it directly.)
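For reference, since S(t|x) = exp(-H(t|x)), the survival matrix can be derived from the raw cumulative-hazard array without ever building a DataFrame. A minimal, hypothetical sketch (predict_survival_function_lean is not a lifelines function; it assumes the patched method above returns a plain ndarray):

    import numpy as np

    def predict_survival_function_lean(model, df, times=None):
        # S(t|x) = exp(-H(t|x)); operate on the raw (len(times), n) ndarray.
        cum_haz = model.predict_cumulative_hazard(df, times=times)
        return np.exp(-cum_haz)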
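And here is the coarser-grid idea from point 1 as a hedged sketch on synthetic data (the column names, chunk size, and time grid are illustrative, not from the thread):

    import numpy as np
    import pandas as pd
    from lifelines import LogNormalAFTFitter

    # Synthetic stand-in for the real 12M-row dataset.
    rng = np.random.default_rng(0)
    n = 10_000
    df = pd.DataFrame({"x1": rng.normal(size=n), "x2": rng.normal(size=n)})
    df["T"] = np.exp(1.0 + 0.3 * df["x1"] + 0.5 * rng.normal(size=n))
    df["E"] = rng.integers(0, 2, size=n)

    aft = LogNormalAFTFitter()
    aft.fit(df, duration_col="T", event_col="E")

    # 50 evaluation times instead of one per unique duration.
    coarse_times = np.linspace(0, 1200, 50)

    # Score in chunks so peak memory is ~chunk_size x 50 floats, not n x 1093.
    chunk_size = 2_000
    covariates = df[["x1", "x2"]]
    for start in range(0, n, chunk_size):
        chunk = covariates.iloc[start : start + chunk_size]
        sf = aft.predict_survival_function(chunk, times=coarse_times)
        # ... aggregate or persist the (50, chunk_size) result before the next chunk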

dieguer commented 5 years ago

Thanks, that really helped!

CamDavidsonPilon commented 5 years ago

Did it? Which one (or both)?

dieguer commented 5 years ago

Cutting the number of points in the time grid is what helped. When trying just the code edit, the results were the same.