dieguer opened 5 years ago:

I am working with a data set of 12 million observations and around 30 columns. I fit a log linear AFT model and then try to use `predict_survival_function`. I am operating on a VM with 425 GB of RAM. Somehow the procedure depletes the system memory and is unable to make the prediction.
Yea, I always worried about this. Can I ask you a few questions:
"log linear aft model " - do you mean log-normal AFT model, or log-logistic AFT model?
How many unique end points do you have? Like, if you did df['T'].value_counts()
- how big is the resulting series?
Are you using the times
argument in predict_survival_function
?
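For question 2, a quick check might look like the following (a minimal sketch; the DataFrame `df` and its duration column `'T'` are assumptions about your setup):

```python
# Count occurrences of each unique end point (duration) in the dataset.
counts = df['T'].value_counts()
print(len(counts))  # number of unique end points
```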
Thanks for your quick answer @CamDavidsonPilon.
1. I mean the log-normal AFT model.
2. I have 1093 unique end points.
3. I have tried both the default argument and `times=np.linspace(0, 1200, 600)`; however, the system runs out of memory in both cases.
Another question: what do you need the entire survival function for?
I think there are two problems:

1) 12m survival curves is a lot. Naively, the resulting data structure contains 12m x 1093 = ~13B floats - that is a lot for Python. Is that too much for 425 GB of RAM? Probably not? (Back of the envelope: 13B floats x 8 bytes ≈ 105 GB. But that's still 25% of the system.) To help with this, use the `times` argument with a smaller cardinality, e.g. `np.linspace(0, 1200, 50)` (see the sketch after the code below).

2) In lifelines, after we create the matrix of survival functions, we pass it into a DataFrame. This might (I am not familiar enough with DataFrame construction internals) blow up (at worst: duplicate) the size in memory. If this is the case, the following could work (I've removed the Pandas code):
```python
def predict_cumulative_hazard(self, df, times=None):
    """
    Predict the cumulative hazard for individuals, given their covariates.

    Parameters
    ----------
    df: DataFrame
        a (n,d) DataFrame. If a DataFrame, columns
        can be in any order. If a numpy array, columns must be in the
        same order as the training data.
    times: iterable, optional
        an iterable of increasing times to predict the cumulative hazard at.
        Default is the set of all durations in the training dataset
        (observed and unobserved).
    """
    times = coalesce(times, self.timeline, np.unique(self.durations))
    n = df.shape[0]
    Xs = self._create_Xs_dict(df)
    params_dict = {
        parameter_name: self.params_.values[self._LOOKUP_SLICE[parameter_name]]
        for parameter_name in self._fitted_parameter_names
    }
    # Return the raw array directly, without wrapping it in a DataFrame.
    return self._cumulative_hazard(params_dict, np.tile(times, (n, 1)).T, Xs)
```
(Note that this is `predict_cumulative_hazard`, but `predict_survival_function` uses it directly.)
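To make both suggestions concrete, here is a minimal usage sketch. The names `aft` (an already-fitted `LogNormalAFTFitter`) and `X` (the covariate DataFrame) are assumptions, and the patched function above still relies on lifelines internals (`coalesce`, `self._create_Xs_dict`, etc.), so treat this as a sketch rather than a drop-in fix:

```python
import types
import numpy as np

# Fix 1: evaluate on a coarse 50-point grid instead of all 1093 unique durations.
coarse_times = np.linspace(0, 1200, 50)
surv_df = aft.predict_survival_function(X, times=coarse_times)  # (50, n) DataFrame

# Fix 2: bind the DataFrame-free method above onto the fitted model and
# convert the cumulative hazard to survival probabilities by hand,
# since S(t) = exp(-H(t)).
aft.predict_cumulative_hazard = types.MethodType(predict_cumulative_hazard, aft)
cum_haz = aft.predict_cumulative_hazard(X, times=coarse_times)  # plain ndarray
surv = np.exp(-cum_haz)
```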
Thanks that really helped!
Did it? Which one (or both)?
Cutting the number of points in the times grid. When trying just the code edit on its own, the results were the same (it still ran out of memory).
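For reference, the back-of-the-envelope arithmetic from above shows why the grid size dominates (float64 assumed, 8 bytes per value):

```python
n_subjects = 12_000_000

print(n_subjects * 1093 * 8 / 1e9)  # full grid of 1093 durations: ~105 GB
print(n_subjects * 50 * 8 / 1e9)    # coarse 50-point grid: ~4.8 GB
```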