Bugs of Incorrect Calculation of Baseline Hazard & Baseline Cumulative Hazard

bofeng2018 commented 1 year ago

Hello, I am a senior data scientist from Prudential Financial. While working on one project involving the Cox proportional hazard models, I found that the baseline hazard and baseline cumulative hazard were obviously calculated incorrectly using the CoxPHFitter module within the lifelines package. Before jumping into the bug details, I want to share the version information first. The lifelines package I was using was 0.27.0, but by the time I checked the source codes in GitHub again this morning, which belonged to the version 0.27.4 I believe, I still saw the same bugs.

The bugs originate from the _fit_model function in the SemiParametricPHFitter class, which start from the line 1252 of the coxph_fitter.py file. Notice that, the standardized data are supplied to the _fitmodel function for further estimation. This will not be a problem for the Cox coefficient estimation (i.e., the params in the function), because they are restored to their original scales after being divided by corresponding standard deviations of the original data as in line 1399. However, no similar action has been taken for the predicted_partialhazards calculated in line 1392. I notice that in line 1393, a matrix multiplication of the standardized data with the uncorrected Cox coefficients was used to avoid the scale issues. However, the location issues were never taken care of. As a result, it is almost like the raw data are shifted which direction and extent depend on their original mean values, and the effect of the shift is incorrectly transferred to the baseline hazard of the Cox model. Thus, it causes all the subsequent calculations of the baseline hazard and baseline cumulative hazard to be incorrect.

The fixes for the bugs are straightforward, which is basically to use the unstandardized data for baseline hazard related calculations. I have appended some codes below for a lazy fix. By adding them right at the line 1270, it should generate the correct baseline hazard and baseline cumulative hazard. However, this is definitely not ideal, since incorrect calculations are not removed but rather just replaced. If desired, I would be very happy to work with the lifelines development team to come up with a permanent and neater fix for it.

predicted_partial_hazards_ = (
    pd.DataFrame(np.exp(dot(X.values, self.params_)), columns=["P"]).assign(T=T.values, E=E.values, W=weights.values).set_index(X.index)
)
self.baseline_hazard_ = self._compute_baseline_hazards(predicted_partial_hazards_)
self.baseline_cumulative_hazard_ = self._compute_baseline_cumulative_hazard(self.baseline_hazard_)

CamDavidsonPilon commented 1 year ago

Hi @bofeng2018, thank you for the detailed issue. This is interesting. Give me a few days to try some corrections.

bofeng2018 commented 1 year ago

Hi @bofeng2018, thank you for the detailed issue. This is interesting. Give me a few days to try some corrections.

Happy new year! It has been a while and just want to follow up with the issue that I reported. Please let me know if there is anything that I can help with.

CamDavidsonPilon commented 1 year ago

Hey @bofeng2018, I was just thinking about this issue. I haven't had time to devote to this - do you want to submit a PR with a recommend change (even a temporary one that I can replace later)? With a fix in, I can cut a new release too

bofeng2018 commented 1 year ago

Hey @bofeng2018, I was just thinking about this issue. I haven't had time to devote to this - do you want to submit a PR with a recommend change (even a temporary one that I can replace later)? With a fix in, I can cut a new release too

Sure. I just created a pull request. Please let me know if you see any problems.

user799595 commented 11 months ago

Hello

I came accross this issue, and was wondering if it is still open and what the impact is.

I would like to use lifelines for Cox PH regression, but I don't know enough about this area to be decide if this implementation is safe to use.

Thank you!

CamDavidsonPilon commented 8 months ago

Hi, I'm finally investigating this more carefully. I've written these two tests below, but am unable to make them fail (all assertions are true).

    def test_baseline_prediction_with_extreme_means(self, rossi):
        cph = CoxPHFitter()
        cph.fit(rossi, "week", "arrest")

        rossi_shifted = rossi.copy()
        rossi_shifted['prio'] += 100
        cph_shifted = CoxPHFitter()
        cph_shifted.fit(rossi_shifted, "week", "arrest")

        # make sure summary stats are equal
        assert_frame_equal(cph_shifted.summary, cph.summary)

        # confirm hazards are equal
        assert_frame_equal(cph.baseline_hazard_, cph_shifted.baseline_hazard_)
        assert_frame_equal(cph.baseline_cumulative_hazard_, cph_shifted.baseline_cumulative_hazard_)

    def test_baseline_prediction_with_extreme_scaling(self, rossi):
        cph = CoxPHFitter()
        cph.fit(rossi, "week", "arrest")

        rossi_scaled = rossi.copy()
        rossi_scaled['prio'] *= 100
        cph_scaled = CoxPHFitter()
        cph_scaled.fit(rossi_scaled, "week", "arrest")

        # make sure summary stats are equal - note that CI and coefs are unequal since we scaled params.
        assert_frame_equal(cph_scaled.summary[['z', 'p']], cph.summary[['z', 'p']])

        # confirm hazards are equal
        assert_frame_equal(cph.baseline_hazard_, cph_scaled.baseline_hazard_)
        assert_frame_equal(cph.baseline_cumulative_hazard_, cph_scaled.baseline_cumulative_hazard_)

Also, when I implement your solution, the following important tests fail: https://github.com/CamDavidsonPilon/lifelines/blob/master/lifelines/tests/test_estimation.py#L4594

https://github.com/CamDavidsonPilon/lifelines/blob/master/lifelines/tests/test_estimation.py#L4573

Some others fail, too, but these are the more relevant ones as they compare against a R's survival library.

Can you provide a simple example, dummy data or not, that shows the difference?

bofeng2018 commented 8 months ago

Hi Cameron,

Sorry for the late reply. It took me a while to prepare the notebook in the attachment. You can refer to the notebook to see solid proof of what was wrong with the baseline (cumulative) hazard calculation.

I did not have time to go through the R library to see the other issues you mentioned. Maybe after we reach an agreement on the issues as demonstrated in the notebook, we can talk again about the comparison issues.

Regards,

Lizhou

Lizhou Nie Senior Data Scientist Prudential Financial

On Sat, Dec 30, 2023 at 4:53 PM Cameron Davidson-Pilon < @.***> wrote:

Hi, I'm finally investigating this more carefully. I've written these two tests below, but am unable to make them fail (all assertions are true).
def test_baseline_prediction_with_extreme_means(self, rossi):
    cph = CoxPHFitter()
    cph.fit(rossi, "week", "arrest")

    rossi_shifted = rossi.copy()
    rossi_shifted['prio'] += 100
    cph_shifted = CoxPHFitter()
    cph_shifted.fit(rossi_shifted, "week", "arrest")

    # make sure summary stats are equal
    assert_frame_equal(cph_shifted.summary, cph.summary)

    # confirm hazards are equal
    assert_frame_equal(cph.baseline_hazard_, cph_shifted.baseline_hazard_)
    assert_frame_equal(cph.baseline_cumulative_hazard_, cph_shifted.baseline_cumulative_hazard_)

def test_baseline_prediction_with_extreme_scaling(self, rossi):
    cph = CoxPHFitter()
    cph.fit(rossi, "week", "arrest")

    rossi_scaled = rossi.copy()
    rossi_scaled['prio'] *= 100
    cph_scaled = CoxPHFitter()
    cph_scaled.fit(rossi_scaled, "week", "arrest")

    # make sure summary stats are equal - note that CI and coefs are unequal since we scaled params.
    assert_frame_equal(cph_scaled.summary[['z', 'p']], cph.summary[['z', 'p']])

    # confirm hazards are equal
    assert_frame_equal(cph.baseline_hazard_, cph_scaled.baseline_hazard_)
    assert_frame_equal(cph.baseline_cumulative_hazard_, cph_scaled.baseline_cumulative_hazard_)
Also, when I implement your solution, the following important tests fail:

https://github.com/CamDavidsonPilon/lifelines/blob/master/lifelines/tests/test_estimation.py#L4594

https://github.com/CamDavidsonPilon/lifelines/blob/master/lifelines/tests/test_estimation.py#L4573

Some others fail, too, but these are the more relevant ones as they compare against a R's survival library.

Can you provide a simple example, dummy data or not, that shows the difference?

— Reply to this email directly, view it on GitHub https://github.com/CamDavidsonPilon/lifelines/issues/1475#issuecomment-1872613604, or unsubscribe https://github.com/notifications/unsubscribe-auth/AJCTY7PGLT4VUBRIWE23QF3YMCENNAVCNFSM6AAAAAASORQBPWVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMYTQNZSGYYTGNRQGQ . You are receiving this because you were mentioned.Message ID: @.***>

CamDavidsonPilon commented 8 months ago

@bofeng2018 email attachments won't show in issues, can you repost your nb in the issue thread?

bofeng2018 commented 8 months ago

Demonstration.ipynb.zip

Attached is the zipped nb. Let me know if you can't open it.

CamDavidsonPilon commented 8 months ago

(Sorry about the spam - I keep alternating between if I am convinced I've found the problem or not - still investigating)

bofeng2018 commented 8 months ago

(Sorry about the spam - I keep alternating between if I am convinced I've found the problem or not - still investigating)

Totally understandable. It also took me a while to finally confirm that the problem was because of the data standardization. Let me know if you have any questions.

CamDavidsonPilon commented 8 months ago

Okay, here's an (updating) summary of what I know. This may change:

the baseline hazard (and cumulative hazard) can be computed at the mean values, or at all zeros. We, don't ask why yet, defaulted to the latter. This is the same as R's basehaz(..., centered=FALSE) which is what we test against¹.
From @bofeng2018 notebook, we can compute the factor the cumulative baseline hazard is "off": exp(-healthScoreCoef * 0.5) (the 0.5 is the mean of the single covariate). This makes the curves line up:
generally, the factor is exp(-self.params dot self._norm_means)
I'm putting "off" in quotations, since it's partially a decision for the user (that I made for them)
Need to investigate the predict functions.
Thinking about why centered=FALSE was chosen:
- the current method has the property that the baseline hazard is indp. of shifts in the covariates. That is, if we shift a covariate by 100, the baseline hazard doesn't change. This is desirable: all else being equal, if I change my measurement stick to start at 100, this shouldn't effect survival. Unless I start pairing it with covariates - then I need to handle the 100 somewhere. (Note that shifting by 100 never effects the coefs).
- the proposed method doesn't have this property. If I add 10 to the dat, ex: dat['Health Score'] += 10, then the proposed result looks skewed, whereas the current method's result is the same:

Suffice to say, this is confusing for us, so it needs to be made clear for users.

¹ although our tests say centered=TRUE, which tripped me up.

CamDavidsonPilon commented 8 months ago

Some more thoughts:

this exp(-params * means) factor is introduced in the predict functions. So when you call predict_survival_function(x), a exp(-params * means) is introduced. At a high level:

predict_survival_function(x) = exp(-CBH(t) * exp(params * (x - means))
                         = exp(-CBH(t) * exp(-params * means) * exp(params * x)
                         = exp(- (CBH(t) * exp(-params * means)) * exp(params * x)

So I think the problem is that it's not obvious where the factor is being introduced. We either introduce it in the calculation of the baseline hazard and then avoid it in the predict functions, or vice versa.

bofeng2018 commented 8 months ago

Okay, here's an (updating) summary of what I know. This may change:

the baseline hazard (and cumulative hazard) can be computed at the mean values, or at all zeros. We, don't ask why yet, defaulted to the latter. This is the same as R's basehaz(..., centered=FALSE) which is what we test against¹.

From @bofeng2018 notebook, we can compute the factor the cumulative baseline hazard is "off": exp(-healthScoreCoef * 0.5) (the 0.5 is the mean of the single covariate). This makes the curves line up:

generally, the factor is exp(-self.params dot self._norm_means)

I'm putting "off" in quotations, since it's partially a decision for the user (that I made for them)

Need to investigate the predict functions.

Thinking about why centered=FALSE was chosen:

the current method has the property that the baseline hazard is indp. of shifts in the covariates. That is, if we shift a covariate by 100, the baseline hazard doesn't change. This is desirable: all else being equal, if I change my measurement stick to start at 100, this shouldn't effect survival. Unless I start pairing it with covariates - then I need to handle the 100 somewhere. (Note that shifting by 100 never effects the coefs).

the proposed method doesn't have this property. If I add 10 to the dat, ex: dat['Health Score'] += 10, then the proposed result looks skewed, whereas the current method's result is the same:

Suffice to say, this is confusing for us, so it needs to be made clear for users.

¹ although our tests say centered=TRUE, which tripped me up.

The weird second graph is probably caused by super low mortality rates. Instead of dat['Health Score'] += 10, can you try dat['Health Score'] -= 10? The blue data points were based on classical textbook methods. Unless the implementation was wrong, they should be very consistent with theoretical results with large sample sizes.

bofeng2018 commented 8 months ago

Some more thoughts:

this exp(-params * means) factor is introduced in the predict functions. So when you call predict_survival_function(x), a exp(-params * means) is introduced. At a high level:
predict_survival_function(x) = exp(-CBH(t) * exp(params * (x - means))
                             = exp(-CBH(t) * exp(-params * means) * exp(params * x)
                             = exp(- (CBH(t) * exp(-params * means)) * exp(params * x)
So I think the problem is that it's not obvious where the factor is being introduced. We either introduce it in the calculation of the baseline hazard and then avoid it in the predict functions, or vice versa.

Yes, I'm confident that the problem comes from the data standardization procedure. To be more specific, covariate means were incorrectly shifted to the baseline hazard. As a simple way to confirm it, for data preprocessing, can you try to only convert the covariate variations to be 1 but not shift their locations? That applies to lines 1248-1250 in the coxph_fitter.py file, and the corresponding codes are shown below. I would be very curious to see how it changes the results.

        X_norm = pd.DataFrame(
            utils.normalize(X.values, self._norm_mean.values, self._norm_std.values), index=X.index, columns=X.columns
        )

CamDavidsonPilon commented 8 months ago

Here's my edits of your notebook, I made some changes to reduce the complexity and focus on a simpler example. Demonstration 2.ipynb.zip

To be more specific, covariate means were incorrectly shifted to the baseline hazard.

I don't think this is the correct interpretation. The covariate means have to go somewhere. Either they exist in the log-partial-hazard, or they exist in baseline hazard. We (implicitly) have them exist in the log-partial-hazard.

More explicitly:

to compute CBH(t), one uses exp(params * (x - means) in computations (as we do in lifelines)
to compute CBH'(t) := CBH(t) * exp(-params * means), one uses exp(params * x)

If I understand correctly, the differences we see are due to what we think the cumulative base hazard is: CBH'(t) or CBH(t). Answer is: it doesn't matter so long as one is explicit. If the data is demeaned before entering lifelines, then they are equal and there is no issue.

As a simple way to confirm it, for data preprocessing, can you try to only convert the covariate variations to be 1 but not shift their locations? ... I would be very curious to see how it changes the results.

Sorry, I'm not following: which results should we be looking at?

bofeng2018 commented 8 months ago

Here's my edits of your notebook, I made some changes to reduce the complexity and focus on a simpler example. Demonstration 2.ipynb.zip

To be more specific, covariate means were incorrectly shifted to the baseline hazard.

I don't think this is the correct interpretation. The covariate means have to go somewhere. Either they exist in the log-partial-hazard, or they exist in baseline hazard. We (implicitly) have them exist in the log-partial-hazard.

More explicitly:

to compute CBH(t), one uses exp(params * (x - means) in computations (as we do in lifelines)

to compute CBH'(t) := CBH(t) * exp(-params * means), one uses exp(params * x)

If I understand correctly, the differences we see are due to what we think the cumulative base hazard is: CBH'(t) or CBH(t). Answer is: it doesn't matter so long as one is explicit. If the data is demeaned before entering lifelines, then they are equal and there is no issue.

As a simple way to confirm it, for data preprocessing, can you try to only convert the covariate variations to be 1 but not shift their locations? ... I would be very curious to see how it changes the results.

Sorry, I'm not following: which results should we be looking at?

Yes, I agree we probably have disagreements on what would define a baseline. I guess I'm always following the most classical way that a Cox proportional hazard model would be formulated, which does not include steps to subtract sample means from each covariate. In this case, the baseline hazard would be calculated in the way I described.

However, I also totally agree that the most important thing is to get the final hazard estimations correctly. As long as covariate sample means are always subtracted from future data as in the case of the training data, the estimated hazard functions should be correct.

Overall, very inspiring discussions! Thanks for all the thought sharing!

CamDavidsonPilon commented 8 months ago

Overall, very inspiring discussions! Thanks for all the thought sharing!

💯

This is forcing me to re-evaluate these very important details. We'll make some changes to make this easier (at the very least: more transparent) for users.

CamDavidsonPilon / lifelines

Bugs of Incorrect Calculation of Baseline Hazard & Baseline Cumulative Hazard #1475

Lizhou

Some others fail, too, but these are the more relevant ones as they compare against a R's survival library.