CamDavidsonPilon / lifelines

Survival analysis in Python
lifelines.readthedocs.org
MIT License
2.38k stars 560 forks source link

Overflow with large numbers in Nelson Aalen fitter #1585

Closed shilet closed 11 months ago

shilet commented 11 months ago

When determining the confidence interval for the Nelson Aalen Fitter, the following formula is used: def _variance_f_discrete(self, population, deaths): return (population - deaths) * deaths / population ** 3

The problem is that population (population = events["at_risk"] - entrances) is an integer, this should be changed to float to prevent overflow. Otherwise the resulting confidence interval is NaN.

CamDavidsonPilon commented 11 months ago

Hi @shilet, are you able to provide an example of this failing? AFAIK, integers in Python shouldn't overflow.

shilet commented 11 months ago

Hi Cameron,

I was just using the NelsonAalenFitter(nelson_aalen_smoothing=False), and got the following error:

<lifelines.NelsonAalenFitter:"NA_estimate", fitted with 5.77905e+06 total observations, 0 right-censored observations> \lib\site-packages\lifelines\fitters\nelson_aalen_fitter.py:176: RuntimeWarning: invalid value encountered in sqrt df[ci_labels[0]] = cumhazard np.exp(-z np.sqrt(cumulativesq) / np.where(cumhazard == 0, 1, cumhazard))

The confidence interval for the cumulative hazard contains weird (way too large) numbers, and also NaN/Inf. When using a smaller dataset I didn't have this problem.

The error disappears after I changed this: in utils/init.py: population = events["at_risk"] - entrances population = population.astype(float)

So that's why I guess it is some sort of overflow issue (but my python knowledge is limited...)

Hopefully this helps

Best regards,

Sheila

On Sat, 30 Dec 2023 at 22:02, Cameron Davidson-Pilon < @.***> wrote:

Hi @shilet https://github.com/shilet, are you able to provide an example of this failing? AFAIK, integers in Python shouldn't overflow.

— Reply to this email directly, view it on GitHub https://github.com/CamDavidsonPilon/lifelines/issues/1585#issuecomment-1872606537, or unsubscribe https://github.com/notifications/unsubscribe-auth/AIK4VBJWXOOZZYEX4HU327LYMB6PDAVCNFSM6AAAAABAXXWRH6VHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMYTQNZSGYYDMNJTG4 . You are receiving this because you were mentioned.Message ID: @.***>

CamDavidsonPilon commented 11 months ago

hm, I'm not able to reproduce it. Can you share the dataset (unlikely)? Can you tell me more about your dataset (any negative values, any late entry being used)?

shilet commented 11 months ago

Hi,

I can reproduce with the following:

y = np.random.randint(1, 1000, 100000000) naf = NelsonAalenFitter(nelson_aalen_smoothing=False) naf.fit(y ,event_observed=None, timeline=range(0, int(y.max())))

On Sun, 31 Dec 2023 at 18:49, Cameron Davidson-Pilon < @.***> wrote:

hm, I'm not able to reproduce it. Can you share the dataset (unlikely)? Can you tell me more about your dataset (any negative values, any late entry being used)?

— Reply to this email directly, view it on GitHub https://github.com/CamDavidsonPilon/lifelines/issues/1585#issuecomment-1873004915, or unsubscribe https://github.com/notifications/unsubscribe-auth/AIK4VBKDV3JQ2TEFBYJ3MWTYMGQSPAVCNFSM6AAAAABAXXWRH6VHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMYTQNZTGAYDIOJRGU . You are receiving this because you were mentioned.Message ID: @.***>

CamDavidsonPilon commented 11 months ago

Thanks, I was able to reproduce it with your example. I believe the problem is integer overflow, as you predicted. It's correct that Python integers don't overflow, but when they are inside a Pandas series, they are int64, which can overflow. The population**3 term in the calculation was creating a negative value where there shouldn't have been. I've fixed it by avoiding that large exponent:

    def _variance_f_discrete(self, population, deaths):
        return (population - deaths) * deaths / population ** 3

to

    def _variance_f_discrete(self, population, deaths):
        return (1 - deaths / population) * (deaths / population) * (1. / population)

will be fixed in the next release

shilet commented 11 months ago

Hi,

Happily you could reproduce it. Is there a reason you didn't repair it by converting population to float (population = population.astype(float))?

On Mon, 1 Jan 2024 at 22:15, Cameron Davidson-Pilon < @.***> wrote:

Thanks, I was able to reproduce it with your example. I believe the problem is integer overflow, as you predicted. It's correct that Python integers don't overflow, but when they are inside a Pandas series, they are int64, which can overflow. The population**3 term in the calculation was creating a negative value where there shouldn't have been. I've fixed it by avoiding that large exponent:

def _variance_f_discrete(self, population, deaths):
    return (population - deaths) * deaths / population ** 3

to

def _variance_f_discrete(self, population, deaths):
    return (1 - deaths / population) * (deaths / population) * (1. / population)

— Reply to this email directly, view it on GitHub https://github.com/CamDavidsonPilon/lifelines/issues/1585#issuecomment-1873482959, or unsubscribe https://github.com/notifications/unsubscribe-auth/AIK4VBKA6IBUG3QOD4YXF33YMMRPFAVCNFSM6AAAAABAXXWRH6VHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMYTQNZTGQ4DEOJVHE . You are receiving this because you were mentioned.Message ID: @.***>

CamDavidsonPilon commented 11 months ago

floats (or float64, in pandas) have the same problem: they'll overflow eventually. However, they'll overflow at a higher number. So why not use float64? While the upperbound of a float64 is higher than a int64, a float64s precision is severely reduced for larger numbers, so your final values aren't precise and could look quite silly. The implemented solution avoids computing these large numbers, so we can keep the precision of int64s while avoiding overflows.