Closed shilet closed 11 months ago
Hi @shilet, are you able to provide an example of this failing? AFAIK, integers in Python shouldn't overflow.
Hi Cameron,
I was just using the NelsonAalenFitter(nelson_aalen_smoothing=False), and got the following error:
<lifelines.NelsonAalenFitter:"NA_estimate", fitted with 5.77905e+06 total observations, 0 right-censored observations> \lib\site-packages\lifelines\fitters\nelson_aalen_fitter.py:176: RuntimeWarning: invalid value encountered in sqrt df[ci_labels[0]] = cumhazard np.exp(-z np.sqrt(cumulativesq) / np.where(cumhazard == 0, 1, cumhazard))
The confidence interval for the cumulative hazard contains weird (way too large) numbers, and also NaN/Inf. When using a smaller dataset I didn't have this problem.
The error disappears after I changed this: in utils/init.py: population = events["at_risk"] - entrances population = population.astype(float)
So that's why I guess it is some sort of overflow issue (but my python knowledge is limited...)
Hopefully this helps
Best regards,
Sheila
On Sat, 30 Dec 2023 at 22:02, Cameron Davidson-Pilon < @.***> wrote:
Hi @shilet https://github.com/shilet, are you able to provide an example of this failing? AFAIK, integers in Python shouldn't overflow.
— Reply to this email directly, view it on GitHub https://github.com/CamDavidsonPilon/lifelines/issues/1585#issuecomment-1872606537, or unsubscribe https://github.com/notifications/unsubscribe-auth/AIK4VBJWXOOZZYEX4HU327LYMB6PDAVCNFSM6AAAAABAXXWRH6VHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMYTQNZSGYYDMNJTG4 . You are receiving this because you were mentioned.Message ID: @.***>
hm, I'm not able to reproduce it. Can you share the dataset (unlikely)? Can you tell me more about your dataset (any negative values, any late entry being used)?
Hi,
I can reproduce with the following:
y = np.random.randint(1, 1000, 100000000) naf = NelsonAalenFitter(nelson_aalen_smoothing=False) naf.fit(y ,event_observed=None, timeline=range(0, int(y.max())))
On Sun, 31 Dec 2023 at 18:49, Cameron Davidson-Pilon < @.***> wrote:
hm, I'm not able to reproduce it. Can you share the dataset (unlikely)? Can you tell me more about your dataset (any negative values, any late entry being used)?
— Reply to this email directly, view it on GitHub https://github.com/CamDavidsonPilon/lifelines/issues/1585#issuecomment-1873004915, or unsubscribe https://github.com/notifications/unsubscribe-auth/AIK4VBKDV3JQ2TEFBYJ3MWTYMGQSPAVCNFSM6AAAAABAXXWRH6VHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMYTQNZTGAYDIOJRGU . You are receiving this because you were mentioned.Message ID: @.***>
Thanks, I was able to reproduce it with your example. I believe the problem is integer overflow, as you predicted. It's correct that Python integers don't overflow, but when they are inside a Pandas series, they are int64
, which can overflow. The population**3
term in the calculation was creating a negative value where there shouldn't have been. I've fixed it by avoiding that large exponent:
def _variance_f_discrete(self, population, deaths):
return (population - deaths) * deaths / population ** 3
to
def _variance_f_discrete(self, population, deaths):
return (1 - deaths / population) * (deaths / population) * (1. / population)
will be fixed in the next release
Hi,
Happily you could reproduce it. Is there a reason you didn't repair it by converting population to float (population = population.astype(float))?
On Mon, 1 Jan 2024 at 22:15, Cameron Davidson-Pilon < @.***> wrote:
Thanks, I was able to reproduce it with your example. I believe the problem is integer overflow, as you predicted. It's correct that Python integers don't overflow, but when they are inside a Pandas series, they are int64, which can overflow. The population**3 term in the calculation was creating a negative value where there shouldn't have been. I've fixed it by avoiding that large exponent:
def _variance_f_discrete(self, population, deaths): return (population - deaths) * deaths / population ** 3
to
def _variance_f_discrete(self, population, deaths): return (1 - deaths / population) * (deaths / population) * (1. / population)
— Reply to this email directly, view it on GitHub https://github.com/CamDavidsonPilon/lifelines/issues/1585#issuecomment-1873482959, or unsubscribe https://github.com/notifications/unsubscribe-auth/AIK4VBKA6IBUG3QOD4YXF33YMMRPFAVCNFSM6AAAAABAXXWRH6VHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMYTQNZTGQ4DEOJVHE . You are receiving this because you were mentioned.Message ID: @.***>
floats (or float64, in pandas) have the same problem: they'll overflow eventually. However, they'll overflow at a higher number. So why not use float64? While the upperbound of a float64 is higher than a int64, a float64s precision is severely reduced for larger numbers, so your final values aren't precise and could look quite silly. The implemented solution avoids computing these large numbers, so we can keep the precision of int64s while avoiding overflows.
When determining the confidence interval for the Nelson Aalen Fitter, the following formula is used: def _variance_f_discrete(self, population, deaths): return (population - deaths) * deaths / population ** 3
The problem is that population (population = events["at_risk"] - entrances) is an integer, this should be changed to float to prevent overflow. Otherwise the resulting confidence interval is NaN.