Add Kaplan-Meier Estimator for survival curves

clorton commented 1 month ago

Useful for initializing predicted age at death.

Fixes #43 and #41

Should be reviewed after #64 and #71 are merged to cut down on apparent changes.

jonathanhhb commented 1 month ago

Would it make sense to add an optional ctor which takes a file/path such that the csv load and read is done for you?

jonathanhhb commented 1 month ago

  File "/home/jbloedow/PLAY/LASER/laser-chris-nnmm/src/laser_core/demographics/kmestimator.py", line 44, in predict_year_of_death
    year_of_death = _pyod(ages_years, self.cumulative_deaths, max_year)
  File "/var/opt/idm/venv_laser/lib/python3.10/site-packages/numba/core/dispatcher.py", line 658, in _explain_matching_error
    raise TypeError(msg)
TypeError: No matching definition for argument type(s) array(float64, 1d, C), array(float64, 1d, C), uint32

from

# Ages of individuals in years (example data)
ages_years = np.array([40, 50, 60], dtype=np.float64)  # Ensure ages_years is float64

# Convert max_year to uint32
max_year = np.uint32(100)

# Predict the year of death for individuals
predicted_year_of_death = estimator.predict_year_of_death(ages_years, max_year=max_year)

Can the code be more robust to the datatypes it's given?

clorton commented 1 month ago

Would it make sense to add an optional ctor which takes a file/path such that the csv load and read is done for you?

The current CSV function is so specific to a particular format that I don't want to tie it to the KME ctor.

jonathanhhb commented 1 month ago

Predicted Year of Death for 1M individuals took: 0.008349895477294922 seconds
Predicted Age at Death (in days) for 1M individuals took: 0.018344402313232422 seconds

Which seems good.

jonathanhhb commented 1 month ago

Would it make sense to add an optional ctor which takes a file/path such that the csv load and read is done for you?

The current CSV function is so specific to a particular format that I don't want to tie it to the KME ctor.

It's a two-column csv where the first column is age in years and second column is cumulative deaths. I even asked GPT for the least surprising csv fileformat for an input file to this class and this was is, with header-line optional. I think it should have built-in support for reading in a file like this. :)

clorton commented 1 month ago

Would it make sense to add an optional ctor which takes a file/path such that the csv load and read is done for you?

The current CSV function is so specific to a particular format that I don't want to tie it to the KME ctor.

It's a two-column csv where the first column is age in years and second column is cumulative deaths. I even asked GPT for the least surprising csv fileformat for an input file to this class and this was is, with header-line optional. I think it should have built-in support for reading in a file like this. :)

Apologies, I was thinking about the population pyramid CSV with header, min-max, and final bucket particulars. Yes, we can have an optional parameter for a CSV filename/path.

jonathanhhb commented 1 month ago

Would it make sense to add an optional ctor which takes a file/path such that the csv load and read is done for you?

The current CSV function is so specific to a particular format that I don't want to tie it to the KME ctor.

It's a two-column csv where the first column is age in years and second column is cumulative deaths. I even asked GPT for the least surprising csv fileformat for an input file to this class and this was is, with header-line optional. I think it should have built-in support for reading in a file like this. :)

Apologies, I was thinking about the population pyramid CSV with header, min-max, and final bucket particulars. Yes, we can have an optional parameter for a CSV filename/path.

You're making me want to go back and push for a truly built-in load-from-file in the AliasedDistribution/pyramid code. :) We've already got the load_ function provided as a utility so seems like not too much of a stretch to use that implicitly if the file provided matches. We could modify the fileformat to be a bit more standard if we don't like the hyphen separators for age ranges.

clorton commented 1 month ago

Re. population pyramid file format, how about this regularized schema:

AgeMin,AgeMax,Males,Females
min0,max0,males0,females0
min1,max1,males1,females1
...
minN,maxN,malesN,femalesN

Actual header text, probably not important, I would support a hasheader argument which could be False to indicate a "naked" CSV. Each bin would be [min,max] - closed interval, i.e., max is a valid year/age for someone in that bin. Would we enforce maxN==minN?

Do you want to open another issue/ticket for supporting this (since it isn't germane to the Kaplan-Meier Estimator)?

jonathanhhb commented 1 month ago

Thanks for accommodating my obsession with input files! :) Just trying one more thing and then will sign off on this.

jonathanhhb commented 1 month ago

Good enough for me.

InstituteforDiseaseModeling / laser

Add Kaplan-Meier Estimator for survival curves #72