Lifetable calculations - Githubissues

luytax commented 5 months ago

Goal: find out why there is a ~3% difference in deaths between AirQ+ and BEST-COST lifetable calculations Answer: In the BEST-COST calculation (and also in GeLuft and ARE calculations), when moving from the year of analysis (YOA) to the YOA+1 in the lifetable the column gets shifted one row down. I.e. an empty field is introduced at age=0, and the "lowest field" (age 99) falls out of the lifetable and is not considered. The premature deaths of this "lowest" field makes up around 3% of total premature deaths (a mismatch of -0.23%) remains between AirQ+ and BEST-COST results, which are slightly lower. AirQ+ assumes that every year newborns enter the lifetable cohort: the newborns fill the empty field we have in BEST-COST at age=0 in the YOA+1. The field highlighted in yellow is present in AirQ+ but not in BEST-COST calculations.

luytax commented 5 months ago

The input variables prob_total_death_male and prob_total_death_female are not used in the function attribute_deaths_lifetable_rr()

In the input data used in testing_Rpackage for checking attribute_deaths_lifetable_rr() the total and natural probabilities of dyning (prob_dying[["total"]][["female"]]$prob_dying and prob_dying[["natural"]][["female"]]$prob_dying) are exactly the same.

luytax commented 4 months ago

(@ungatoverde FYI and for documentation)

In the AirQ+ approach, two population matrixes are calculated: 1) In the modelled matrix (where air pollution is at the current level), the population is projected into the future using the observed death rates from the input lifetable 2) In the cutoff matrix (where air pollution is at the cutoff level), the population is projected into the future using modified death rates. The death rates (for the age groups that the RR is valid for) are modified with a factor, which is based on a re-scaled RR (see screenshot below). Then the modified survival rate is calculated using a different formula (see screenshot below).

The YLL are calculated as the difference in the mid-term populations of the modelled matrix and the cutoff matrix. Premature deaths are calculated as YLL * 2 = premature deaths.

Screenshot above from formulas 7) and 8A) from the AirQ+ lifetable manual for calculating modified hazard rate. The beta coefficient (with confidence intervals) is given in the AirQ+ output Excel: 0.0111541374732907 (CI: 0.00582689081239758 - 0.0164666621555233).

Screenshot above from formula 4. from the AirQ+ manual for calculating the modified survival probability using the modified hazard rate as input.

This approach, where the survival probabilities can be varied for specific age groups or specific years and also births can be considered in the future (default assumption: each year, as many babies are born as the input population aged 0), allows the user to design an infinite number of different scenarios. Some examples include: 1) year change in AP; no new births; all ages are affected by air pollution 2) permanent change in AP, no new births; only ages 20+ affected by air pollution 3) permanent change in AP, new births, effects of air pollution are phased in over 20 years, ages 80+ are affected twice as much as people under 80 4) increasing air pollution exposure (5% increase from year to year) 5) ...

ungatoverde commented 4 months ago

@luytax: Is your statement below still valid?

The input variables prob_total_death_male and prob_total_death_female are not used in the function attribute_deaths_lifetable_rr()

I would be very surprise of it because the intention was always to use the prob_total_death to calculate the projected person-years (pop_impact).

@ungatoverde: Yes, I just searched all package functions in the "R" folder with "Find in files..." for the variable prob_total_death_male, and all instances where it's found are only of the sort prob_total_death_male = prob_total_death_male, i.e. the variable is passed along the "attribute_" functions but never used/implemented in calculations. prob_total_death_male is used in the function get_prob_dying_by_single_age(), which we call in our testing_Rpackage.Rmd file, but which is not called in any of the attribute_ functions either.

Update: I moved this discussion to the issue #237

ungatoverde commented 4 months ago

The YLL are calculated as the difference in the mid-term populations of the modelled matrix and the cutoff matrix.

@luytax : Did you find a way to make this calculation in a single step as we currently do or do you think that this simplification was only possible because we used a more simple method than the AirQ+? In the last case, I am afraid that we should be consistent and use always the difference of matrices

@ungatoverde: The current calculation can be improved for sure, e.g. the two loops now (one each for projection of reference / counterfactual population) can be combined. We'll have to see how much further the calculations can be simplified. To be discussed in weekly meeting.

ungatoverde commented 4 months ago

Premature deaths are calculated as YLL * 2 = premature deaths

I do not understand this. Why *2?

@ungatoverde: When we substract the reference mid-year population matrix from the counterfactual mid-year population matrix the result is the years of lost due to air pollution at a given age in a given year. With the alpha coefficient set to 0.5 we assume that people die in the middle of the year. So to determine the number of YLL at a given age in a given year we have to multiply the number of premature deaths by 2, because each person that dies at a given age in a given year, has lived for half a year --> 0.5 YLL per premature-death at a specific age in a specific year. (The general equation would be YLL = premature deaths alpha. So if alpha = 0.25, then 4 premature deaths (at a given age and in a given year) would equal 1 YLL, because 1 = 0.25 4.

@luytax: OK. Great explanation. Maybe not a bad idea to add similar explanations in the R code. Lifetables are a hell! :-S

ungatoverde commented 4 months ago

The beta coefficient (with confidence intervals) is given in the AirQ+ output Excel: 0.0111541374732907 (CI: 0.00582689081239758 - 0.0164666621555233).

@luytax Where is this beta value coming from? I guess that this somehow related with the relative risk. I have seen this formula many times and in many publications, but I have to admit that I have never fully understood it...

@ungatoverde: The beta value can be found at the top of the huge Excel that AirQ+ gives as output of the lifetable calculations (under EVALUATION PARAMETERS, see screenshot) In the screenshot only the rounded beta value is shown, but when clicking on the cell we see the unrounded value. Von ChatGPT: "Researchers use regression models (e.g., linear regression, logistic regression, Cox proportional hazards models) to analyze the relationship between pollutant levels and health outcomes. [...] The beta value (β) is a coefficient obtained from these regression models. It represents the change in the log relative risk (RR) of the health outcome for a unit change in pollutant concentration. [...]" And then it gave me links to non-existing scientific publications : S

@luytax: Ok. So beta is not used to calculate the attributable health impact. Beta is some kind theoretical subproduct. We do not directly use beta. So the equation and the log-linearity is not relevant.

@ungatoverde: For the AirQ+ lifetable method we use beta to determine the modified survival probability in the counterfactual scenario. Consequently we have to use the equation in the code.

Update: I moved this conversation to an own issue #248

ungatoverde commented 4 months ago

This approach, where the survival probabilities can be varied for specific age groups or specific years and also births can be considered in the future (default assumption: each year, as many babies are born as the input population aged 0), allows the user to design an infinite number of different scenarios

Of course, it would be great to include this feature in our package. Two important things to consider:

@luytax How much time is needed to implement this feature? We you said that you could have implemented the AirQ+ method for lifetables at the begining of August, did you mean with or without this feature? @ungatoverde The "constant air pollution" features is already implemented. The "newborn babies" feature can be added easily. Implementing the more complex scenarios should be possible, but 1) more coding effort, 2) more input variables and 3) very hard/tedious for user to input such detailed scenarios.
@luytax How many additional arguments would be needed in attribute() and compare()? @ungatoverde Depends on which features we will implement. For the "constant air pollution" & the "newborn babies" features only few additional variables needed: 1) pollution_constant (TRUE/FALSE), 2) new_births (TRUE/FALSE). For additional features additional variables would be needed: e.g. for user-specified number of newborn babies each year a newborn_vector variable; e.g. for chaning air pollution a air_pollution_vector variable specifying air pollution levels for each year; e.g. for exposures which target specific age groups differently we'd need a rr_vector specifying the relative risk for each age group; ...

@luytax, my take: Let's go for low hanging fruits, i.e. constant pollution and newborns, leaving the other two features for future developments. I have created a new (permanent) issue for that. Thankful if you contribute to populate it with developments out of capacity in BEST-COST. #247

luytax commented 4 months ago

Implementing exposure distribution in AirQ+ lifetable calculations: How is it implemented in the GeLuft lifetable approach? Can the approach be used for the AirQ+ lifetable calculations? How to calculate the beta and modification_factor coefficients with exposure distributions as inputs?

ungatoverde commented 4 months ago

Implementing exposure distribution in AirQ+ lifetable calculations: How is it implemented in the GeLuft lifetable approach? Can the approach be used for the AirQ+ lifetable calculations?

@luytax , The exposure distribution is currently handled in our R package (non-AirQ+ method) in a row wise mode (tibble where each row refer to an iteration case). A PAF is calculated for each of the exposure categories (using average exposure in the category, population exposed in this category, ERF and cutoff). Then the PAF is summed. The only difference of exposure distribution with iteration cases is that PAF in exposure distribution has to be aggregated a bit before. I do not remember now exactly why... To be reviewed.

ungatoverde commented 4 months ago

How to calculate the beta and modification_factor coefficients with exposure distributions as inputs?

@luytax I would not calculate beta for exposure distribution but use the existing structure of non-AirQ+ functions in our package. We just have to convert the beta-approach into a PAF approach. See #248

luytax commented 4 months ago

As written in #248, we can circumvent the beta coefficient. This also solves the issue regarding the calculations using exposure distributions.

luytax commented 4 months ago

@ungatoverde Proposal for BEST-COST lifetable method:

Deaths (of the whole year) as input data (and not death probabilities) (AirQ+ approach)
Mid-year population data as (default) input (current state), but optionally allow start of year pop as input
No default assumption of ERF shape: modified death probabilities calculated (AirQ+ approach) using correctly re-scaled RR (GeLuft approach); modification_factor = 1 / rr_conc
Survival probability is applied to entry population (AirQ+ approach)
Pre-mature deaths and YLL calculated as difference between population of scenario A & B (AirQ+ approach)
Pre-mature deaths due to air pollution assigned to year of analysis (AirQ+ approach)
Population projection: give user options to choose total (option) or natural (default) death probability (AirQ+ & GeLuft approaches combined); for the option total deaths, allow vector or correction factor as input; if user only enters one vector of death probabilities, assume that natural = total deaths

best-cost / best-cost_WPs

Lifetable calculations #211