CDCgov / Rt-without-renewal

https://cdcgov.github.io/Rt-without-renewal/
Apache License 2.0
21 stars 3 forks source link

Create a new submodule EpiDist/EpiDelay #386

Open seabbs opened 4 months ago

seabbs commented 4 months ago

We currently use delays in several places and could be doing so via a submodule. This could be called several things.

I see it being used to:

To start (i.e to resolve this issue) we would need to move our current Fixed delay functionality into a submodule. We would then need to make issues to implement our censoring_pmf approach as a in model fitted delay and to move our generation time model to be a model input.

seabbs commented 4 months ago

@SamuelBrand1 I am going to have a go at this soon. Any thoughts on naming?

SamuelBrand1 commented 4 months ago

I'm leaning towards EpiDistributions or possibly EpiAwareDistributions although I think that is maybe too anchored.

My reasoning is that what we are fundamentally doing in the three examples you quote is operating on distributions to create (in some sense) good discrete distributions for epidemiological modelling. Either because of enforcing discrete time dynamics or reasoning on observation censoring (or both).

I think more verbose and making an implicit link to Distributions.jl works in typical Julia module naming conventions.

seabbs commented 4 months ago

So I am expecting this to also hold all the nonparametric hazard components that epinowcast deals with so for that reason I think it might make sense to use a more general name?

SamuelBrand1 commented 4 months ago

That includes things that are discretization of things like Survival.jl which is from the JuliaStats org (as is Distributions.jl)... Maybe EpiStats?

More left field, the non-hazard models are really special cases of a hazard model so EpiHazards.jl?

seabbs commented 4 months ago

I feel like EpiHazard might lose people (and not everything would need a hazard approach). Maybe its two interacting submodels that have the same types (and so can be composed together...)?

SamuelBrand1 commented 4 months ago

I feel like EpiHazard might lose people (and not everything would need a hazard approach). Maybe its two interacting submodels that have the same types (and so can be composed together...)?

E.g. something like EpiDists and EpiSurvival?

seabbs commented 4 months ago

yes but they would both need to have the same abstract parent type so that they hook in to i.e latent delay in the same way. I think probably it would be EpiHazard though as EpiSurvival would include dists (i.e distributional survival modelling)

seabbs commented 4 months ago

Some proposed models (note these are really pseudo code) to go in this module in some form.

@model function censoring_delay_model(no_at_delay)
  # Parameters
    meanlog ~ Normal(0, 10)
    sdlog ~ truncated(Normal(0, 10), 0, Inf)

.   pmf = censored_pmf(LogNormal(meanlog, sdlog); D = length(no_at_delay))

.   for i in eachindex(pmf)
.       if no_at_delay[i] != 0
            @addlogprob! no_at_delay[i] * log(pmf[i])
        end
.   end
end
@model function censored_truncated_delay_model(no_at_delay_by_obs_time)
  # Parameters
    meanlog ~ Normal(0, 10)
    sdlog ~ truncated(Normal(0, 10), 0, Inf)

.   pmfs = map(no_at_delay_by_obs_time) do i
        censored_pmf(LogNormal(meanlog, sdlog); D = i.obs_time)
.   end
.   
.   for i in eachindex(pmfs)
        pmf = pmfs[i]
.       no_at_delay = no_at_delay_by_obs_time[i].no_at_delay

.       for j in eachindex(pmf)
.           if no_at_delay[j] != 0
                @addlogprob! no_at_delay[j] * log(pmf[j])
            end
.       end
    end
end

These both make use of our censored_pmf function (https://github.com/CDCgov/Rt-without-renewal/blob/main/EpiAware/src/EpiAwareUtils/censored_pmf.jl) which uses numerical integration to integrate over the first interval and creat a primary event censoring adjusted cdf and then uses differences of this cdf to create a double interval censored pmf.

These implementations are not generative but using switches could be made to be fairly easily

There are more efficient ways you could code this but as its a cohort approach it scales by D and unique obs time versus the number of observations and so should perform much better in high data settings. The daily event window limitation is really just a software engineering limitation (you need a higher dimensional grid). This is also currently hard coded to assume a uniform primary event window (the secondary event dist doesn't matter) but this could also be changed again with engineering vs any methodological work.

For interest @natemcintosh, @kgostic. I don't think there is anything limiting us from putting similar approaches in epinowcast/epidist fairly soon once we have the package infra locked down. if the code isn't very helpful can write some maths down to summarise this.

In stan land this is some useful context for what is going on: https://mc-stan.org/docs/stan-users-guide/efficiency-tuning.html#bernoulli-sufficient-statistics

Ideally we would do some evaluation for this model and I think this could be done by not me in a way that would be very useful from several angles.

SamuelBrand1 commented 3 months ago

The daily event window limitation is really just a software engineering limitation (you need a higher dimensional grid).

Just to add censored_pmf does already have this, and it by default (if no D variable is provided the length of the discretised distribution is chosen to be the smallest where 99% of the continuous distribution mass is covered.)

seabbs commented 3 months ago

The issue with this approach is you need to approximate obs_time (which is really a combination of some known time and the p event time using the lower bound of the p event time.

This means that for longer p event windows you will start to have small amounts of truncation bias. The solution is to include the obs time in the integral I think which would be worth exploring once this has been implemented but may also not really turn out to be needed.

seabbs commented 3 months ago

Just to add censored_pmf does already have this, and it by default (if no D variable is provided the length of the discretised distribution is chosen to be the smallest where 99% of the continuous distribution mass is covered.)

The issue I was referring to here is the lookup between your pmfs and the model. You could also encode this so that the pmf is calculated for each independent d cohort and then this wouldn't be a problem but you would have a slightly less efficient model as you could be solving each cdf up to twice (as two pmfs might share the same bound (i.e lower for one and upper for the other).

SamuelBrand1 commented 3 months ago

Noting this package https://github.com/epinowcast/primarycensoreddist

seabbs commented 3 months ago

We talked f2f to face and think that the julia approach to this is a Distributions.jl wrapper like Truncated