Generate random numbers from unknown distribution given mean and CI

apascualgarcia commented 4 years ago

@JudithBouman2412 and @ecam85

The parameters that these guys are gathering seem to be non-normal. Do you know if there is a statistical distribution typically used for these parameters? Alternatively, is there any nonparametric computational procedure to generate them?

They provide mean and CI only.

JudithBouman2412 commented 4 years ago

What distribution do they have? As far as I know, they are usually exponential.

apascualgarcia commented 4 years ago

It is unknown as far as I know. @jordan-klein and @Jennifer-Villers, could you please clarify the statistical distribution of the parameters you are gathering?

It doesn't look like an exponential because the CIs are left-tailed around the mean. It may be a beta distribution. Otherwise we will need to fit a skew-normal distribution with a negative skewness parameter.

jordan-klein commented 4 years ago

The proportion asymptomatic follows a binomial distribution with exact Clopper–Pearson confidence intervals.
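For reference, a minimal sketch of how an exact Clopper–Pearson interval can be computed with the standard library only. The function names and the counts in the usage line are hypothetical and are not taken from the studies being gathered; the bounds are found by bisection on the binomial CDF rather than through a beta quantile function.

```python
import math

def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(math.comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k + 1))

def clopper_pearson(k, n, alpha=0.05):
    """Exact two-sided (1 - alpha) interval for k successes out of n trials."""
    def solve(f, target):
        # f is decreasing in p on [0, 1]; bisect for f(p) = target.
        lo, hi = 0.0, 1.0
        for _ in range(100):
            mid = (lo + hi) / 2
            if f(mid) > target:
                lo = mid
            else:
                hi = mid
        return (lo + hi) / 2

    # Lower bound: P(X >= k | p) = alpha/2, i.e. P(X <= k-1 | p) = 1 - alpha/2.
    lower = 0.0 if k == 0 else solve(lambda p: binom_cdf(k - 1, n, p), 1 - alpha / 2)
    # Upper bound: P(X <= k | p) = alpha/2.
    upper = 1.0 if k == n else solve(lambda p: binom_cdf(k, n, p), alpha / 2)
    return lower, upper

# Hypothetical counts, for illustration only.
low, high = clopper_pearson(k=18, n=100)
```

Given only a reported proportion and its interval, the underlying counts would still have to be recovered from the source study before a binomial draw can be made.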

jordan-klein commented 4 years ago

Hospitalization rate is a known parameter, can we treat it as fixed in the model without an associated probability distribution?

Jennifer-Villers commented 4 years ago

The duration of the incubation period is a log-normal distribution.
The duration of the latent period is calculated as the difference between the duration of the incubation period (log-normal) and the duration of the pre-symptomatic period (described below).
The duration of the pre-symptomatic period is assessed as follows (I copied the paragraph from the paper):

"We fitted a gamma distribution to the transmission pairs data to estimate the serial interval distribution. We used a published estimate of the incubation period distribution to infer infectiousness with respect to symptom onset from the first 425 patients with COVID-19 in Wuhan with detailed exposure history1. We considered that infected cases would become infectious at a certain time point before or after illness onset (tS1). Infectiousness—that is, transmission probability to a secondary case—would then increase until reaching its peak (Fig. 1). The transmission event would occur at time tI with a probability described by the infectiousness profile βc(tI − tS1) relative to the illness onset date, assuming a gamma distribution β(t) with a time shift c to allow for start of infectiousness c days prior to symptom onset; that is, βc(t) = β(t + c). The secondary case would then show symptoms at time tS2, after the incubation period that is assumed to follow a lognormal distribution g(tS2 − tI). Hence the observed serial intervals distribution f(tS2 − tS1) would be the convolution between the infectiousness profile and incubation period distribution. We constructed a likelihood function based on the convolution, which was fitted to the observed serial intervals, allowing for the start of infectiousness around symptom onset and window of symptom onset (tS1l, tS1u), given by

$$L(t_{S1u}, t_{S1l}, t_{S2} \mid \theta) = \int_{t_{S1l}}^{t_{S1u}} \int_{-\infty}^{t_{S2}} \beta_c(t_I - t_{S1})\, g(t_{S2} - t_I)\, dt_I\, dt_{S1}$$

Parameters θ, including the gamma distribution parameters and the start of infectiousness, were estimated using maximum likelihood. The 95% CIs were obtained by bootstrapping with 1,000 replications. We also performed sensitivity analyses by fixing the start of infectiousness from day 1 to 7 before symptom onset and inferred the infectiousness profile."

Reference: https://www.nature.com/articles/s41591-020-0869-5#citeas
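To make the convolution in the quoted paragraph concrete, here is a small forward simulation: a serial interval is the sum of a shifted-gamma time from symptom onset of the infector to transmission and a lognormal incubation period of the secondary case. This is only a sketch of the generative model, not the maximum-likelihood fit described in the paper, and every parameter value below is a hypothetical placeholder rather than a published estimate.

```python
import random

def sample_serial_intervals(n, shape, scale, shift_c, mu, sigma, seed=0):
    """Draw n serial intervals as (gamma - c) + lognormal."""
    rng = random.Random(seed)
    intervals = []
    for _ in range(n):
        # Time from infector's symptom onset to transmission: the density
        # beta_c(t) = beta(t + c) corresponds to a gamma draw minus c.
        onset_to_transmission = rng.gammavariate(shape, scale) - shift_c
        # Incubation period of the secondary case.
        incubation = rng.lognormvariate(mu, sigma)
        intervals.append(onset_to_transmission + incubation)
    return intervals

# Hypothetical parameters, in days.
intervals = sample_serial_intervals(
    n=100_000, shape=2.0, scale=1.5, shift_c=2.5, mu=1.6, sigma=0.4
)
mean_interval = sum(intervals) / len(intervals)
```

Negative serial intervals are possible in this model whenever transmission happens well before the infector's symptom onset and the secondary case has a short incubation period.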

Jennifer-Villers commented 4 years ago

Regarding time from onset of symptoms to hospitalization, one of the studies reports a mean and standard deviation, suggesting a normal distribution; the other two studies report a median and interquartile range and do not provide information regarding the type of distribution.
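If a distribution has to be assumed for the studies that only report a median and interquartile range, one option is a log-normal, whose parameters follow directly from those quantiles. The sketch below shows that conversion; the log-normal choice is an assumption (the studies do not state it), and the function name and example numbers are hypothetical.

```python
import math
import random
from statistics import NormalDist

def lognormal_from_median_iqr(median, q1, q3):
    """Return (mu, sigma) of a log-normal with the given median and quartiles.

    Assumes the data are log-normal: ln(X) is normal with mean mu = ln(median),
    and the quartiles of ln(X) sit at mu -/+ z_0.75 * sigma.
    """
    z75 = NormalDist().inv_cdf(0.75)
    mu = math.log(median)
    sigma = (math.log(q3) - math.log(q1)) / (2 * z75)
    return mu, sigma

# Hypothetical summary statistics, in days.
mu, sigma = lognormal_from_median_iqr(median=4.0, q1=2.0, q3=7.0)
rng = random.Random(0)
draws = [rng.lognormvariate(mu, sigma) for _ in range(5)]
```

A quick consistency check on the assumption is whether ln(median) lies close to the midpoint of ln(q1) and ln(q3); a large gap suggests the log-normal shape is a poor match.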

apascualgarcia commented 4 years ago

> Hospitalization rate is a known parameter, can we treat it as fixed in the model without an associated probability distribution?

It may be different across countries but yes, let's do that. Could you please aggregate the data into our three classes, considering that our populations are 10 years older than their real age? So 10-year-olds get the hospitalization rate of 20-year-olds, 20-year-olds that of 30-year-olds, and so on. Note that this is also related to this ongoing open issue, so it is perhaps better to have a final decision first: https://github.com/crowdfightcovid19/req-550-Syria/issues/11

apascualgarcia commented 4 years ago

So I should implement a log-normal, gamma, binomial, normal and a missing distribution? (crying) Thanks guys, I will do my best!

Jennifer-Villers commented 4 years ago

> So I should implement a log-normal, gamma, binomial, normal and a missing distribution? (crying) Thanks guys, I will do my best!

@apascualgarcia Would you prefer if I search for other references for the parameters?

apascualgarcia commented 4 years ago

Hi @Jennifer-Villers thanks for asking.

The main problem comes from the presymptomatic compartment, for which I guess it is difficult to find data. It is relatively easy to find the parameters needed to generate random values from the lognormal and gamma distributions that model the incubation periods and serial intervals they provide in the paper, respectively. The tricky part is that they then perform a convolution to estimate the mean and CI of the presymptomatic period. We want to generate random numbers, but this distribution is unknown and doesn't fit any typical distribution (note that it has a long left tail, which is not found very often).

Generating values for incubation periods and serial intervals and then subtracting them will not work, because serial intervals will often be larger than incubation periods, which would generate negative presymptomatic periods. Actually, the funny thing is that they provide a mean for the serial intervals larger than the incubation periods, which is a scenario opposite to the one that justifies having a presymptomatic compartment.
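The problem with subtracting independent draws can be checked numerically. The sketch below samples incubation periods and serial intervals independently and counts how often the difference is negative; the parameters are hypothetical placeholders chosen so that the serial-interval mean exceeds the incubation mean, as described above, and are not the values from the paper.

```python
import random

def fraction_negative(n, inc_mu, inc_sigma, si_shape, si_scale, seed=0):
    """Share of independent (incubation - serial interval) draws below zero."""
    rng = random.Random(seed)
    negatives = 0
    for _ in range(n):
        incubation = rng.lognormvariate(inc_mu, inc_sigma)  # log-normal, days
        serial = rng.gammavariate(si_shape, si_scale)       # gamma, days
        if incubation - serial < 0:
            negatives += 1
    return negatives / n

# Hypothetical parameters: incubation mean about 5.4 days, serial mean 5.8 days.
frac = fraction_negative(100_000, inc_mu=1.6, inc_sigma=0.4,
                         si_shape=2.0, si_scale=2.9)
```

With means this close, a large share of the differences comes out negative, which is why independent sampling cannot be used directly for the presymptomatic period.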

Jennifer-Villers commented 4 years ago

Thank you, Alberto, for sharing that with me.

"Actually, the funny thing is that they provide a mean for the serial intervals larger than the incubation periods, which is a scenario opposite to the one that justifies having a presymptomatic compartment."

I agree but they also provide a median for the serial intervals that is similar to the incubation period, which means that at least half of the people transmitted the disease before or when they started showing symptoms, which is not a small number and I believe justifies the utilization of a pre-symptomatic compartment. When I look at the distribution, it seems that most people transmit the disease right before showing symptoms but a few of them transmit it much later, which would explain the difference between the mean and the median. However, I have no idea what that distribution could be and how to use it for our purpose. I wish I could be more useful...

Please let me know if you see something that I could do. I will have a closer look at references used in other modeling papers to see if there are other ways to calculate the duration of the latent and pre-symptomatic periods.

apascualgarcia commented 4 years ago

> which is not a small number and I believe justifies the utilization of a pre-symptomatic compartment. When I look at the distribution, it seems that most people transmit the disease right before showing symptoms but a few of them transmit it much later

Yes, I agree that is relevant. I just wanted to stress that knowing the distributions independently is not helpful, because their overall properties go in a different direction from what we want. We would need to build something like a conditional distribution (if the incubation is longer, the serial interval is shorter, and the other way around).

Let me know if you find something else in the literature, thanks

Jennifer-Villers commented 4 years ago

@apascualgarcia Hi Alberto, I have not yet found other publications that specifically try to estimate the duration of the pre-symptomatic period. I found one publication that assesses the percentage of pre-symptomatic transmission (Wellcome Open Research) and two publications that estimate the serial interval (Shenzhen and Japanese group). In the African modeling paper, they use gamma distributions for all their parameters, but I don't know on what basis they decided to use gamma distributions. They just cite the papers I've attached above.

Edit: In a study from the CDC analyzing 7 clusters of transmission, they could establish that in 4 of those clusters presymptomatic transmission exposure occurred 1–3 days before the source patient developed symptoms. This still does not provide a confidence interval but at least it is aligned with the numbers we are using.

Edit number 2: I've also found a meta-analysis paper on the subject. I haven't had the time to read it yet but it may contain more useful information.

apascualgarcia commented 4 years ago

Thanks @Jennifer-Villers for all the research! I've been looking for a distribution that matches the mean and CI provided in the Nature paper, and I've been able to generate a distribution with the following mean and CI (95%):

CIlow = 0.8, Mean = 2.3, CIhigh = 3.0 (Nature paper)
CIlow = 0.8, Mean = 2.16, CIhigh = 3.0 (proposed distribution)

The interesting thing about this distribution is that it is a Gompertz distribution, which is often used in demography. It would be interesting to gather more data and see whether it is indeed a good fit; I think it may open an avenue to interpret the reasons behind the variability of the presymptomatic period (not for this work though). Does it sound reasonable to you guys to use this? I will commit a figure in data/estimation_parameters if you want to have a look at what the distribution looks like for these parameters.
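As an illustration of how such a distribution can be obtained, the sketch below pins the 2.5% and 97.5% quantiles of a Gompertz distribution to the reported interval and then samples by inverse CDF. It assumes the parameterization F(x) = 1 - exp(-eta * (exp(b*x) - 1)); the actual parameterization and fitting procedure used for the committed figure are not stated in this thread, so the function names and approach here are assumptions.

```python
import math
import random

def fit_gompertz_to_interval(lo, hi, level=0.95):
    """Find (b, eta) so that F(lo) and F(hi) are the tails of a `level` interval."""
    a_lo = -math.log(1 - (1 - level) / 2)  # eta * (exp(b*lo) - 1) must equal this
    a_hi = -math.log(1 - (1 + level) / 2)  # eta * (exp(b*hi) - 1) must equal this
    target = a_hi / a_lo
    if target <= hi / lo:
        raise ValueError("interval too wide for this Gompertz parameterization")

    def ratio(b):
        # Increasing in b, tends to hi/lo as b -> 0.
        return math.expm1(b * hi) / math.expm1(b * lo)

    b_lo, b_hi = 1e-9, 1.0
    while ratio(b_hi) < target:  # expand the bracket until it contains the root
        b_hi *= 2
    for _ in range(200):         # bisection on b
        mid = (b_lo + b_hi) / 2
        if ratio(mid) < target:
            b_lo = mid
        else:
            b_hi = mid
    b = (b_lo + b_hi) / 2
    eta = a_lo / math.expm1(b * lo)
    return b, eta

def gompertz_quantile(u, b, eta):
    """Inverse CDF: solve 1 - exp(-eta * (exp(b*x) - 1)) = u for x."""
    return math.log1p(-math.log1p(-u) / eta) / b

def sample_gompertz(n, b, eta, seed=0):
    rng = random.Random(seed)
    return [gompertz_quantile(rng.random(), b, eta) for _ in range(n)]

b, eta = fit_gompertz_to_interval(0.8, 3.0)
samples = sample_gompertz(50_000, b, eta)
sample_mean = sum(samples) / len(samples)
```

Only the two interval bounds are matched here, so the resulting mean is an outcome of the fit and needs to be compared against the reported mean separately.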

jordan-klein commented 4 years ago

@apascualgarcia I think a Gompertz distribution makes sense for this context if we think about the age pattern of mortality being analogous to the time pattern (in days) of becoming infectious. Please commit your figure if you've made one so I can take a look.

apascualgarcia commented 4 years ago

@jordan-klein commit done!

jordan-klein commented 4 years ago

@apascualgarcia looks good!

apascualgarcia commented 4 years ago

@Jennifer-Villers and @jordan-klein I generated incubation and presymptomatic times, and there is some chance of generating exposed times smaller than 0 (although very small, p < 10e-04). However, there should be a minimum incubation time Tmin before a person becomes infectious and enters the presymptomatic compartment (in other words, a person should spend a minimum time Tmin in the E compartment to develop the disease). Any idea what time would be reasonable? I imagine that the life cycle of the virus is well understood, at least in cell cultures. What I am considering is to set any generated E times below Tmin to Tmin. What do you think?

Jennifer-Villers commented 4 years ago

> @Jennifer-Villers and @jordan-klein I generated incubation and presymptomatic times, and there is some chance of generating exposed times smaller than 0 (although very small, p < 10e-04). However, there should be a minimum incubation time Tmin before a person becomes infectious and enters the presymptomatic compartment (in other words, a person should spend a minimum time Tmin in the E compartment to develop the disease). Any idea what time would be reasonable? I imagine that the life cycle of the virus is well understood, at least in cell cultures. What I am considering is to set any generated E times below Tmin to Tmin. What do you think?

Hi Alberto, Thanks for your question. According to the CDC, it takes less than 24h for the virus to replicate (Ref). I could not find more precise data on SARS-CoV-2. I could find two papers on SARS-CoV (from 2003): one says that the first replication cycle takes 7h and the other paper says 24h. Having a closer look at the chart from the CDC, there seem to be viral progeny in the cytoplasm already after 12 hours. I don't know whether those levels are enough to be infectious.

Based on the CDC data, I would be tempted to go with either 12 hours or 24 hours, but definitely no more than that (as what we are looking for is a lower bound).

apascualgarcia commented 4 years ago

@Jennifer-Villers and @jordan-klein Thanks Jennifer, what about generating for those another random number from a normal distribution with mean 16 h and 99% CI [8, 24] h?

Jennifer-Villers commented 4 years ago

> @Jennifer-Villers and @jordan-klein Thanks Jennifer, what about generating for those another random number from a normal distribution with mean 16 h and 99% CI [8, 24] h?

Sounds good to me, as long as it doesn't add unnecessary complexity.
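A minimal sketch of the replacement rule discussed above: exposed times are computed as incubation minus presymptomatic time, and any value below a threshold is redrawn from a normal distribution with mean 16 h and a 99% CI of 8 to 24 h. Several details are assumptions not settled in this thread: that times are stored in days, the 12 h default threshold, the function name, and the clamp at zero that guards against an extreme negative draw.

```python
import random
from statistics import NormalDist

def exposed_times(incubation, presymptomatic, t_min_days=0.5, seed=0):
    """Exposed (E) durations in days, with too-short values redrawn."""
    rng = random.Random(seed)
    z = NormalDist().inv_cdf(0.995)   # two-sided 99% interval
    mean_h, half_width_h = 16.0, 8.0  # mean 16 h, 99% CI [8, 24] h
    sigma_h = half_width_h / z
    result = []
    for inc, pre in zip(incubation, presymptomatic):
        exposed = inc - pre
        if exposed < t_min_days:      # threshold value is an assumption
            # Redraw in hours, clamp at zero (assumption), convert to days.
            exposed = max(rng.gauss(mean_h, sigma_h), 0.0) / 24.0
        result.append(exposed)
    return result

# Hypothetical inputs in days: the second and third pairs fall below the threshold.
times = exposed_times([5.0, 1.0, 0.2], [2.0, 1.5, 0.1])
```

This adds a single extra draw per affected value, so the added complexity stays small.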

crowdfightcovid19 / req-550-Syria

Generate random numbers from unknown distribution given mean and CI #10