ewdavis opened this issue 3 years ago
Example of a bad fit:
from scipy.optimize import minimize, curve_fit, least_squares, leastsq, fmin
data_proportion = 0.45
def ls_func(t, params):
    # Integrate the SIR ODEs and return the I compartment (column 1 of the odeint output).
    ret = odeint(deriv, y0, t, args=tuple(params))
    return ret[:,1]

def residual(params):
    # Residuals of observed daily cases against the model's I compartment over the training window.
    trainT = t[:int(len(t)*data_proportion)]
    return (cases[1:int(len(t)*data_proportion)] - ls_func(trainT[1:], params))
guess = [r, alpha]
tup = least_squares(residual, guess)#, bounds=(4, 7))
c = tup.x
#popt, pcov = curve_fit(deriv, cases, t)
ret = odeint(deriv, y0, t, args=tuple(c))
S, I, R = ret.T
fig = plt.figure(figsize=(24,12))
plt.plot(state['date'], state['daily_cases'])
plt.plot(t, I, label="Fitted SIR")
plt.xlabel("Time", fontsize=30)
plt.ylabel("Individuals", fontsize=30)
plt.xticks(fontsize=20)
plt.yticks(fontsize=20)
plt.legend(fontsize=30)
print(c, c[0]/c[1], tup.cost)
I spent a little time trying out the approach of keeping a fixed R0, with the (hopefully correct?) understanding that R0 = r / alpha:
from scipy.optimize import minimize, curve_fit, least_squares, leastsq, fmin
data_proportion = 0.45
R0 = 1.01
def ls_func(t, params):
    # r is tied to alpha via the fixed R0 (read from the enclosing scope): r = alpha * R0.
    ret = odeint(deriv, y0, t, args=tuple( [ params[0] * R0, params[0] ] ))
    return ret[:,1]

def residual(params):
    trainT = t[:int(len(t)*data_proportion)]
    return (state['daily_cases'][1:int(len(t)*data_proportion)] - ls_func(trainT[1:], params))
guess = [0.5] # We provide alpha = 0.5 as the initial guess, and assume that r = alpha * R0.
tup = least_squares(residual, guess)#, bounds=(4, 7))
c = tup.x
#popt, pcov = curve_fit(deriv, cases, t)
ret = odeint(deriv, y0, t, args=tuple( [ c[0] * R0, c[0] ] ))
S, I, R = ret.T
fig = plt.figure(figsize=(24,12))
plt.plot(state['date'], state['daily_cases'])
plt.plot(t, I, label="Fitted SIR")
plt.xlabel("Time", fontsize=30)
plt.ylabel("Individuals", fontsize=30)
plt.xticks(fontsize=20)
plt.yticks(fontsize=20)
plt.legend(fontsize=30)
print(c[0] * R0, c[0], tup.cost)
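As a quick check on the R0 = r / alpha identification (stated here for reference, using the deriv equations shown in the full listing further down): early in the outbreak, when S is close to N,

dI/dt = r * S * I / N - alpha * I  ≈  (r - alpha) * I

so infections grow exactly when r / alpha > 1, which matches the usual definition of the basic reproduction number for this parameterization of the SIR model.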
Unfortunately, so far, it seems that the results are quite sensitive to the choice of R0: with R0 = 1.01, we are significantly underestimating the number of infections; with R0 = 1.03, we are already overestimating. (From what I could tell from a quick web search, real-world R0 values might be significantly higher.)
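One cheap way to quantify this sensitivity would be a small sweep over candidate R0 values, refitting each time and comparing the resulting costs (a sketch, reusing ls_func and residual from the snippet above; the candidate values below are arbitrary):

for candidate in [1.00, 1.01, 1.02, 1.03, 1.05]:
    R0 = candidate  # ls_func reads R0 from the enclosing scope
    fit = least_squares(residual, [0.5])
    print(candidate, fit.x[0] * candidate, fit.x[0], fit.cost)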
I also tried smoothing the data to produce weekly, rather than daily, cases, but this did not seem to help much.
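For reference, the smoothing could be done roughly like this (a sketch; whether to use a rolling mean or weekly totals, and the window size, are judgment calls):

# 7-day rolling mean of daily cases (window size is illustrative)
state['smoothed_cases'] = state['daily_cases'].rolling(7, min_periods=1).mean()
# or: aggregate to weekly totals, using the integer day index produced by convertDate
weekly_cases = state.groupby(state['date'] // 7)['daily_cases'].sum()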
I will take another look at this tomorrow; I wonder if it may be worth trying out a different dataset for this.
Another thought is that while we are fitting the number of individuals in the I compartment to the number of daily cases, the two quantities do not quite capture the same thing: the former corresponds to the total number of infections at a given time, while the latter is the number of new infections (if I am not mistaken). I tried to instead fit the number of cumulative cases to the number of individuals in the I and R compartments, by making the following changes:
- Replacing 'daily_cases' with 'cumsum_cases'.
- Shifting the cumulative counts so that at t = 0, the number of cases is 0, as follows: state['cumsum_cases'] = state['cumsum_cases'] - 27270
- Fitting I+R instead of I in ls_func, returning ret[:,1] + ret[:,2], rather than ret[:,1] (see the sketch after this list).
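Concretely, the modified functions look like this (same variables as in the full listing further down; only the returned quantity and the target series change):

def ls_func(t, params):
    # Fit the cumulative curve: I + R (columns 1 and 2 of the odeint output).
    ret = odeint(deriv, y0, t, args=tuple(params))
    return ret[:,1] + ret[:,2]

def residual(params):
    trainT = t[:int(len(t)*data_proportion)]
    # Compare against cumulative, rather than daily, cases.
    return (state['cumsum_cases'][1:int(len(t)*data_proportion)] - ls_func(trainT[1:], params))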
I found that in some cases, there seems to be an issue with least squares getting stuck in a local minimum and producing a bad fit (e.g., for data_proportion = 0.95). Changing the optimizer's initial guess to something like r = 10, alpha = 10 seemed to help with this. I also noticed that we sometimes see divide by zero encountered in double_scalars warnings (e.g., for data_proportion = 0.6), though a fit is still produced.
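A simple workaround for the local-minimum issue would be a multi-start over several initial guesses, keeping the fit with the lowest cost (a sketch; the guesses below are arbitrary):

best_fit = None
for guess in ([0.5, 0.5], [1.0, 1.0], [5.0, 5.0], [10.0, 10.0]):
    fit = least_squares(residual, guess)
    if best_fit is None or fit.cost < best_fit.cost:
        best_fit = fit
c = best_fit.x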
Here are example fits for data_proportion = 0.45 (for lower values, fits tend to be poor) and data_proportion = 0.55:
To provide baselines for comparison, we can plot "constant model" and "linear model" forecasts by adding the following code:
constant_forecasts = [ state['cumsum_cases'][ state.first_valid_index() + int(len(t)*data_proportion) - 1 ] ] * ( len(t) - int(len(t)*data_proportion) )
slope = state['cumsum_cases'][ state.first_valid_index() + int(len(t)*data_proportion) - 1 ] - state['cumsum_cases'][ state.first_valid_index() + int(len(t)*data_proportion) - 2 ]
linear_forecasts = [ state['cumsum_cases'][ state.first_valid_index() + int(len(t)*data_proportion) - 1 ] + slope * ( i + 1 ) for i in range( len(t) - int(len(t)*data_proportion) ) ]
plt.plot(t[int(len(t)*data_proportion):], constant_forecasts, label="Constant Baseline")
plt.plot(t[int(len(t)*data_proportion):], linear_forecasts, label="Linear Baseline")
For data_proportion = 0.45 and data_proportion = 0.55 (as before), this gives us:
I didn't get around to trying the trick with the curve width today, but did get enough data cleaned up and ready for us to try these techniques.
We should chat tomorrow or early next week about getting some of these functionalities ready for @m10f so they can be exposed.
Simulate will deliver the results for a fitted model, but we need some way to expose these baselines for skill assessment.
@galois-agrushin You've been working with a lot of the CDC data, any chance we can pull their predictions and align them to the data somehow?
@ewdavis I think that it should be possible to take the forecasts of some CDC models and add them to the GDA data, showing something along the lines of what is given here. From what I can tell so far, the CDC and GDA data for Georgia is not exactly the same, but is not too dissimilar.
Also, I tried out one more idea: we can fit I+R to cumulative cases (as described above), and then obtain the number of predicted daily cases as follows: forecasted_daily_cases = [ (I+R)[i+1] - (I+R)[i] for i in range( len(t) - 1 ) ]. This seems to produce more accurate forecasts, compared with fitting I directly to the number of daily cases. The full code is below:
import pandas as pd
import numpy as np
import re
import math
from scipy.integrate import odeint
import matplotlib.pyplot as plt
from scipy.optimize import minimize, curve_fit, least_squares, leastsq, fmin
from datetime import datetime
%matplotlib inline
plt.style.use('ggplot')
def convertDate(theDate):
    # Convert a YYYY-MM-DD date string to the number of days since 2020-05-01.
    date_format = "%Y-%m-%d"
    start = "2020-05-01"
    a = datetime.strptime(start, date_format)
    b = datetime.strptime(theDate, date_format)
    delta = b - a
    return int(delta.days)
gdaData = pd.read_csv('ctp_state_data.tsv',sep='\t')
state = gdaData[gdaData['state'] == "GA"]
state = state[['date', 'daily_cases', 'cumsum_cases']]
state['cumsum_cases'] = state['cumsum_cases'] - 27270
state['date'] = state['date'].apply(convertDate)
state = state[state['date'] <= 150]
S0 = 21480000   # initial susceptible population
I0 = 1          # initial infected individuals
R0 = 0          # initial recovered individuals (note: not the reproduction number used earlier)
r = 10.0        # initial guess for the transmission rate
alpha = 10.0    # initial guess for the recovery rate
def deriv(y, t, r, alpha):
    # Standard SIR right-hand side: r is the transmission rate, alpha the recovery rate.
    S, I, R = y
    N = S + I + R
    dSdt = -r * (S * I)/N
    dIdt = r * (S * I)/N - alpha * I
    dRdt = alpha * I
    return dSdt, dIdt, dRdt
y0 = S0, I0, R0
tmin = 0
tmax = 150
t = list(range(tmin, tmax))
from scipy.optimize import minimize, curve_fit, least_squares, leastsq, fmin
data_proportion = 0.45
def ls_func(t, params):
    # Fit the cumulative curve: I + R.
    ret = odeint(deriv, y0, t, args=tuple(params))
    return ret[:,1] + ret[:,2]

def residual(params):
    trainT = t[:int(len(t)*data_proportion)]
    return (state['cumsum_cases'][1:int(len(t)*data_proportion)] - ls_func(trainT[1:], params))
guess = [r, alpha]
tup = least_squares(residual, guess)#, bounds=(4, 7))
c = tup.x
#popt, pcov = curve_fit(deriv, cases, t)
ret = odeint(deriv, y0, t, args=tuple(c))
S, I, R = ret.T
forecasted_daily_cases = [ (I+R)[i+1] - (I+R)[i] for i in range( len(t) - 1 ) ]
constant_forecasts = [ state['daily_cases'][ state.first_valid_index() + int(len(t)*data_proportion) - 1 ] ] * ( len(t) - int(len(t)*data_proportion) )
slope = state['daily_cases'][ state.first_valid_index() + int(len(t)*data_proportion) - 1 ] - state['daily_cases'][ state.first_valid_index() + int(len(t)*data_proportion) - 2 ]
linear_forecasts = [ state['daily_cases'][ state.first_valid_index() + int(len(t)*data_proportion) - 1 ] + slope * ( i + 1 ) for i in range( len(t) - int(len(t)*data_proportion) ) ]
fig = plt.figure(figsize=(24,12))
plt.plot(state['date'], state['daily_cases'])
plt.plot(t[int(len(t)*data_proportion):], constant_forecasts, label="Constant Baseline")
# plt.plot(t[int(len(t)*data_proportion):], linear_forecasts, label="Linear Baseline")
plt.plot(t[1:], forecasted_daily_cases, label="Fitted SIR")
plt.xlabel("Time", fontsize=30)
plt.ylabel("Individuals", fontsize=30)
plt.xticks(fontsize=20)
plt.yticks(fontsize=20)
plt.legend(fontsize=30)
print(c, c[0]/c[1], tup.cost)
It gives us:
@ewdavis I think this might be something close to what we are looking for: https://towardsdatascience.com/bayesian-inference-and-differential-equations-cb646b50f148
This example uses Stan, which I'm not familiar with, to do the Bayesian inference and fitting. I think that with more time it should be possible to use Bayesian inference to estimate the ODE parameters, but probably not by tomorrow.
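Even without Stan, a very crude version of the idea can be sketched with the pieces we already have: evaluate a log-posterior (flat priors, Gaussian likelihood with an assumed noise scale) on a coarse grid over (r, alpha), reusing residual from the fitting code above. The noise scale sigma and the grid ranges below are made-up placeholders:

import numpy as np

sigma = 1000.0                        # assumed observation noise scale (placeholder)
r_grid = np.linspace(5.0, 15.0, 30)
alpha_grid = np.linspace(5.0, 15.0, 30)
log_post = np.array([[ -0.5 * np.sum((residual([r_, a_]) / sigma)**2)
                       for a_ in alpha_grid ] for r_ in r_grid ])
post = np.exp(log_post - log_post.max())
post /= post.sum()
# Posterior means as point estimates for (r, alpha).
print(np.sum(post * r_grid[:, None]), np.sum(post * alpha_grid[None, :]))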
I was able to incorporate forecasts made by the COVIDhub-ensemble model for Georgia, via the following code, which requires the four attached CSV files (with forecast data) to be placed in the working directory.
cdc_forecast_dates = []
cdc_forecasts = []
for forecast_horizon in range(1, 5):
    cdc_forecast = pd.read_csv( 'COVIDhub-ensemble_13_' + str(forecast_horizon) + '.csv' ) # 13 is the location code for Georgia
    cdc_forecast['date'] = cdc_forecast['date'].apply(convertDate)
    cdc_forecast['forecast_date'] = cdc_forecast['forecast_date'].apply(convertDate)
    # We use the forecast made at the most recent forecast_date that falls within our training data window (i.e., that is no greater than int(len(t)*data_proportion) - 1).
    cdc_forecast = cdc_forecast.loc[ cdc_forecast['forecast_date'] <= int(len(t)*data_proportion) - 1 ]
    if len( cdc_forecast ) > 0:
        cdc_forecast_dates.extend( range( cdc_forecast['date'][ len(cdc_forecast) - 1 ] - 6, cdc_forecast['date'][ len(cdc_forecast) - 1 ] + 1 ) ) # The forecast is made for the week that ends on the given date.
        cdc_forecasts.extend( [ cdc_forecast['forecast'][ len(cdc_forecast) - 1 ] / 7 ] * 7 ) # The forecast provides the number of cases for the week; we divide it evenly across the seven days.
plt.plot(cdc_forecast_dates, cdc_forecasts, label="COVIDhub-ensemble")
Attachments: COVIDhub-ensemble_13_1.csv, COVIDhub-ensemble_13_2.csv, COVIDhub-ensemble_13_3.csv, COVIDhub-ensemble_13_4.csv
When combining this with the code that I showed yesterday, we get the following, for data_proportion = 0.55:
Because COVIDhub-ensemble forecasts were not available early in the pandemic, they will not be shown if data_proportion is much lower than 0.55. Also, because these forecasts provide weekly, rather than daily, case counts, the forecast curve appears jagged and does not always align with the "Constant Baseline" curve.
If needed, we could incorporate forecasts from other models, and for other states, but because I am short on hours for the project, I will not be able to spend a large amount of time on this.
I tried varying the initial conditions (in particular, the population size S0 and the initial number of infected individuals I0) via a brute-force search, and found that I was getting better fits by using much larger population sizes. In particular, if we run the code that I posted earlier, but with S0 = 40000000 and I0 = 20, we get (for data_proportion = 0.55):
For Florida, we can use the same values of S0 and I0 as given above, and obtain:

Note that I replaced the line state['cumsum_cases'] = state['cumsum_cases'] - 27270 with state['cumsum_cases'] = state['cumsum_cases'] - state['cumsum_cases'][ state.first_valid_index() ], so that it is not specific to Georgia.
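For concreteness, a brute-force search of this sort could look something like the following (the grid values are illustrative, not the exact ones used; residual and the other variables are as in the full listing above):

best = None
for S0_candidate in [ 2e7, 3e7, 4e7, 5e7 ]:
    for I0_candidate in [ 1, 5, 10, 20, 50 ]:
        y0 = ( S0_candidate, I0_candidate, 0 )   # ls_func reads y0 from the enclosing scope
        fit = least_squares(residual, [r, alpha])
        if best is None or fit.cost < best[2]:
            best = ( S0_candidate, I0_candidate, fit.cost )
print(best)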
All of this is, of course, not particularly grounded in reality, since the population sizes S0 we use are much larger than the actual population, and the initial infected counts I0 are smaller than the actual counts. The fundamental issue, I think, is that the SIR model is not versatile enough to both capture the data and have realistic parameters, so it has to be hacked somewhat.
Also, thinking about it some more, I can see why we were (and still are) getting low R0 values: in our datasets, which capture an early spike in COVID cases (rather than the full evolution of a pandemic), only a small subset of the population becomes infected; this would only happen with a low R0, I think.
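One way to make this a bit more quantitative (a standard SIR result, stated here for reference, and only approximate since our data covers a partial wave): if z is the fraction of an initially fully susceptible population that is ever infected, the final-size relation

z = 1 - exp(-R0 * z)

has a small positive solution z only when R0 is just above 1, which is consistent with the low fitted values.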
I managed to prove out the GDA data sets available here: https://drive.google.com/drive/folders/1eYhyDxQgQfOsVzJ30njNOk7ba-EegiBB, working specifically with ctp_state_data.tsv. You can process the data with the following code:
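(This is essentially the same preprocessing as in the full listing earlier in this thread, shown here as a sketch; the Georgia-specific offset line is omitted since it came later.)

import pandas as pd
from datetime import datetime
import matplotlib.pyplot as plt

def convertDate(theDate):
    # Days since 2020-05-01.
    return (datetime.strptime(theDate, "%Y-%m-%d") - datetime.strptime("2020-05-01", "%Y-%m-%d")).days

gdaData = pd.read_csv('ctp_state_data.tsv', sep='\t')
state = gdaData[gdaData['state'] == "GA"]
state = state[['date', 'daily_cases', 'cumsum_cases']]
state['date'] = state['date'].apply(convertDate)
state = state[state['date'] <= 150]
plt.plot(state['date'], state['daily_cases'])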
Which should result in the following plot:
We can now attempt to naively "fit" the data with arbitrary parameters:
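A sketch of what that looks like, using the same deriv, y0, and t as in the full listing above, with r and alpha picked by hand (the values below are for illustration only):

r, alpha = 0.5, 0.25   # hand-picked, arbitrary parameters
ret = odeint(deriv, y0, t, args=(r, alpha))
S, I, R = ret.T
plt.plot(state['date'], state['daily_cases'])
plt.plot(t, I, label="SIR, hand-picked parameters")
plt.legend()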
Resulting in:
We can then attempt to fit the data (this is the least_squares fit shown at the top of this thread):
Primary problem: when you dial down data_proportion, the fit gets worse and overfits the initial transients of the curve. Idea: can we use R0 to bound the search space, based on prior infections/estimates?
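One way that idea could look (a sketch: reparameterize the fit as (alpha, R0) and pass bounds to least_squares; the names ls_func_R0 and residual_R0 are new here, and the R0 range of 1.0 to 3.0 is a placeholder for whatever prior estimate we trust):

def ls_func_R0(t, params):
    alpha_, R0_ = params
    # r is recovered from alpha and R0: r = alpha * R0.
    ret = odeint(deriv, y0, t, args=(alpha_ * R0_, alpha_))
    return ret[:,1] + ret[:,2]

def residual_R0(params):
    trainT = t[:int(len(t)*data_proportion)]
    return (state['cumsum_cases'][1:int(len(t)*data_proportion)] - ls_func_R0(trainT[1:], params))

# Bound alpha loosely and constrain R0 to an externally estimated range.
tup = least_squares(residual_R0, [1.0, 1.5], bounds=([0.01, 1.0], [20.0, 3.0]))
print(tup.x, tup.cost)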