GaloisInc / ASKE-E


Assisted data fitting for 1,2,3,4 week prognostic generation #102

Open · ewdavis opened this issue 3 years ago

ewdavis commented 3 years ago

I managed to prove out the GDA data sets available here:

https://drive.google.com/drive/folders/1eYhyDxQgQfOsVzJ30njNOk7ba-EegiBB

working specifically with ctp_state_data.tsv.

You can process the data with the following code:

import pandas as pd
import numpy as np
import re
from scipy.integrate import odeint
import matplotlib.pyplot as plt
from scipy.optimize import minimize, curve_fit, least_squares, leastsq, fmin

from datetime import datetime

%matplotlib inline
plt.style.use('ggplot')

def convertDate(theDate):
    # Days elapsed since the start of the data window (2020-05-01).
    date_format = "%Y-%m-%d"
    start = "2020-05-01"
    a = datetime.strptime(start, date_format)
    b = datetime.strptime(theDate, date_format)
    delta = b - a
    return int(delta.days)

gdaData = pd.read_csv('ctp_state_data.tsv', sep='\t')
state = gdaData[gdaData['state'] == "GA"].copy()   # .copy() avoids SettingWithCopyWarning below
state = state[['date', 'daily_cases']]
state['date'] = state['date'].apply(convertDate)   # convert dates to day offsets from 2020-05-01
state = state[state['date'] <= 150]                # keep roughly the first 150 days

fig = plt.figure(figsize=(15,7))
plt.plot(state['date'], state['daily_cases'])

Which should result in the following plot:

[plot: daily cases for Georgia over the first 150 days of the window]

We can now attempt to naively "fit" the data with arbitrary parameters:

S0 = 21480000   # initial susceptible population
I0 = 1          # initial infected
R0 = 0          # initial recovered
r = 4.0         # transmission rate (arbitrary)
alpha = 4.0     # recovery rate (arbitrary)

# Standard SIR right-hand side.
def deriv(y, t, r, alpha):
    S, I, R = y
    N = S + I + R
    dSdt = -r * (S * I)/N
    dIdt = r * (S * I)/N - alpha * I
    dRdt = alpha * I
    return dSdt, dIdt, dRdt

y0 = S0, I0, R0

tmin = 0
tmax = 150
t = list(range(tmin, tmax))

ret = odeint(deriv, y0, t, args=(r, alpha))
S, I, R = ret.T

fig = plt.figure(figsize=(24,12))
plt.plot(state['date'], state['daily_cases'])
plt.plot(t, I, label="SIR Model - Random Params")
plt.xlabel("Time", fontsize=30)
plt.ylabel("Individuals", fontsize=30)
plt.xticks(fontsize=20)
plt.yticks(fontsize=20)
plt.legend(fontsize=30)

Resulting in:

image

We can then attempt to fit the data (the least_squares fitting code is in the comment below):

[plot: fitted SIR curve against GA daily cases]

Primary problem: when you dial down data_proportion (the fraction of the time series used for fitting), the fit gets worse and overfits the initial transient of the curve.

Idea: Can we use R0 to bound the search space based on prior infections/estimates?
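
One shape this idea could take (a sketch only: the bounds below are placeholders, not estimates, and data_proportion is the training fraction used in the fitting code further down) is to reparametrize the fit as (alpha, R0), with r = alpha * R0, so that a prior range on R0 becomes simple box bounds for least_squares:

def residual_r0(params):
    alpha_est, R0_est = params
    trainT = t[:int(len(t)*data_proportion)]
    pred = odeint(deriv, y0, trainT[1:], args=(alpha_est * R0_est, alpha_est))[:, 1]
    return state['daily_cases'].values[1:int(len(t)*data_proportion)] - pred

# Placeholder box bounds: 0 < alpha <= 10 and 1 <= R0 <= 3 (assumptions, not estimates).
tup = least_squares(residual_r0, x0=[0.5, 1.5], bounds=([0.0, 1.0], [10.0, 3.0]))
alpha_fit, R0_fit = tup.x
print(alpha_fit * R0_fit, alpha_fit, R0_fit, tup.cost)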

ewdavis commented 3 years ago

Example of a bad fit:

from scipy.optimize import minimize, curve_fit, least_squares, leastsq, fmin

data_proportion = 0.45   # fraction of the time series used for fitting

cases = state['daily_cases'].values   # assumed: this definition was not included in the snippet

def ls_func(t, params):
    ret = odeint(deriv, y0, t, args=tuple(params))
    return ret[:,1]   # infected compartment I

def residual(params):
    trainT = t[:int(len(t)*data_proportion)]
    return (cases[1:int(len(t)*data_proportion)] - ls_func(trainT[1:], params))

guess = [r, alpha]

tup = least_squares(residual, guess)#, bounds=(4, 7))

c = tup.x

#popt, pcov = curve_fit(deriv, cases, t)

ret = odeint(deriv, y0, t, args=tuple(c))
S, I, R = ret.T

fig = plt.figure(figsize=(24,12))
plt.plot(state['date'], state['daily_cases'])
plt.plot(t, I, label="Fitted SIR")
plt.xlabel("Time", fontsize=30)
plt.ylabel("Individuals", fontsize=30)
plt.xticks(fontsize=20)
plt.yticks(fontsize=20)
plt.legend(fontsize=30)
print(c, c[0]/c[1], tup.cost)   # fitted (r, alpha), implied R0 = r/alpha, and residual cost

[plot: example of a poor SIR fit against GA daily cases]

galois-agrushin commented 3 years ago

I spent a little time trying out the approach of keeping a fixed R0, with the (hopefully correct?) understanding that R0 = r / alpha:

from scipy.optimize import minimize, curve_fit, least_squares, leastsq, fmin

data_proportion = 0.45

R0 = 1.01   # basic reproduction number (shadows the earlier initial-recovered R0; y0 is already defined, so the initial conditions are unaffected)

def ls_func(t, params):
    ret = odeint(deriv, y0, t, args=tuple( [ params[0] * R0, params[0] ] ))
    return ret[:,1]

def residual(params):
    trainT = t[:int(len(t)*data_proportion)]
    return (state['daily_cases'][1:int(len(t)*data_proportion)] - ls_func(trainT[1:], params))

guess = [0.5] # We provide alpha = 0.5 as the initial guess, and assume that r = alpha * R0.

tup = least_squares(residual, guess)#, bounds=(4, 7))

c = tup.x

#popt, pcov = curve_fit(deriv, cases, t)

ret = odeint(deriv, y0, t, args=tuple( [ c[0] * R0, c[0] ] ))
S, I, R = ret.T

fig = plt.figure(figsize=(24,12))
plt.plot(state['date'], state['daily_cases'])
plt.plot(t, I, label="Fitted SIR")
plt.xlabel("Time", fontsize=30)
plt.ylabel("Individuals", fontsize=30)
plt.xticks(fontsize=20)
plt.yticks(fontsize=20)
plt.legend(fontsize=30)
print(c[0] * R0, c[0], tup.cost)

Unfortunately, so far it seems that the results are quite sensitive to the choice of R0: with R0 = 1.01 we significantly underestimate the number of infections, while with R0 = 1.03 we already overestimate it. (From what I could tell from a quick web search, real-world R0 values might be significantly higher.)
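
To make the sensitivity concrete, a quick sweep (a sketch only; the grid of R0 values is arbitrary) can reuse the residual above, which reads the module-level R0:

for R0 in [1.01, 1.02, 1.03, 1.05, 1.1, 1.5, 2.0]:   # arbitrary candidate values
    tup = least_squares(residual, [0.5])
    print(R0, tup.x[0] * R0, tup.x[0], tup.cost)      # R0, fitted r, fitted alpha, residual cost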

I also tried smoothing the data to produce weekly rather than daily cases, but this did not seem to help much.
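
The smoothing can be done with pandas rolling windows, for example (a sketch, not necessarily the exact transformation used):

weekly_cases = state['daily_cases'].rolling(7).sum()                   # weekly totals ending on each day
smoothed_daily = state['daily_cases'].rolling(7, center=True).mean()   # centered 7-day moving average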

I will take another look at this tomorrow; I wonder if it may be worth trying out a different dataset for this.

galois-agrushin commented 3 years ago

Another thought is that while we are fitting the number of individuals in the I compartment to the number of daily cases, the two quantities do not capture quite the same thing: the former is the number of currently infected individuals at a given time, while the latter is the number of new infections per day (if I am not mistaken). I instead tried to fit the summed I and R compartments to the number of cumulative cases, by making the following changes:
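
In outline, the changes are the following (a sketch matching the full code posted further down in this thread):

state = state[['date', 'daily_cases', 'cumsum_cases']]   # keep cumulative cases as well
state['cumsum_cases'] = state['cumsum_cases'] - 27270    # offset so the cumulative series starts near zero

def ls_func(t, params):
    ret = odeint(deriv, y0, t, args=tuple(params))
    return ret[:,1] + ret[:,2]                           # fit I + R rather than I alone

def residual(params):
    trainT = t[:int(len(t)*data_proportion)]
    return (state['cumsum_cases'][1:int(len(t)*data_proportion)] - ls_func(trainT[1:], params))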

I found that in some cases there seems to be an issue with least_squares getting stuck in a local minimum and producing a bad fit (e.g., for data_proportion = 0.95). Changing the initial parameter guess to something like r = 10, alpha = 10 seemed to help with this. I also noticed that we sometimes see divide by zero encountered in double_scalars warnings (e.g., for data_proportion = 0.6), though a fit is still produced. Here are example fits for data_proportion = 0.45 (for lower values, fits tend to be poor) and data_proportion = 0.55:

[plots: fits of I + R to cumulative cases for data_proportion = 0.45 and 0.55]

galois-agrushin commented 3 years ago

To provide baselines for comparison, we can plot "constant model" and "linear model" forecasts by adding the following code:

train_end = int(len(t)*data_proportion)   # index where the training window ends
last_obs = state['cumsum_cases'][ state.first_valid_index() + train_end - 1 ]   # last cumulative count in the training window
slope = last_obs - state['cumsum_cases'][ state.first_valid_index() + train_end - 2 ]
constant_forecasts = [ last_obs ] * ( len(t) - train_end )
linear_forecasts = [ last_obs + slope * ( i + 1 ) for i in range( len(t) - train_end ) ]
plt.plot(t[train_end:], constant_forecasts, label="Constant Baseline")
plt.plot(t[train_end:], linear_forecasts, label="Linear Baseline")

For data_proportion = 0.45 and data_proportion = 0.55 (as before), this gives us:

[plots: fits with constant and linear baseline forecasts for data_proportion = 0.45 and 0.55]

ewdavis commented 3 years ago

I didn't get around to trying the trick with the curve width today, but did get enough data cleaned up and ready for us to try these techniques.

We should chat tomorrow or early next week about getting some of these functionalities ready for @m10f so they can be exposed.

Simulate will deliver the results for a fitted model, but we need some way to expose these baselines for skill assessment.
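
One minimal way to express skill (a sketch only; skill_score is a hypothetical helper, not existing functionality) is to compare model error against a baseline over the held-out window:

def skill_score(observed, model_forecast, baseline_forecast):
    # 1 - (model MAE / baseline MAE) over the forecast window;
    # positive values mean the fitted model beats the baseline.
    observed = np.asarray(observed, dtype=float)
    mae_model = np.mean(np.abs(observed - np.asarray(model_forecast, dtype=float)))
    mae_baseline = np.mean(np.abs(observed - np.asarray(baseline_forecast, dtype=float)))
    return 1.0 - mae_model / mae_baseline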

@galois-agrushin You've been working with a lot of the CDC data; any chance we can pull their predictions and align them to the data somehow?

galois-agrushin commented 3 years ago

@ewdavis I think that it should be possible to take the forecasts of some CDC models and add them to the GDA data, showing something along the lines of what is given here. From what I can tell so far, the CDC and GDA data for Georgia are not exactly the same, but they are not too dissimilar.

Also, I tried out one more idea: we can fit I+R to cumulative cases (as described above), and then obtain the number of predicted daily cases as follows: forecasted_daily_cases = [ (I+R)[i+1] - (I+R)[i] for i in range( len(t) - 1 ) ]. This seems to produce more accurate forecasts, compared with fitting I directly to the number of daily cases. The full code is below:

import pandas as pd
import numpy as np
import re
import math
from scipy.integrate import odeint
import matplotlib.pyplot as plt
from scipy.optimize import minimize, curve_fit, least_squares, leastsq, fmin

from datetime import datetime

%matplotlib inline
plt.style.use('ggplot')

def convertDate(theDate):
    # Days elapsed since the start of the data window (2020-05-01).
    date_format = "%Y-%m-%d"
    start = "2020-05-01"
    a = datetime.strptime(start, date_format)
    b = datetime.strptime(theDate, date_format)
    delta = b - a
    return int(delta.days)

gdaData = pd.read_csv('ctp_state_data.tsv',sep='\t')
state = gdaData[gdaData['state'] == "GA"].copy()   # .copy() avoids SettingWithCopyWarning below
state = state[['date', 'daily_cases', 'cumsum_cases']]
state['cumsum_cases'] = state['cumsum_cases'] - 27270   # subtract GA's cumulative case count at the start of the window
state['date'] = state['date'].apply(convertDate)
state = state[state['date'] <= 150]
S0 = 21480000
I0 = 1
R0 = 0
r = 10.0
alpha = 10.0

def deriv(y, t, r, alpha):
    S, I, R = y
    N = S + I + R
    dSdt = -r * (S * I)/N
    dIdt = r * (S * I)/N - alpha * I
    dRdt = alpha * I
    return dSdt, dIdt, dRdt

y0 = S0, I0, R0

tmin = 0
tmax = 150
t = list(range(tmin, tmax))

data_proportion = 0.45

def ls_func(t, params):
    ret = odeint(deriv, y0, t, args=tuple(params))
    return ret[:,1] + ret[:,2]

def residual(params):
    trainT = t[:int(len(t)*data_proportion)]
    return (state['cumsum_cases'][1:int(len(t)*data_proportion)] - ls_func(trainT[1:], params))

guess = [r, alpha]

tup = least_squares(residual, guess)#, bounds=(4, 7))

c = tup.x

#popt, pcov = curve_fit(deriv, cases, t)

ret = odeint(deriv, y0, t, args=tuple(c))
S, I, R = ret.T

forecasted_daily_cases = [ (I+R)[i+1] - (I+R)[i] for i in range( len(t) - 1 ) ]

constant_forecasts = [ state['daily_cases'][ state.first_valid_index() + int(len(t)*data_proportion) - 1 ] ] * ( len(t) - int(len(t)*data_proportion) )
slope = state['daily_cases'][ state.first_valid_index() + int(len(t)*data_proportion) - 1 ] - state['daily_cases'][ state.first_valid_index() + int(len(t)*data_proportion) - 2 ]
linear_forecasts = [ state['daily_cases'][ state.first_valid_index() + int(len(t)*data_proportion) - 1 ] + slope * ( i + 1 ) for i in range(  len(t) - int(len(t)*data_proportion) ) ]

fig = plt.figure(figsize=(24,12))
plt.plot(state['date'], state['daily_cases'])
plt.plot(t[int(len(t)*data_proportion):], constant_forecasts, label="Constant Baseline")
# plt.plot(t[int(len(t)*data_proportion):], linear_forecasts, label="Linear Baseline")
plt.plot(t[1:], forecasted_daily_cases, label="Fitted SIR")
plt.xlabel("Time", fontsize=30)
plt.ylabel("Individuals", fontsize=30)
plt.xticks(fontsize=20)
plt.yticks(fontsize=20)
plt.legend(fontsize=30)
print(c, c[0]/c[1], tup.cost)

It gives us:

[plot: daily cases with the fitted SIR forecast of daily cases and the constant baseline]

ylee22 commented 3 years ago

@ewdavis I think this might be something close to what we are looking for: https://towardsdatascience.com/bayesian-inference-and-differential-equations-cb646b50f148

This example uses Stan, which I'm not familiar with, to do the Bayesian inference and fitting. I think that with more time it should be possible to use Bayesian inference to estimate the ODE parameters, but probably not by tomorrow.
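
As a rough, hand-rolled illustration of the idea (a sketch only, not Stan and not the linked post's model; the noise scale, proposal scale, and flat positive priors are placeholder assumptions), reusing deriv, y0, t, and state from the code above:

def log_posterior(params, t_obs, y_obs, sigma=500.0):
    r_, alpha_ = params
    if r_ <= 0 or alpha_ <= 0:                       # flat priors restricted to positive values
        return -np.inf
    pred = odeint(deriv, y0, t_obs, args=(r_, alpha_))[:, 1]
    return -0.5 * np.sum(((y_obs - pred) / sigma) ** 2)

t_obs = t[1:]
y_obs = state['daily_cases'].values[1:len(t)]        # align observations with t, as in the fits above

rng = np.random.default_rng(0)
current = np.array([2.0, 1.5])                       # placeholder starting point for (r, alpha)
current_lp = log_posterior(current, t_obs, y_obs)
samples = []
for _ in range(5000):
    proposal = current + rng.normal(scale=0.05, size=2)
    lp = log_posterior(proposal, t_obs, y_obs)
    if np.log(rng.random()) < lp - current_lp:       # Metropolis accept/reject step
        current, current_lp = proposal, lp
    samples.append(current)
samples = np.array(samples)
print(samples.mean(axis=0))                          # crude posterior means for (r, alpha)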

galois-agrushin commented 3 years ago

I was able to incorporate forecasts made by the COVIDhub-ensemble model for Georgia, via the following code, which requires the four attached CSV files (with forecast data) to be placed in the working directory.

cdc_forecast_dates = []
cdc_forecasts = []

for forecast_horizon in range(1, 5):
    cdc_forecast = pd.read_csv( 'COVIDhub-ensemble_13_' + str(forecast_horizon) + '.csv' ) # 13 is the location code for Georgia
    cdc_forecast['date'] = cdc_forecast['date'].apply(convertDate)
    cdc_forecast['forecast_date'] = cdc_forecast['forecast_date'].apply(convertDate)

    # We obtain the forecast that was made at the most recent forecast_date that falls within our training data window (i.e., that is no greater than int(len(t)*data_proportion) - 1).
    cdc_forecast = cdc_forecast.loc[ cdc_forecast['forecast_date'] <= int(len(t)*data_proportion) - 1 ]
    if len( cdc_forecast ) > 0:
        cdc_forecast_dates.extend( range( cdc_forecast['date'][ len(cdc_forecast) - 1 ] - 6, cdc_forecast['date'][ len(cdc_forecast) - 1 ] + 1 ) ) # The forecast is made for the week that ends on the given date.
        cdc_forecasts.extend( [ cdc_forecast['forecast'][ len(cdc_forecast) - 1 ] / 7 ] * 7 ) # The forecast provides the number of cases for the week; we divide it evenly across the seven days.

plt.plot(cdc_forecast_dates, cdc_forecasts, label="COVIDhub-ensemble")

Attachments: COVIDhub-ensemble_13_1.csv, COVIDhub-ensemble_13_2.csv, COVIDhub-ensemble_13_3.csv, COVIDhub-ensemble_13_4.csv

When combining this with the code that I showed yesterday, we get the following, for data_proportion = 0.55:

[plot: daily cases with the fitted SIR, constant baseline, and COVIDhub-ensemble forecasts for data_proportion = 0.55]

Because COVIDhub-ensemble forecasts were not available early in the pandemic, they will not be shown if data_proportion is much lower than 0.55. Also, because these forecasts provide weekly rather than daily case counts, the forecast curve appears jagged and does not always align with the "Constant Baseline" curve.

If needed, we could incorporate forecasts from other models, and for other states, but because I am short on hours for the project, I will not be able to spend a large amount of time on this.

galois-agrushin commented 3 years ago

The forecasts for the model that I mentioned (though not the model itself) are included here. The repository is no longer maintained; more details are here.

The SEIR simulator is included here, along with learned parameters. However, the code for learning parameters could not be found.

galois-agrushin commented 3 years ago

I tried varying the initial conditions (in particular, population size S0 and initial number of infected individuals I0) via a brute-force search, and found that I was getting better fits by using much larger population sizes. In particular, if we run the code that I posted earlier, but with S0 = 40000000 and I0 = 20, we get (for data_proportion = 0.55):

[plot: Georgia fit with S0 = 40000000, I0 = 20, and data_proportion = 0.55]

For Florida, we can use the same values of S0 and I0 as given above, and obtain:

[plot: Florida fit with the same S0 and I0]

Note that I replaced the line state['cumsum_cases'] = state['cumsum_cases'] - 27270 with state['cumsum_cases'] = state['cumsum_cases'] - state['cumsum_cases'][ state.first_valid_index() ], so that it is not specific to Georgia.
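
The brute-force search itself was roughly of this shape, reusing the setup from the full code above (a sketch; the grid values below are placeholders, not the values actually searched):

def residual_with_y0(params, y0_try):
    trainT = t[:int(len(t)*data_proportion)]
    pred = odeint(deriv, y0_try, trainT[1:], args=tuple(params))
    return state['cumsum_cases'][1:int(len(t)*data_proportion)] - (pred[:, 1] + pred[:, 2])

best = None
for S0_try in [10000000, 20000000, 40000000, 80000000]:   # placeholder grid
    for I0_try in [1, 5, 10, 20, 50]:                      # placeholder grid
        fit = least_squares(residual_with_y0, [r, alpha], args=((S0_try, I0_try, 0),))
        if best is None or fit.cost < best[0]:
            best = (fit.cost, S0_try, I0_try, fit.x)
print(best)   # lowest residual cost, with the corresponding S0, I0, and fitted (r, alpha)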

All of this is, of course, not particularly grounded in reality, since population sizes S0 are much larger than they actually are, and initial infected counts I0 are smaller. The fundamental issue, I think, is that the SIR model is not versatile enough to both capture the data and have realistic parameters, so it has to be hacked somewhat.

Also, thinking about it some more, I can see why we were (and still are) getting low R0 values: in our datasets, which capture an early spike in COVID cases (rather than the full evolution of a pandemic), only a small fraction of the population becomes infected; in an SIR model, I think this only happens when R0 is close to 1.
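
A quick way to sanity-check that intuition uses the standard SIR final-size relation z = 1 - exp(-R0 * z), where z is the fraction of the population ever infected (a sketch, separate from the fitting code):

from scipy.optimize import brentq

for R0_val in [1.05, 1.2, 1.5, 2.0, 3.0]:
    # Solve the final-size equation for the attack rate z.
    z = brentq(lambda z: z - (1 - np.exp(-R0_val * z)), 1e-6, 1.0)
    print(R0_val, round(z, 3))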