cmu-delphi / covidcast-indicators

Back end for producing indicators and loading them into the COVIDcast API.
https://cmu-delphi.github.io/delphi-epidata/api/covidcast.html
MIT License
Doctor Visits adjusted signal AUC does not match the raw signal #2045

Open nolangormley opened 2 months ago

nolangormley commented 2 months ago

Actual behavior

In the Doctor Visits data, the area under the curve (AUC) of the day-adjusted signal does not match that of the raw signal. Over the period plotted below, the raw values sum to 67.70 and the day-adjusted values sum to 56.22.

[Figure: docvisit (plot of the raw and day-adjusted signals)]

Expected behavior

@RoniRos and I were looking through this yesterday and it was our intuition that the AUC should match between these two signals.

Context

Here's some code to replicate the plot above

import pandas as pd
import wget

# Download the raw and day-adjusted national signals; wget.download returns the saved filename.
docvisit = wget.download("https://api.covidcast.cmu.edu/epidata/covidcast/csv?signal=doctor-visits:smoothed_cli&start_day=2024-05-29&end_day=2024-08-29&geo_type=nation")
docvisitadj = wget.download("https://api.covidcast.cmu.edu/epidata/covidcast/csv?signal=doctor-visits:smoothed_adj_cli&start_day=2024-05-29&end_day=2024-08-29&geo_type=nation")

df = pd.read_csv(docvisit)
dfadj = pd.read_csv(docvisitadj)

# Parse the dates so both frames merge and plot on the same time axis.
df["time_value"] = pd.to_datetime(df["time_value"], utc=True)
dfadj["time_value"] = pd.to_datetime(dfadj["time_value"], utc=True)
dfadj = dfadj[["time_value", "value"]].rename(columns={"value": "valueadj"})

# Left-join the adjusted values onto the raw values and plot both series.
foo = df[["time_value", "value"]].merge(dfadj, on="time_value", how="left")
foo.plot(x="time_value", y=["value", "valueadj"])
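
The two sums quoted earlier can be compared with a small helper along these lines. This is only a sketch: `auc_summary` is a hypothetical name, and the short lists below are made-up values standing in for the real columns so that the snippet runs without network access.

```python
import math

def auc_summary(raw, adjusted):
    """Sum two aligned daily series over the days where both are present."""
    pairs = [
        (r, a) for r, a in zip(raw, adjusted)
        if not (math.isnan(r) or math.isnan(a))
    ]
    raw_sum = math.fsum(r for r, _ in pairs)
    adj_sum = math.fsum(a for _, a in pairs)
    return {"raw": raw_sum, "adjusted": adj_sum, "ratio": adj_sum / raw_sum}

# Made-up values standing in for foo["value"] and foo["valueadj"].
demo_raw = [1.0, 2.0, 3.0, 4.0]
demo_adj = [0.5, 2.0, float("nan"), 3.0]

summary = auc_summary(demo_raw, demo_adj)
print(summary)
```

With the merged frame, the call would be `auc_summary(foo["value"], foo["valueadj"])`. Days missing from either series are skipped, so a gap left by the left join does not bias one sum against the other.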

nolangormley commented 2 months ago

I believe this was part of @rumackaaron's work. Are we correct in assuming that these should match?

rumackaaron commented 2 months ago

Interesting find! Mathematically, they don't have to match and I think that's the expected behavior in this case. When creating the design matrix in weekday.py, the constraint is that $\sum_{wd=0}^6 \alpha_{wd} = 1$. After fitting the day-of-week parameters $\alpha$, we take the original signal $y_t$ and multiply it by $\exp(\alpha_{wd})$ to get the weekday-adjusted signal $y'_t$ (where $wd$ is the day-of-week of $t$).

For simplicity, say that there are only two days in the week. Let $\alpha_0 = -1$ and $\alpha_1 = 1$, and $y_0 = 5$ and $y_1 = 1$. The sum of the raw values $y$ is 6, and the sum of the weekday-adjusted values is $5\exp(-1) + \exp(1) \approx 4.56$. We see something similar here, where the sum of the adjusted signal is lower than the sum of the raw signal.
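
The two-day example can be checked numerically. The snippet below is a minimal sketch that only reproduces the arithmetic of the toy case; it does not call weekday.py, and the variable names are chosen here for illustration.

```python
import math

alpha = [-1.0, 1.0]  # day-of-week parameters from the two-day example
y = [5.0, 1.0]       # raw values, one per "weekday"

# Multiplicative adjustment as described: y'_t = y_t * exp(alpha_wd)
y_adj = [y_t * math.exp(a) for y_t, a in zip(y, alpha)]

raw_sum = sum(y)
adj_sum = sum(y_adj)
print(f"raw sum = {raw_sum:.2f}, adjusted sum = {adj_sum:.2f}")
```

The large raw value lands on the day that is scaled down and the small one on the day that is scaled up, so the adjusted total comes out below the raw total of 6.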

It may be possible to create a different constraint to ensure that (at least on the training data), the sum of the original signal is the same as that of the adjusted signal. I don't think it's possible to ensure that constraint holds over an arbitrary time interval while using multiplicative day-of-week effects.
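
One way to picture such a constraint is a single global rescaling that forces the totals to agree on the training window. The sketch below uses made-up data and made-up weekday parameters, and the rescaling step is an assumption for illustration rather than anything weekday.py does. It also shows that the match does not carry over to a sub-interval.

```python
import math

# Made-up weekday parameters and a made-up two-week "training" series.
alpha = [0.3, 0.1, 0.0, -0.1, -0.2, 0.4, -0.5]
y = [4.0, 5.0, 5.0, 6.0, 7.0, 3.0, 8.0,
     5.0, 6.0, 6.0, 7.0, 8.0, 4.0, 9.0]

# Multiplicative day-of-week adjustment: y'_t = y_t * exp(alpha_wd).
y_adj = [v * math.exp(alpha[t % 7]) for t, v in enumerate(y)]

# Hypothetical fix: one global factor so the totals agree on the training window.
scale = sum(y) / sum(y_adj)
y_matched = [scale * v for v in y_adj]

total_gap = abs(sum(y_matched) - sum(y))          # zero up to rounding
week1_gap = abs(sum(y_matched[:7]) - sum(y[:7]))  # nonzero for this data
print(f"total gap = {total_gap:.2e}, first-week gap = {week1_gap:.3f}")
```

Because the ratio of adjusted to raw totals differs from week to week, a factor tuned to the whole window leaves a residual gap on each individual week.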

P.S. I find it concerning that the "sawtooth" pattern is still present in the adjusted signal. I don't know what the training period is for fitting the day-of-week effects, but it may be worth experimenting to find an appropriate period that consistently removes the "sawtooth" pattern.

RoniRos commented 2 months ago

> I don't think it's possible to ensure that constraint holds over an arbitrary time interval while using multiplicative day-of-week effects.

Indeed. In fact, it's not possible to ensure that with any modification (think of the special case of an interval of one day).

Even if we relax the requirement to all intervals of some fixed length (e.g. 7 days), I think that the only solution is a moving average. But a moving average isn't sufficiently sensitive to the most recent developments.
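
The sensitivity concern can be seen on a made-up step change. The sketch below assumes a trailing 7-day mean as the moving average; `trailing_mean` is a hypothetical helper, not part of the indicator code.

```python
def trailing_mean(signal, window=7):
    """Mean of the last `window` values (fewer at the start of the series)."""
    out = []
    for t in range(len(signal)):
        chunk = signal[max(0, t - window + 1): t + 1]
        out.append(sum(chunk) / len(chunk))
    return out

# Made-up step: the level jumps from 0 to 7 on day 7 and stays there.
signal = [0.0] * 7 + [7.0] * 7
smoothed = trailing_mean(signal)
print(smoothed[7:])
```

The smoothed series climbs by one unit per day and only reaches the new level a full week after the jump.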

This suggests an asymmetric kernel, e.g. a triangle or half-Gaussian. I think all kernels satisfy some form of long-term AUC equivalence. But this doesn't address the day-of-week effects.
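
The long-term AUC equivalence can be illustrated with a one-sided triangular kernel whose weights sum to 1. This is a sketch with made-up data, assuming the signal is zero outside the observed window; `causal_smooth` and the weights are hypothetical.

```python
def causal_smooth(signal, kernel):
    """Full causal convolution: out[t + k] += signal[t] * kernel[k]."""
    out = [0.0] * (len(signal) + len(kernel) - 1)
    for t, s in enumerate(signal):
        for k, w in enumerate(kernel):
            out[t + k] += s * w
    return out

# One-sided triangular weights (most recent day heaviest), normalized to sum to 1.
weights = [4.0, 3.0, 2.0, 1.0]
kernel = [w / sum(weights) for w in weights]

signal = [0.0, 2.0, 5.0, 3.0, 8.0, 1.0, 0.0]  # made-up daily values
smoothed = causal_smooth(signal, kernel)

total_in = sum(signal)
total_out = sum(smoothed)
short_in = sum(signal[:3])
short_out = sum(smoothed[:3])
print(total_in, total_out, short_in, short_out)
```

The totals agree because the sum of a full convolution is the product of the two sums, while the three-day window differs because mass is shifted to later days. Truncating the output to the observed window drops the tail, so the equivalence is only approximate near the edges.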

We need to send this problem for some research TLC.