Production for Backfill Correction

jingjtang commented 1 year ago

We already have most of the work done in covid-19(the private repo). The main goal of this project is to provide backfill correction for the values that are reported everyday.

Work left

[x] Move the pipeline from covid-19 repo to covidcast-indicators @nmdefries
[ ] Decision on the training frequency and training dates
[x] Generate historical backfill files for chng, quidel, etc and load to midas. @jingjtang
[x] Change/not change the path to stored backfill files
[x] Decision on the way to publish the output @RoniRos @krivard
- Currently, we have all the corrections stored in csv files
- Or make them new signals?
- Or add them as extra columns for the existing signals? (seems impossible?)
- Or publish the output to an AWS S3 bucket?

2022-10-06 - Engineering meeting notes

nmdefries commented 1 year ago

We could also publish the output to an AWS S3 bucket.

nmdefries commented 1 year ago

Decision on the training frequency and training dates

To start the conversation off on this, my understanding was that we'd train the models once a month on the first of the month at night (when we normally schedule longer-running processes).

krivard commented 1 year ago

make them new signals
- pro: easy for any epidata user to incorporate into their existing pipelines
- con: storage & performance impact on covidcast endpoint
- con: separate signals obscures relationship to uncorrected values
add them as extra columns for the existing signals
- pro: closely associates corrections with uncorrected values
- con: probably best to store these in a separate table, which would require a significant (3-4 weeks? similar to covid_hosp) development effort on its own, plus whatever handling in the covidcast endpoint to conditionally glue it on
- con: probably makes output incompatible with existing API clients. we could adapt the clients to handle it but then anyone who doesn't update gets a crash.
publish the output to an AWS S3 bucket
- pro: uploads are easy
- con: separate access from all other epidata. more work for users

RoniRos commented 1 year ago

We need to establish structured relationships between different signals, e.g. those related by JIT work. Backfill projection signals should be part of that. I will add this task to our task list. We need to decide the priority of such a task, and based on that, whether backfill-projection should wait for it. If it shouldn't (which is what I expect we might decide), we should use a temporary solution.

krivard commented 1 year ago

Proposed AWS S3 file organization:

Unique file per

source
signal
region
correction computation date

Then a single-region timeseries is a single file fetch.

Storage for this is going to get big pretty quickly, since we're saving the full correction history. To save space, consider:

gzip parquet format
limited precision
excluding replicated data

nmdefries commented 1 year ago

excluding replicated data

Katie, can you expand on this? Are you talking about diffing the previous day's and the current day's data to find changes only?

The compression and limiting precision points make sense.

RoniRos commented 1 year ago

The structure we are designing here for storing backfill projections is pretty much the same as that needed for storing forecasts, nowcasts, and backcasts. In fact backfill projection is essentially a backcast of a particular indicator. So I'd like to bring @ryantibs , @brookslogan and potentially other forecast-related folks into this discussion.

I think the relevant dimensions are:

Source+Signal
geo-location (aka "region")
reference date
as-of datetime (better than "computation date" because it could be computed retrospectively, as long as it uses the covariates as of that date)
quantile

To put these in S3 files, I agree it make sense to separate the data at least by:

Source+Signal
geo-location

and possibly also by:

quantile (because users may be interested is specific subsets of quantiles, like 50%, or (10%, 50%, 90%)).

This leaves a 2D table of {reference_date X as-of_datetime}. Since these data will be produced every as_of day, it makes sense to add as_of to the file name/identify, and make the file consist of ~60 values corresponding to lags from 0 or 1 up to ~60, corresponding to reverse-successive reference dates. Since we are going to have all ~60 values, we don't need to store the actual lag (and we can always represent a missing value with an extra separator).

Alternatively, we could store all the quantiles in the same file. But it can get messy if different quantile sets are produced at different times or for different indicators.

@ryantibs , @brookslogan : how are quantile forecasts stored by us? By the Forecast Hub?

krivard commented 1 year ago

make the file consist of ~60 values corresponding to lags from 0 or 1 up to ~60, corresponding to reverse-successive reference dates

does that mean we're okay with users needing to pull multiple files in order to build a time series covering the whole pandemic?

krivard commented 1 year ago

excluding replicated data

Katie, can you expand on this?

I do not mean diffing; that would be a last resort as it would require much more complicated data fetching capabilities than are easy to do in S3. It's possible, and we'd have help setting it up (it was recommended by one of the Amazon data teams), but it would take substantial effort.

I mean pulling any data that's the same for all rows of the file out into the filename or some kind of header, and pulling any categorical data with long human-readable names into an index file and referring to it by a numeric id instead.

krivard commented 1 year ago

as-of datetime (better than "computation date" because it could be computed retrospectively, as long as it uses the covariates as of that date)

this will create additional operational costs when source (covariate) data patches are applied, and the operational cost of data patches is already high. i'll follow up with you offline.

brookslogan commented 1 year ago

Trying to give some quick answers and think more about this later.

We currently store forecasts in a couple of formats, with no real care for saving space:

One RDS for each internal forecaster in our system. (Stored in S3 cache.)
One CSV for each submitted forecast.
~One RDS for all forecasts.~ Obviously not scalable, but at one point it worked for all our pseudoprospective forecaster candidates.

The Hub, at least in their GitHub, stores:

One CSV per forecast. Originally contained in the repo contents, now backed via Git LFS.
Not sure what they do in Zoltar.

RoniRos commented 1 year ago

make the file consist of ~60 values corresponding to lags from 0 or 1 up to ~60, corresponding to reverse-successive reference dates

does that mean we're okay with users needing to pull multiple files in order to build a time series covering the whole pandemic?

Possibly. Backfill corrections don't matter beyond a 60 day lag, and for any reference_date older than 60 days ago, they are not needed except for retrospective error analysis and training forecasting models.

More generally, here are the use cases I can think of for backfill-projected signals:

Displaying, in real-time, the most uptodate estimates available of the finalized values of the recent past: for a fixed signal, region, quantile and as-of_datetime, get a list of 60 projected values, where reference_date goes back from as-of_date to as-of_date-60, and in parallel lag goes up from 0 to 60. This would be the most common operational use of this signal. It will also be the most likely set of covariates for forecasting. In my proposal above, it will require a single file.
Showing a map of the signal value in, e.g. 50 U.S. states, using not the reported value but rather the most recently estimated 50% backfill-projected value. In my proposal above, this will require 50 files (or 3000 if we do counties). If you want to step the map forward or backward in reference_date, each such step will take another 50 (or 3000) files. Not great.
Demonstrating the accuracy of backfill projection: for a fixed signal, region, historical reference period (>60 days ago), and handful of quantiles, plot the (relative) error between the finalized value and the projected value as function of the lag. This requires 2D data, spanning lags and reference dates. So it would require multiple files spanning multiple as-of_datetimes. But this is heavy backend data analysis, so it's okay to require many files.
Train forecasting models that use the most recent available estimates as covariates. That's like (1) above, but for all available as-of_datetimes. Acceptable for training, I think.

So (1) is 1 file, and (3) and (4) can afford to take more time. My remaining concern is the map (2). To solve that, we could: A. store all states/counties in the same file, increasing the size of the file. Do we do that for any of the other signals? B. store (the 50% quantile only) a second time, grouped by regions.
C. store it as an additional indicator in Epidata. Think of it as another (corrected) variant of the original signal.

krivard commented 1 year ago

store all states/counties in the same file, increasing the size of the file. Do we do that for any of the other signals?

We store contingency tables for CTIS as static files, but that's the only one I'm aware of. @nmdefries can comment on their format and makeup.

RoniRos commented 1 year ago

Upon reflection, I am partial to (B) or (C) above. Since it's only one of the O(10) quantiles, storing it a second time will increase space by only O(10%).

Alternatively, we can ignore this problem until we see actual demand for such maps. It's good enough that we have a solution ready.

RoniRos commented 1 year ago

One CSV for each submitted forecast.

One CSV per forecast.

Does "per forecast" mean per computation_time/as-of_time, but for all regions and forecasting targets?

~One RDS for all forecasts.~

Does "all forecasts" mean forecasts for all regions, and/or all targets? Or also all forecasting_times?

brookslogan commented 1 year ago

Yes, "per forecast" means per model & forecast_date/as_of, but containing data for all regions, targets, & quantiles. We'd probably also make this per geo_type and time_type if we were really dealing with multiple of those. (We do calculate national from state, but as a fixed post-processing step; we don't do national-level analysis.)

"all forecasts" is everything: all models, forecast_dates, regions, targets, & quantiles. (Again, if we had multiple geo&time types, it'd probably be one file per type combination.) But we've tried so many models that this is slow and uses up too much RAM; currently, we get by by loading only a subset of models of interest.

Our primary/sole use case above --- pseudoprospective & prospective forecast evaluation --- looks like use case 3. And while we haven't really done this with covid & influenza hospitalization forecasting, we may also match use case 1 for post hoc investigation & debugging of bad forecasts or forecasts near data anomalies.

For 2: what lag would this be? At least from the forecasting perspective, I would be less interested in the latest and more interested in a set of maps for each lag. Not sure if you were thinking of forecasters as the users here though.

RoniRos commented 1 year ago

Thanks @brookslogan . I agree that for evaluation you would use case (3) and maybe also case (1) of the forecasts rather than of the backfilled covariates). For training forecasters, I assume you might want case (4) of the backfilled covariates.

For 2: what lag would this be? At least from the forecasting perspective, I would be less interested in the latest and more interested in a set of maps for each lag. Not sure if you were thinking of forecasters as the users here though.

Good point. I wasn't thinking clearly. I was indeed thinking about real-time PH users. They would typically want the most uptodate estimate for today, and possibly to scroll back to the most uptodate estimates for any past reference_date. That means the lag would change as you step through reference_date; what remains fixed is the as-of, which will be 'today' (aka latest). This corresponds roughly to all the 50% quantile values produced today by the backfill projection code.

@krivard Do we know the space/time tradeoffs of large files vs. small files in S3? E.g. a linear time/space complexity model of file size?

nmdefries commented 1 year ago

A. store all states/counties in the same file, increasing the size of the file. Do we do that for any of the other signals?

We store contingency tables for CTIS as static files, but that's the only one I'm aware of.

Each CTIS contingency table contains all signals of interest, each as an additional column, for all geo values (Texas, California, etc) of a given geo type (state, e.g.) for a particular time period. So we have one file for each time period + geo level. The contingency tables aren't versioned, so if we have to regenerate data we overwrite the old file.

Having each file contain multiple value columns is pretty inconvenient. If you need to regenerate a single signal or backfill a new signal that you want the history for, you spend a lot of time computing data you already know. It's also slower to fetch data if you only want to process a single signal, e.g. for plotting.

krivard commented 1 year ago

Do we know the space/time tradeoffs of large files vs. small files in S3? E.g. a linear time/space complexity model of file size?

in terms of download speed? not precisely, but based on general principles i'd expect there to be some amount of per-file overhead. if you want to know for sure and are willing to wait for results we can run an experiment. what factors are you thinking?

RoniRos commented 1 year ago

I was thinking of files either consisting of 60 values (as per my proposal above), or else lumping together all regions in a geo-level (so 60 x ~50 for U.S. states, and 60 x ~3000 for counties).

Use case (1) typically needs only a single region.
Use case (2) needs all regions (the map). Use cases (3) might need either, depending on whether you analyze accuracy for a specific region or on average for all regions. Use case (4) typically trains only within a single region. In any case, use cases (3)+(4) can afford lower response time.

If you have a clear sense of which is better, or another solution you prefer, I am fine just running with it.

Estimating the per-file and per-byte access cost might be generally useful beyond this question.

brookslogan commented 1 year ago

(B) / (C) sounds like a reasonable approach.
If we're trying to make a general approach that could be applied to all data sets rather than just these backfill corrections, then there's the additional complication of larger-lag updates. For these backfill corrections, we can set a max lag for which corrections will be published, and up until that lag, we expect values to change between versions rather than stay the same (as_of t is basically issue t). But for other data, there may not be a reasonably small maximum lag, although, for many sources, we instead expect values to stay the same. So for something more general, we may need to think about, e.g., storing issue data + additional views to make up for not having as_of data. Not sure if that's something we want to try to address ahead of time when it doesn't apply here.

RoniRos commented 1 year ago

I think we are aiming for a general approach for signals that have significant backfill, but only those where the backfill is dense enough that he statistical model Jingjing developed is reasonable.

@brookslogan Can you please give concrete examples of the kind of signals, or the kind of hypothetical backoff, that you are concerned about?

brookslogan commented 1 year ago

I think I just misread "for signals" etc. to mean for the signals themselves rather than the backcasts. I think the proposed approach is fine for any sort of backcast, nowcast, or forecast that outputs a manageable set of behinds/aheads. What I was concerned about is taking this approach and applying it also to raw signal archiving, where changes don't necessarily occur in a manageable set of behinds/aheads; e.g., we don't have a hard guarantee that ILI, JHU-CSSE case count reporting, or the CHNG raw data for some day two years ago won't be revised tomorrow, so there might be some extra trouble when thinking about raw signal archiving. But this might be off topic.

cmu-delphi / covidcast-indicators

Production for Backfill Correction #1700