cmu-delphi / covidcast-indicators

Back end for producing indicators and loading them into the COVIDcast API.
https://cmu-delphi.github.io/delphi-epidata/api/covidcast.html

Resolve performance issues related to data versioning (quick way) #158

Closed: krivard closed this issue 4 years ago

krivard commented 4 years ago

We did #24, but now directions and meta cache updates take forever and run out of memory. Let's fix that.

krivard commented 4 years ago

Really addressing this will probably take a more substantial engineering effort.

K will follow up with George, Wael, and Brian for a more detailed engineering convo about this.

AlSaeed commented 4 years ago

A possible way forward for the direction update is to compute the update over partitions of the data by time-series key (source, signal, time_type, geo_type, geo_value), updating the direction one partition at a time. An added advantage is that we could then schedule the direction update for different partitions at different times of the day.
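A minimal sketch of the kind of partitioning function this implies (illustrative names, not the actual updater code): hash the time-series key into a fixed number of buckets so each ingestion cycle can restrict the direction update to a single bucket.

```python
# Illustrative sketch only -- not the actual direction updater.
# Hash each time-series key into one of N_PARTITIONS stable buckets so a
# single ingestion cycle can process just one bucket's direction updates.
import zlib

N_PARTITIONS = 4  # hypothetical; would be tuned from the data distribution


def partition_of(source, signal, time_type, geo_type, geo_value):
    """Map a time-series key to a stable partition index in [0, N_PARTITIONS)."""
    key = f"{source}|{signal}|{time_type}|{geo_type}|{geo_value}".encode("utf-8")
    # crc32 is deterministic across processes, unlike Python's salted hash().
    return zlib.crc32(key) % N_PARTITIONS


# Example: this cycle handles only partition 2.
# series_to_update = [k for k in all_series_keys if partition_of(*k) == 2]
```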

krivard commented 4 years ago

Results from meeting:

@AlSaeed to modify direction computation to handle only one partition per ingestion cycle, as described above. @korlaxxalrok to provide data distribution information to help determine an effective partitioning function. We expect this to reduce both memory and running time, at the expense of timeliness.

@melange396 to modify covidcast table definition as specified below, then modify (1) CSV import to fill the is_wip column appropriately, and (2) meta cache update to handle one (source, signal) pair at a time. We expect (1) to reduce running time, and (2) to reduce memory requirements at the expense of some additional running time overhead.
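As a rough sketch of what (2) could look like (hypothetical structure, not the actual epidata code): walk the distinct (source, signal) pairs and aggregate each one in its own query, so only one pair's data is ever held in memory.

```python
# Sketch only: compute the meta cache one (source, signal) pair at a time.
# The aggregates shown (row count, min/max time_value) are illustrative;
# the real cache stores more fields.
def update_meta_cache(cursor):
    cursor.execute("SELECT DISTINCT `source`, `signal` FROM `covidcast`")
    pairs = cursor.fetchall()

    entries = []
    for source, signal in pairs:
        # Each pair is aggregated in its own query, trading some extra query
        # overhead for a bounded memory footprint.
        cursor.execute(
            "SELECT COUNT(1), MIN(`time_value`), MAX(`time_value`) "
            "FROM `covidcast` WHERE `source` = %s AND `signal` = %s",
            (source, signal),
        )
        num_rows, min_time, max_time = cursor.fetchone()
        entries.append({
            "data_source": source,
            "signal": signal,
            "num_rows": num_rows,
            "min_time": min_time,
            "max_time": max_time,
        })
    return entries  # the caller writes these to the cache
```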

Table mods:

This is a short-term fix to get us back on our feet so we can take our time and get the next one right: the long-term plan specified in #161.

krivard commented 4 years ago

Also add int columns:

melange396 commented 4 years ago

What's the use case for those deleted/missing variables? Will that come from a CSV filename or a row value? Or does it get applied in post-processing like the direction updater?

krivard commented 4 years ago

There are several circumstances that generate a missing value, and I'm still working on cataloguing them all (see also cmu-delphi/delphi-epidata#147).

Some of them can arrive in the CSV files during normal ingestion:

At least one of them cannot:

At some point we plan to have a unified API interface that invisibly handles both public + private queries; once that happens, I can also imagine returning "restricted access" if you ask for e.g. ZIP5-level data but don't provide an authentication key.
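Purely to illustrate what those int columns could encode (hypothetical codes; the real catalogue is still in progress, see cmu-delphi/delphi-epidata#147):

```python
# Hypothetical illustration only: one possible encoding for the int columns
# tracking why a value is missing. The real set of reasons was still being
# catalogued at the time (cmu-delphi/delphi-epidata#147).
from enum import IntEnum


class MissingReason(IntEnum):
    NOT_MISSING = 0        # the column holds a real observation
    NOT_APPLICABLE = 1     # e.g. stderr is undefined for this signal
    CENSORED = 2           # suppressed during ingestion, e.g. small samples
    DELETED = 3            # retracted after publication (post-processing)
    RESTRICTED_ACCESS = 4  # e.g. ZIP5-level data requested without an auth key
```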

krivard commented 4 years ago

Partitioned directions are ready in cmu-delphi/delphi-epidata#161 for testing on staging.

The meta cache update is nearly ready but is failing some mysterious, unrelated tests; W to take a look.

krivard commented 4 years ago

Partitions are taking ~35 minutes apiece; a test run without specifying a partition also took ~35 minutes, so it's not clear how to run all partitions.

AlSaeed commented 4 years ago

Partition 0 is failing in both cases: one time series with signal 'wip_confirmed_7day_avg_cumulativ' doesn't have a stdev value, which means it doesn't have any data before Constants.SIGNAL_STDEV_MAX_DAY. The fix will depend on the intended behavior:

krivard commented 4 years ago

Loads of options here, none of them particularly good.

Quick fix:

Long-term:

krivard commented 4 years ago

The meta cache revision is available in cmu-delphi/delphi-epidata#167, but it includes the column renames, so we'll need the disk space increased before we can apply it.

krivard commented 4 years ago

Working on adding a column to track which issue is the most recent.

Provisioning storage for further testing, including the column renames.

krivard commented 4 years ago

DB schema: the signal data type fix is done (VARCHAR(32) -> VARCHAR(64)); the column renames are in progress.

Once that's done, we'll update the direction code to use the new names, then merge the meta cache update changes and test them.
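For reference, the width fix amounts to something like the statement below (a sketch; the real migration lives in the delphi-epidata repo). The truncated 'wip_confirmed_7day_avg_cumulativ' name mentioned above is presumably what a 32-character limit produces from 'wip_confirmed_7day_avg_cumulative'.

```python
# Sketch of the signal-width fix (not the actual delphi-epidata migration):
# widen `signal` so long WIP names are no longer truncated at 32 characters.
cursor.execute("ALTER TABLE `covidcast` MODIFY `signal` VARCHAR(64)")
```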

krivard commented 4 years ago

Setup branches are ready to merge.

The direction update still needs its tests fixed; conflicts are expected because we're adding columns from two different branches.

krivard commented 4 years ago

All staging PRs have been merged; next is to deploy to staging and test.

krivard commented 4 years ago

is_most_recent_issue update ran overnight

Remaining tests:

korlaxxalrok commented 4 years ago

The current lay of the land:

krivard commented 4 years ago

Assuming no large issues with CSV uploading, this will work as the stopgap fix until we do the major table reorganization.

To deploy to main:

AlSaeed commented 4 years ago

The fill_is_latest_issue script is only meant to be run once to initialize the is_latest_issue column. The integrity of the column is then maintained by the updated insert_or_update_batch query.

For the performance of the direction update, I suggest adding the following key and benchmarking at least one partition:

ALTER TABLE `covidcast` ADD KEY `by_is_latest_issue` (`source`, `signal`, `time_type`, `geo_type`, `geo_value`, `time_value`, `is_latest_issue`)
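A rough sketch of the invariant the updated insert_or_update_batch query has to maintain (hypothetical code, not the actual query): when a newer issue arrives for a series/date, the older rows for that key lose the flag and the incoming row carries it.

```python
# Rough sketch of the is_latest_issue invariant, not the actual
# insert_or_update_batch implementation: inserting a newer `issue` for the
# same (source, signal, time_type, geo_type, geo_value, time_value) clears
# the flag on older rows; the new row is then inserted with the flag set.
def clear_older_issues(cursor, key, new_issue):
    source, signal, time_type, geo_type, geo_value, time_value = key
    cursor.execute(
        "UPDATE `covidcast` SET `is_latest_issue` = 0 "
        "WHERE `source` = %s AND `signal` = %s AND `time_type` = %s "
        "AND `geo_type` = %s AND `geo_value` = %s AND `time_value` = %s "
        "AND `issue` < %s",
        (source, signal, time_type, geo_type, geo_value, time_value, new_issue),
    )
```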

korlaxxalrok commented 4 years ago

Adding more detail to what a cutover might look like:

Once all the changes are made we should be able to start Automation.

Follow on tasks would be to:

Once we know how long the is_latest_issue fill takes, we can extrapolate for production. We probably have nearly 2x the rows there.

korlaxxalrok commented 4 years ago

Hmm... I ran the is_latest_issue fill again on staging and it took ~237 minutes. Assuming it did the same amount of work as the first time it ran, we might expect it to take ~400 minutes on prod (336M rows in staging vs. 574M rows in production).
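(Scaling linearly by row count: 237 min × 574/336 ≈ 405 min, consistent with the ~400-minute estimate.)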

krivard commented 4 years ago

Still need to check:

A 2-day migration event sounds wise.

krivard commented 4 years ago

Evidence suggests we'll have to upgrade the replicas as well, ideally the same night we upgrade prod. We'll do it with a script, so a buddy should not be necessary, but we'll keep everyone posted.

Estimated schedule:

krivard commented 4 years ago

Adding more storage to prod tonight before we do the table renames, code push, and issue fill (probably tomorrow).

korlaxxalrok commented 4 years ago

The storage has been added, so after Automation has a chance to right itself we can look at the final steps for this.

korlaxxalrok commented 4 years ago

@krivard I have the tables renamed/created for this task. Next up is:

The PR has reviewers, but I am not sure who might be available this week.

korlaxxalrok commented 4 years ago

@krivard Ah! I see, I didn't notice those were just suggestions for reviewers. So maybe this is ready to go. I'm not going to just merge it, but when you have a chance to double-check it in some capacity we can merge it.

krivard commented 4 years ago

Is that the branch currently on staging, and is staging working as it's meant to?

korlaxxalrok commented 4 years ago

Yeah, the branch has been deployed there and should have been what we tested with (87b1984).

Looking back, we manually tested CSV upload, direction update, and the cache update.

I think we've done what we can for testing. Our staging env is not a super robust approximation of production, but the code ran there. We don't run an identical Automation flow, so outliers could exist.

I think we can merge, deploy, and move forward (carefully), but I can definitely run another batch of tests if we think that is valuable.

krivard commented 4 years ago

Okay, I've double-checked that the API on staging returns data, and the meta queries work as well.

I can merge now, or wait for the morning if we're worried it will fail overnight.

korlaxxalrok commented 4 years ago

I think we can go ahead and merge. I'll watch to see if we have any major issues.