[fb-survey] synthetic dataset and unit tests

krivard commented 4 years ago

[x] spec out what kind of data is needed
[x] generate data or adapt existing data to be delphi-safe
[x] complete tests

capnrefsmmat commented 4 years ago

See https://github.com/cmu-delphi/covid-19/pull/73

I'll leave this open for @statsmaths to note any additions or changes that have to be done to the synthetic data, and to track the cleanup and testing of the Facebook code.

krivard commented 4 years ago

Collecting here a list of cases the code habitually needs to handle that won't necessarily get caught by the synthetic data:

Should accept multiple --full arguments
Should only keep the earliest record for a token when sorted by StartDate
Should merge --full arguments with a savefile when --load is specified, keeping the earliest record for a token when sorted by StartDate, and excluding that record from recoded.cleaned.recorded.partA iff it was originally excluded (debugging this was no fun at all)

Integrity checks can't be run with any kind of synthetic data. They include:

Make sure the local files in git are up-to-date with what's currently in the API
Make sure the updates to local files (we get ~4 days of backfill with a small number of respondents) are small
Make sure there are no big jumps in API columns vs what's currently in the API
Make sure there are no big jumps in effective_sample_size vs what's currently in git

We used to need to handle files that were encoded in Mountain time instead of Pacific, but with the Qualtrics API that's not necessary anymore. It should stay in, just in case anyone ever needs to load an old file (generated before 20 April), but it's not a testing priority.

statsmaths commented 4 years ago

I have had a chance now to starting looking at the synthetic data and existing code. There is a lot of code in the current pipeline that has been copied and slightly changed (with little to no comments). This makes it nearly impossible to try to reorganise the way I did for JHU and GHT, namely refactoring the code into smaller functions and cleaning up as I went along. It seems like a better approach will be starting over with a new code base, and borrowing existing code chunks as possible. In order to do this, I need to understand what the code is supposed to be doing. Note that I am (at least currently) only attacking what's covered in the R code inside of the covid-19/facebook directory and not consider the pull of data from the directory covid-19/qualtrics-cron.

statsmaths commented 4 years ago

Here is my current understanding of the code and data:

The input dataset processed by the R code has the structure of the synthetic data constructed here. This is a csv file with additional information pushed into the header. Each record has a field token that we use to match with the weights from Facebook and the field A3 captures the self-reported zip code. Currently we are interested in the responses A1_1 through A1_5, which are of the form "Does someone in your household have this symptom" and the data are recorded as 1="No", 2="Yes", and NA="Did not answer".
Our main variables of interest are CLI (does someone have a fever and one of: cough, shortness of breath, difficulty breathing) and ILI (does someone have a fever and one of: cough, soar throat).
There are two additional files that give "weights" for each of the households.
The output variables are given at four geographic levels (hrr, msa, county, state), given as a raw and smoothed version, and given as both weighted and unweighted forms. That yields a total of 4x2x2x2 = 32 total indicators.

Here are the outstanding questions I have and was not able to find in the code after looking at it for the better part of the day:

Is there something incorrect in my high-level outline above?
What method do we use to go from a pair of vectors (responses of CLI or ILI, weights) for a single geographic region to an estimated value, standard error, and sample size? Is the un-weighted version the same as the weighted version with uniform weights, or is there
There are several date fields in the dataset. Which one is used to assign a survey to a specific date in time?
What is the difference between the two sets of weight files? When do we use one and when do we use the other?
How is the smoothed signal computed? Does it literally smooth the raw signal, or does it require going back to the raw data?
I see the metadata files in "geographical_scope" and lots of code, but am lost in the weeds. What's the general strategy for aggregating zip code up to the other geographic levels? Summing raw counts? Population weights? A particular pre-computed transition matrix?

Apologies for the all the questions. I've done my best going through the code, but I am still having trouble with some of the details. If there is some other documentation no in the main GH that would answer some of these, please point we to them as well!

statsmaths commented 4 years ago

Working on creating a single script that replicates a cleaned-up version of main logic of the existing code. Currently committed on the fb branch with: 4fc709b649a1725096aa1ad6d34c5bcc45b18c16. See the file here for the current code: test_script.R

krivard commented 4 years ago

Monumentous!

Some additional weird logic: Currently I call run.R three times (once for each of three different output sets), and while we always start at 00:00:00 and end at 23:59:59, each output needs a different date range (it's gross, i know). Taking the run from 2020-05-21 as an example:

--output	--start	--end	--load?
cids	2020-05-20	2020-05-20	no
individual	2020-05-14	2020-05-20	no
covidalert; archive; community	2020-04-06	2020-05-20	yes

Why this is necessary:

The CID lists have to bracket a single day of data
The individual response sets need to receive small amounts of backfill for 4 days but then should level off; we use the extra days to double check there are no more changes
The smoothed aggregations have to run from the beginning of the study or else the figures diverge from what's in the API

We can probably merge (1) and (2) together by just saying only output CIDs for the most recent day in the range. Then if we solve (3), we can put everything on the same 7-day range. If not, we can probably do a similar thing with "only output individual responses for the last seven days".

We have occasionally needed to do mass re-generations of cumulative CID lists for debugging purposes, so whatever we do should permit control through params.json even if there is a standard configuration for the normal automated behavior.

Small nit: you probably meant name instead of "full" here:

write_cid <- function(data, name)
{
  fname <- sprintf(
    "cvid_cids_%s_response_%s_-_%s.csv",
    "full",
    format(start_time, "%H_%M_%Y_%m_%d", tz = tz_to),
    format(end_time, "%H_%M_%Y_%m_%d", tz = tz_to)
  )

statsmaths commented 4 years ago

Thanks for the additional information and catching the bug in write_cid. I was confused about where the past data was being merged in for the smoothing step; good to know that I was not missing anything but it simply has not been implemented. I think we can build in some logic to avoid having to run the entire thing three times, while still making it flexible for backfill. A brute-force solution would be to include three different start dates, but as you suggest perhaps something more elegant can be done as well.

Likely I am misunderstanding something in the smoothing logic, but I would have thought that was only a need to go back 7 days given that it seems that smoothing only occurs within a 7-day window.

I pushed a revised version of my single script that now handles the smoothing of the signal, grabbing the community data, and storing the archive. From there, it should be close to ready to wrap up into documented and tested functions. It will need some very close code review, as it is not as easy for me to directly verify that it produces output consistent with the existing code.

statsmaths commented 4 years ago

I pushed a commit today that has a packaged version of the refacted/rewritten facebook survey code. You can see the current structure here.

Generally, it's well covered with both unit and integration tests. For integration tests, I have no real data to compare against, so I made another very small set with only 20 responses put into just 2 zip codes to try to sanity check that the results are reasonable. While its possible that there are some edge-case bugs, I believe that the current code is mostly accurate for the archive, weights, and individual outputs. For the community and covidcast outputs, I believe that the raw and weighted versions are also correct.

There are now two blocking things that I am not entirely understanding. Namely:

I understand the standard binary point estimate and standard errors produced for the community survey. However, I am not familiar with the methods being used for the household counts (or at least, do not recognise the method from the code). I believe my code's logic follows what's in the current script, but hard to feel very confident without knowing what the output is supposed to do. This is particularly true when introducing weights and smoothing. The current implementation has a lot of intermediate variables that are created—some do not even seem to be used anywhere—that make parsing the method even more complicated.
For smoothing, what is supposed to happen when there are missing values (no data in a region for a particular day). Similarly, how is smoothing handled when the smoothing window goes beyond the time-frame of the existing data. This is even more confusing when there are sample weights and I'm not sure what is supposed to be done.

krivard commented 4 years ago

The household CLI and ILI is computed using importance sampling; more information may be found here:

Master COVIDcast methods doc
Facebook methods doc (older)
https://statweb.stanford.edu/~owen/mc/Ch-var-is.pdf; particularly 9.3: Importance sampling diagnostics, equation 9.13

For smoothing, in theory we're taking the weighted average of all data collected in the last seven days. If there is not at least one day of raw data in the last 7, we don't report a smoothed value for that day.

statsmaths commented 4 years ago

Thanks. Trying to read through these docs, but I'm still somewhat confused though (sorry!). I actually find the documentation you sent fairly straightforward, but I'm not sure how it completely explains the code.

Specifically, there is a lot of code that seems to be focused on something related to "mixing". For example, in aggregations-setup.R on lines 161-187. I can't figure out where this switch statement is described in the documentation or what I should expect it to do.

Searching through the three documents you linked did not surface anything with a similar name to "Mixing Coefficient".

Thanks again for dealing with all my questions!

krivard commented 4 years ago

Ah right, mixing -- sorry about that. The mixing code ensures that if some household is responsible for more than 1/100 of the score for the whole dayXregion bucket for whatever reason, then for small dayXregions (n<=100) we replace the normalized weight with 1/n, and for larger ones we use a number in between 1/n and the normalized weight, where "in between" depends on exactly how large the dayXregion is versus the maximum normalized weight in the dayXregion.

This should 100% be added to the signals explanation in the PDF -- @ryantibs @brookslogan I know documentation is no one's priority but this is the 3rd time I've had to explain this part of the facebook process (to umd twice, to lester, and now to taylor) and my stats knowledge is decidedly underpowered for the job. Can one of you either update the signals doc, or coordinate with me to schedule a call and I'll take dictation?

statsmaths commented 4 years ago

Great, glad I was not going crazy.

Currently this is not implemented for the Facebook Community surveys. That makes sense: we were previously not creating weighted versions and it's just a binary response, so no household can be more than 1/100 of the score if there are more than 100 households (the current reporting cut-off). I was asked to build in a weighted version of the community survey. Should mixing be implemented there as well?

krivard commented 4 years ago

Oof, probably yes, though I'm not sure if the change in statistic means any of the mixing thresholds need to change.

ryantibs commented 4 years ago

I wasn't even aware of this! @brookslogan It'd be ideal if Logan could just directly put a new part in signal_descriptions.tex to explain this.

statsmaths commented 4 years ago

I just pushed edits to the refactored facebook code. I believe that it now covers everything other than megacounties. Though as mentioned in my other notes, these will need to be carefully checked and possible debugged by someone with access to the full dataset.

@ryantibs @brookslogan — Going through the documentation, there are actually a number of differences between / details missing in the current documentation for the facebook signal. For example:

The documentation describes a standard frequentist estimator of the community ILI signal, but the code implements a Bayesian estimator with a Beta(0.5, 0.5) prior. The standard error is, however, still in terms of the frequentist estimator.
The code uses Jeffrey's correction to the standard error of the HH ILI and CLI signals, but this is not documented anywhere, or justified in any code comments.
The documentation doesn't mention the "weight_in_geography" weights and how they are treated. This is confusing because at some points in the code there can be 4 or more different sets of weights all mixed together.
The documentation does not document how sample size is computed. With all the different types of weight floating around, this get quite complicated.
As mentioned above, there's no description of the mixing logic to handle large inverse probability weights.

capnrefsmmat commented 4 years ago

Yes, the Jeffreys SE was applied to many of our survey estimators, not just Facebook, primarily because the Facebook household signal occasionally reports rates of exactly 0 with corresponding SEs of 0, which made for very misleading time series plots. We decided to use the Jeffreys approach to avoid this. Logan then did simulations and found that the Jeffreys estimate of proportion had worse properties for the household signal, but the SE was fine; I think there's a separate file somewhere containing his simulations. We changed all other proportion estimators to be consistent with this, even though most of them never report estimates of exactly 0.

I definitely agree this should be clearly documented, though.

brookslogan commented 4 years ago

Recapping message regarding FB HH signals: we wanted a unified approach to adjusting se's across different sources (to at least avoid se=0), but FB HH is a little different from others based on binomial proportions. Alex suggested the moral equivalent of the Jeffreys prior-based adjustment was to add a single pseudo-observation of 0.5. However, simulations suggested that this would introduce significant MSE in the mean estimates for "small" numbers of households, and that a significant portion or most county estimates were based on "small" numbers of households. So we went with the following:

use the existing (weighted) sample mean estimate as-is, with no adjustment
add a pseudo-observation of 0.5 (with a weight to count it as a single effective observation) when calculating the se. (In the formula for se, use residuals measured against the unadjusted mean estimate.) This avoids se=0, removes the need for use of UseSDPercent{X}LI, and improves the RMSE of se estimates (where se!=0).

The (uncommented) simulations are at cmu-delphi/covid-19/facebook/sensor/logan/se_adjustment.R.

capnrefsmmat commented 4 years ago

The good news: with one simple typo fix (b1f2281, which probably should get a unit test too), the pipeline can run and produce all the unweighted output files. The output is not obviously insane, but I have not checked it against the real pipeline.

The bad news: The weighted output pipeline does not run. Because I'm only using one weights file, most responses have no weight, and mix_weights does not account for the possibility of weights being NA. Also, it took the pipeline 20 minutes to get to this point, and it still hadn't gotten to any aggregation above counties.

I'm not sure why this failure doesn't occur with the synthetic data, since I deliberately included responses with no weights. I'll test it out more.

capnrefsmmat commented 4 years ago

I was able to make the pipeline faster with the tiny tweak in 2cf180c; see details in the commit message. There's still plenty of room to improve. Fully 50% of the code's time is spent inside past_n_days, mainly in ymd parsing dates. If there's some way to parse all the dates once, or use Dates and never parse, the pipeline could be much faster.

Before you go and optimize it, @krivard is considering using this as part of the INI hackathon, since it's a nice discrete problem for them to solve and could get them to do other interesting things for us.

The pipeline failure was a mistake in my synthetic data generator. I intended it to fail to provide weights for some tokens, but (I think by luck) it provided weights for every token. Hence the new pipeline failed whenever a weight was not available and it was producing weighted output. I fixed this in b6e08e1.

I'm still waiting for the pipeline to finish running, but it has successfully produced some of the weighted output files, and they are not obviously crazy.

@krivard, before you check if it produces matching output, you have to set "debug" to false in params.json, since debug mode seems to drop the output filtering threshold to a sample size of 2 instead of 100, as well as tweaking some other constants (see read_params in utils.R).

One other code comment for @statsmaths: create_data_for_aggregatation and filter_data_for_aggregatation are spelled incorrectly.

statsmaths commented 4 years ago

Thanks for taking a look at the code and seeing how it runs on the real data! Just to re-iterate what I said yesterday, I was handing this off not because I thought it was production-ready. Rather, I had hit a wall where I needed some help from someone with access to the actual data. Synthetic data only goes so far because I was not able to get the current code base to run over it. Not at all surprised there are some bugs and inefficiencies.

Just a few comments about the current code:

If the largest inefficiency is in past_n_days, that's an easy fix. It's only used to figure out which days in the past to grab for pooling. Right now it is called for each unique region (which is completely pointless). It could either be replaced with a look-up table, cached with memorisation, or just pulled out of the loop.
Currently, the only tests that I wrote are unit tests for the low-level helper functions and integration tests (in test-run.R) that verify reasonable results on a very small test dataset. I have not yet done unit tests on the intermediate-level functions.
I will be curious to see how well the current code reproduces the existing data. Is it "correct" for the simplest case of unweighted, raw community values? If it is correct for some geographic levels or particular geo_ids and not others, there's likely either an issue with the geo-lookup tables or a rare edge-case regarding the data filtering. If it is highly correlated but never quite correct, it may be a filtering issue. Once we can verify one of them works as expected, it should be easier to track down any remaining bugs/misunderstandings.
Good catch. Yes, I am a terrible speller.

Maybe we can discuss on the call today the best way forward. Happy to talk independently through the code as well, if that would be helpful.

capnrefsmmat commented 4 years ago

I think the inefficiency is not just past_n_days but everything in the loops in summarize_binary and summarize_hh_count; all of the filtering to calculate index must similarly be done to every row in the table for every loop iteration.

I think a faster approach would skip expand.grid. Instead, create an empty data frame in advance, then manually loop over days and geo_ids. Once per day, calculate the matching dates; once per geo_id, subset the data frame to just the matching rows. This can bypass a lot of the filtering and indexing going on.

Now that it runs, @krivard could set params.json to just generate data for a single day and a single type (say, community), with debug set to false, and see how the output looks in comparison to the original pipeline.

statsmaths commented 4 years ago

That makes more sense, as I would not have expected past_n_day to be particular bottleneck.

I don't think the expand.grid itself is the issue (it's just creating the empty data frame you are talking about). But I do see how it could be slightly faster to use a double loop over date/geo_id rather than over the rows of the empty data frame. That would allow moving some of the logic into the intermediate part of the loop.

The real challenge to me are the mixing coefficients. If it were not for these, we could compute summary statistics, both weighted and unweighted, for each combination of (zip code, day) and base everything on those. I was trying to think of a way around this, but couldn't come up with an obvious (and not overly involved) solution.

krivard commented 4 years ago

I've completed a run to compare output on data from 2020-04-06 to 2020-04-10, and the results are quite a bit off even just by what set of counties gets output:

$ wc -l cvc-indicators-version/20200406_county_raw_cli.csv covid-19-version/20200406_county_raw_cli.csv
   31 cvc-indicators-version/20200406_county_raw_cli.csv
  270 covid-19-version/20200406_county_raw_cli.csv
  301 total

Sample diff (another three screens of missing counties not included here)

diff -y cvc-indicators-version/20200406_county_raw_cli.csv covid-19-version/20200406_county_raw_cli.csv
geo_id,val,se,sample_size                                       geo_id,val,se,sample_size
04013,44.3572906,1.4498736,548.8718743                        | 01073,1.4132586,0.7604313,233.526
04019,52.6954747,2.7525858,164.9430944                        | 01089,0.7408512,0.5982235,174.4847
06037,43.7372538,2.4792932,140.5401676                        | 01097,1.09375,0.7034121,160
08005,39.0494591,2.7415579,110.4806125                        | 02020,0.3683377,0.4558624,180.993
08031,55.3128546,2.530657,131.3631438                         | 04013,1.0746494,0.2098514,1915.622
08041,40.251613999999996,2.7959407,120.0561602                | 04015,1.3071895,1.1199949,102
09009,46.6691439,3.2180663,101.999963                         | 04019,1.2135946,0.4110941,537.7963
15003,41.5030172,3.1969783,110                                | 04021,0.3576469,0.3563916,178.3127
17031,48.3062743,1.806412,357.2605875                         | 04025,0,0.4246168,115.4545
18097,49.0224168,2.8644961,133.7436918                        | 05007,0.8661774,0.6275794,118.8501
21111,47.9458318,3.2493972,116.1586522                        | 05119,0.314969,0.399733,188.9697
25017,47.8523074,2.7616648,148.647317                         | 05143,1.7282808,1.2799919,103.0884
26125,51.7855901,3.1446726,122.472028                         | 06001,0,0.3691187,133.079
26163,42.9263965,2.3757308,186.0581247                        | 06013,1.1243836,1.0415355,105.9102
27053,52.1378772,2.8961085,138.4090241                        | 06037,0.8330712,0.3660343,623.3862
29095,47.1695129,3.2910707,104.0123169                        | 06059,1.2986553,0.7705809,230.5052
29189,46.5558509,2.9522391,125.8928667                        | 06065,0.6824326,0.5655071,203.1644
32003,44.7634749,2.0542837,278                                | 06067,0.5718681,0.6326293,174.8062
35001,50.1359557,3.2817277,106.9704347                        | 06071,0.5871644,0.6507125,167.1167
39049,53.2175648,3.0944069,112.1256353                        | 06073,0.4973844,0.337318,318.0391
40109,45.5708293,3.0880728,114.8838432                        | 06085,0.3223091,0.4476771,154.1283
40143,50.7868425,2.902578,121.8608058                         | 08001,1.0730273,0.5142252,268.6464
41051,44.8330718,2.7092214,126.89848310000001                 | 08005,1.9754036,0.6413415,319.8024
42003,49.28301,2.8914538,101.2077825                          | 08013,0.7446767,0.5845977,168.4264
48201,38.9161671,3.0010727,109.6267507                        | 08031,1.831484,0.5693519,409.6245
49035,46.9128873,2.5511804,175.9871201                        | 08035,2.8148411,1.0407039,209.7201
53033,51.2549591,1.7877749,294.6656528                        | 08041,1.3853901,0.5192886,404.7802
53053,44.7066819,2.9447276,112.08183820000001                 | 08059,1.0846624,0.5252888,298.0171
53061,45.6034937,2.3997136,125.975825                         | 08069,1.0261345,0.6658186,191.8521
55079,47.3288811,3.2039896,121.9999487                        | 08101,1.5888499,1.213278,104.8403
                                                              > 08123,1.9584283,0.8917057,160.79
                                                              > 09001,1.1411065,0.474968,331.9404
                                                              > 09003,0.7673776,0.4099026,347.018
                                                              > 09009,1.5786981,0.6288789,342.9508
                                                              > 09011,2.9359863,1.3433356,138.2995
                                                              > 10003,1.4772609,0.76354,247.4745
                                                              > 10005,1.8899608,1.2920725,113.3599
                                                              > 11001,0.5722461,0.3934854,233
                                                              > 12011,1.4157706,0.7486386,186
                                                              > 12031,1.4362475,1.0040805,121.6342
                                                              > 12057,0,0.2696038,182.0031
                                                              > 12071,0,0.4157254,118.3756
                                                              > 12086,0.1650167,0.2951203,201.9998
                                                              > 12095,0.4707617,0.5201097,211.9404

To help reduce the need to back-and-forth this, I've run the synthetic data through the current pipeline, using the covidcast-indicators debug settings for the various thresholds, to get you a set of gold results to compare with... obvious in retrospect we should have done this at the beginning. ~~Here is a tarball of the gold results on the synthetic data.~~

statsmaths commented 4 years ago

Thank you very much; that's exactly what I needed. I've been playing around with the data now that I have something to test against and have identified several issues that were likely behind some the largest difference between our estimators. For one thing, I had the definition of the A1_ values reversed, which is why my estimates are much larger than yours (N.B: there are not yet pushed; just locally fixed on my end at the moment).

And apologies, did not mean to close the issue. Will update with questions as I continue to work through the differences.

statsmaths commented 4 years ago

After going back and forth over the data all afternoon, I think I have a pipeline that is much closer to reproducing the original results. At the state and HRR levels, I can approximately reproduce the "raw_ili", "raw_cli", "smoothed_cli", and "smoothed_ili". The values and returned regions are exact; the standard errors and sample sizes are within a factor of 0.5 from one another (most are exact matches, but there are some small differences). Most of the county and MSA regions are exact matches for the raw data, but the values start diverging for the smoothed estimates.

I am starting to think that there may be some difference between the county weights that I am using and the ones that were used to produced the gold standard. Are the files in "covid-19/geographical_scope" the correct files? I am finding some difference between the sample sizes in my output and the gold results that I cannot wrap my head around given the data. Specifically, there are a few raw, unweighted gold standard sample sizes that are non-integers, but manually looking at my geo data all of the corresponding zip codes lie entirely in one state.

capnrefsmmat commented 4 years ago

The files in covid-19/geographical_scope should indeed be the exact ones used by the existing pipeline.

Are you saying that when producing a state estimate, the unweighted code produces non-integer sample sizes, even though all the zip codes reporting for that state are entirely contained within the state? If so, can this be affected by the mixing logic?

statsmaths commented 4 years ago

Yes, exactly. Here's the gold version of the ILI state-level data for Virginia (sample size is 51.6715):

[local] $ cat receiving_gold/20200509_state_raw_ili.csv | grep "va"
geo_id,val,se,sample_size,effective_sample_size
va,0,0.9451159,51.6715,51.903563

But when I pull from the raw data, all of zip codes in the dataset from that time period at entire within Virginia (at least to some rounding error):

library(readr); library(dplyr); library(stringi); library(delphiFacebook)

# create crosswalk file zip5 => state
cw <- produce_crosswalk_list("static")
state_zips <- cw$state[cw$state$geo_id == "va",]

input_data <- read_csv(dir("input", full.names = TRUE))
input_data <- input_data[seq(3, nrow(input_data)),]         # the first two rows are junk
input_data$day <- stri_sub(input_data$StartDate, 1, 10)     # TZ is already in Pacific

input_data <- arrange(input_data, StartDate)
input_data <- filter(input_data, !duplicated(token))

state_data <- input_data[!is.na(match(input_data$A3, state_zips$zip5)),]
state_data <- filter(state_data, day == "2020-05-09")
state_data <- left_join(state_data, state_zips, by = c("A3" = "zip5"))
range(state_data$weight_in_location)

[1] 0.9999 1.0000

My understanding of the mixing logic is that it should only effect the unweighted responses. Any idea what might be going on here?

krivard commented 4 years ago

My understanding of the mixing logic is that it should only effect the unweighted responses

False -- Mixing is done on both weighted and unweighted signals. For unweighted signals, the importance weight for each survey response is set to 1, and if the zip is entirely within the region being aggregated the location weight is also 1, but then the HouseholdMixedWeight (=importance weight times location weight) is normalized before it is put through the mixing logic. Whether or not it's affected by mixing will depend on your s_weight setting.

Mixing should not affect sample_size though, since it's derived from the location weight, not the mixed weight.

I'll take a look at that example and see what I can pull out from the reference implementation.

krivard commented 4 years ago

This response appears to be the culprit:

  ResponseId        WeightInLocation ZIP5 
  <chr>                        <dbl> <chr>
1 HbywejwldJxxKAoEC            0.672 24523

> state.membership %>>% dplyr::filter(ZIP5==24523)
# A tibble: 2 x 4
  ZIP5  LocationCode LocationName WeightInLocation
  <chr> <chr>        <chr>                   <dbl>
1 24523 statefips51  va                      0.672
2 24523 NA           NA                      0.328

$ grep "^24523" 02_20_uszips.csv 
24523,51019,Bedford,428,49004,20085,Bedford,VA,Virginia,TRUE,37.34003,-79.5221,,32.5,"{'51019':67.17,'51515':32.83}",Bedford|Bedford,51019|51515,FALSE,FALSE,America/New_York,573,ROANOKE-LYNCHBURG,31340,"Lynchburg, VA (Metropolitan Statistical Area)"

$ grep "^[^,]*,51515" ../../geographical_scope/02_20_uszips.csv
$

So it looks like the zip code maps to two counties, one of which is in theory in virginia (51 FIPS prefix) but is not listed in the mapping file. We know that our zip->county mappings are a little outdated (#35) but it's not clear what the actual right thing is here -- clearly the new code takes the state prefix to mean "it's 100% in virginia so 🤷‍♀️" while the old code says "it's 32% somewhere that does not actually exist, so there's no way to know whether they live in virginia or not 🤷‍♀️".

For the sake of finishing the debugging, I'd recommend matching the reference implementation for now; once we know we can match it we can either revert to believing the FIPS prefix and record the change in the release notes, or just wait until #35 fixes everybody at once.

(also worth asking what the other sensors are doing with bogus FIPS codes; consistency across data sources would be nice)

statsmaths commented 4 years ago

Thanks! Yes, making that change seems to fix the original issue. However, I would argue that my original code is correct and that the current implementation is a bug. The file 02_20_uszips.csv is a lookup table for zip codes; when a zip is in more than one county/state it's mapped to the county and state that occupies the largest proportion of the zip. While rare, there are some county-equivalents that are not the majority part of any zip code, and so you cannot just use to the zip lookup table as a proxy crosswalk for county=>state. However, I do think its best to try to match the current implementation in the interest of time.

Regarding the comment about the mixed weights, yes 100% agree. Just a problem with there being an overloading of the term "weighted". I only meant that if all the weights (both inverse probability and geographic) are uniform, there should (and really can't) be any mixing.

Unfortunately, I've hit another discrepancy that I can't figure out how to fix. It's a bit easier to show, though, because you only need to look at the output. Here is one row in the gold standard output for HRR=270:

[local] $ grep "270," receiving_gold/20200510_hrr_smoothed_cli.csv
270,2.7777778,4.1926104,12,12

My code agrees on the value and sample sizes (yay!), but has a slightly different SE. I am fairly certain that the raw data for this HRR has 12 households where one household of 3 people has 1 person with CLI. Here's the calculation I have for the standard error:

x <- c(1, rep(0, 11))
n <- c(3, rep(1, 11)) # not sure of n_i, i>1, but shouldn't effect calculation
m <- 12

val <- 100 / m * sum(x / n)
se_raw <- 100 * 1 / m * sqrt(sum((x / n - val / 100)^2))
se_jeffrey <- sqrt((50 - val)^2 + m^2 * se_raw^2) / (1 + m)

cat(sprintf("val = %f, se_raw = %f, se_jeffrey = %f\n", val, se_raw, se_jeffrey))

val = 2.777778, se_raw = 2.659520, se_jeffrey = 4.384249

That's my understanding of what the computation is supposed to be according to the documentation. You'll see it matches the val exactly and the SE is close but too far off enough to not be due to rounding error. This is raw HRR, so there should not be any pesky weights messing something up (plus, the reported sample size is an integer). I'm worried that I am missing something about the "correct" definition of the SE.

krivard commented 4 years ago

It looks like your se_raw is off? I get 2.27 in the reference implementation. Note that we take the raw daily aggregations first, then use skiproll to pool the results, including the standard deviation:

SDPercentCLI = sqrt(pmax(0,zoo::rollapplyr(SDPercentCLI^2*RowWeight^2,k,sum,partial=TRUE,na.rm=TRUE)/WindowWeight^2))

where RowWeight is the daily sample size and WindowWeight is the 7-day sample size.

I think what that means is that instead of doing

(x_i / n_i - val / 100)

for the i-th response in the window whose average CLI is val,

we're doing

(x_ij / n_ij -  val_j / 100)

for the i-th response on the day whose average CLI is val_j. In this instance, you're subtracting (0.333 - 0.278), where we were subtracting (0.333 - 0.111), since the household with 1/3 CLI occurred on a day with 2 other households, both 0. This does sound like a bug according to the declared smoothing method, but I also don't know what the correct description of this implementation would be, or whether there's a way to re-center an existing se computation around a new mean after discarding the individual sample data, as we do in the reference impl.

Here are the intermediate values for this HRR according to the reference implementation, starting from individual responses:

? individual records 
# A tibble: 12 x 5
# Groups:   TimeInterval, LocationCode, LocationName [3]
   LocationCode LocationName TimeInterval HouseholdPercentCLI HouseholdMixedWeight
   <chr>        <chr>        <date>                     <dbl>                <dbl>
 1 hrr270       HRR 270      2020-05-08                   0                  0.333
 2 hrr270       HRR 270      2020-05-08                   0                  0.333
 3 hrr270       HRR 270      2020-05-08                   0                  0.333
 4 hrr270       HRR 270      2020-05-09                   0                  0.167
 5 hrr270       HRR 270      2020-05-09                   0                  0.167
 6 hrr270       HRR 270      2020-05-09                   0                  0.167
 7 hrr270       HRR 270      2020-05-09                   0                  0.167
 8 hrr270       HRR 270      2020-05-09                   0                  0.167
 9 hrr270       HRR 270      2020-05-09                   0                  0.167
10 hrr270       HRR 270      2020-05-10                  33.3                0.333
11 hrr270       HRR 270      2020-05-10                   0                  0.333
12 hrr270       HRR 270      2020-05-10                   0                  0.333
? raw aggregations 
# A tibble: 3 x 6
  Date       HRRnum PercentCLI SDPercentCLI NumberResponses EffectiveNumberResponses
  <date>      <int>      <dbl>        <dbl>           <dbl>                    <dbl>
1 2020-05-08    270        0           0                  3                        3
2 2020-05-09    270        0           0                  6                        6
3 2020-05-10    270       11.1         9.07               3                        3
? smoothing setup 
# A tibble: 3 x 9
# Groups:   HRRnum [1]
  Date       HRRnum PercentCLI SDPercentCLI NumberResponses EffectiveNumberResponses Valid RowWeight WindowWeight
  <date>      <int>      <dbl>        <dbl>           <dbl>                    <dbl> <lgl>     <dbl>        <dbl>
1 2020-05-08    270        0           0                  3                        3 TRUE          3            3
2 2020-05-09    270        0           0                  6                        6 TRUE          6            9
3 2020-05-10    270       11.1         9.07               3                        3 TRUE          3           12
? after 7-day smoothing 
# A tibble: 3 x 6
  Date       HRRnum PercentCLI SDPercentCLI NumberResponses EffectiveNumberResponses
  <date>      <int>      <dbl>        <dbl>           <dbl>                    <dbl>
1 2020-05-08    270       0            0                  3                        3
2 2020-05-09    270       0            0                  9                        9
3 2020-05-10    270       2.78         2.27              12                       12
? after jeffreys 
# A tibble: 3 x 6
  Date       HRRnum PercentCLI SDPercentCLI NumberResponses EffectiveNumberResponses
  <date>      <int>      <dbl>        <dbl>           <dbl>                    <dbl>
1 2020-05-08    270       0           12.5                3                        3
2 2020-05-09    270       0            5                  9                        9
3 2020-05-10    270       2.78         4.19              12                       12

statsmaths commented 4 years ago

Gotcha. Yes, that does seem like a bug in either the implementation or in the documentation (your pick). The later explains that its is simply doing 7-day pooling of all the data. And yes, it is possible to "redo" the standard errors around a new mean, though I'm not 100% sure that holds when there are sample weights.

This leads to a larger question: is the value MixingCoefficient recomputed for each 7-day window, or are HouseholdMixedWeight just computed for each day and used in the rollup? Again, according to the documentation it would seem as though it should be..... If not, that really changes my approach. It also explains why my implementation is so much slower as I am doing a lot more work than the reference implementation (I recompute the mixing coefficient for every single estimate).

krivard commented 4 years ago

HouseholdMixedWeight is computed once for each timeInterval X region, then used in the rollup. TimeInterval is either day or epiweek; day is used in the smoothed figures. This makes the mixing conservative -- no household light enough for its day X region can be too heavy for any smoothing window, but some households light enough for one or more of their smoothing windows will nevertheless use the down-adjusted weight from being too heavy for their day X region.

statsmaths commented 4 years ago

Thanks and understood. Yes, I was following the documentation and implemented the exact, less conservative mixing method. That likely a non-trivial reason my code was running slower.

At this point, I would like to suggest that either (1) before I continue, someone updates the reference documentation to actually match all the edge-cases and nuances** implemented in the code, or (2) the reimplementation task be transitioned over someone who both has access to actual facebook data and was involved in producing the original code. It does not seem very productive for me to spend hours hunting after all of these edge cases only to discover that the only mistake is that the reference implementation is not following its own spec.

** For example: how to handle the different different types of weights, mixing, the selective use of Bayesian SE corrections, the selective use of Bayesian point estimators, the sample size and effective sample size computations, explaining how pooling is actually supposed to work.

krivard commented 4 years ago

Agreed, at this point we want to use the limited time we have you for on more traditional code reviews and not debugging. @capnrefsmmat and @amartyabasu can take over. Thanks!

statsmaths commented 4 years ago

Great. Just committed my final set of changes in ea1c980079f496050165e43c6e9292ee801922d0. If there are any questions about what I've done or where I was continuing to run into divergent results, feel free to ping me as needed.

capnrefsmmat commented 4 years ago

I'll plan to pick this up this week, first by expanding the method documentation so it's clearer what I'm implementing; once the code works, we can pass it to @amartyabasu to generalize to other signals from the Facebook survey.

capnrefsmmat commented 4 years ago

Update: 05d7b1e changes the mixing weight logic to match (I think?) the original version, as well as adding a lot of comments explaining it. I have not run the code yet, though; I'll instead look through other functions for weight-related topics, update their documentation, and make sure the logic appears to match the original.

capnrefsmmat commented 4 years ago

The weight changes mean that all of test-gold.R (which tests the output on synthetic data against the old pipeline's output) passes except the standard errors, so I will investigate SEs next.

capnrefsmmat commented 4 years ago

To-do list for me and @amartyabasu:

[x] Implement megacounty aggregation in the new pipeline.
[ ] Determine if the sample_size column in the output files should contain sample size or effective sample size (which adjusts for the weighting). The old pipeline provides both columns, but the API only reports sample size; would it be more appropriate to report effective sample size? (@brookslogan, any opinion?)
[x] Expand the README.md to document the entire pipeline, the meaning of the weights, and so on. I'm expanding the function documentation, but it'd be useful to have a high-level overview.
[x] Figure out why the SEs I'm generating are not correct, and make all tests pass.
[x] Produce weighted community question signal and test it (see #37). We do have to include weights regardless -- we have to account for ZIP5s sometimes spanning multiple counties, which is handled by weighting responses.
[x] Do a bit of profiling and make modest performance improvements. (Enough to make it useful; we can save fancy optimization for the hackathon.)

capnrefsmmat commented 4 years ago

Brain dump before I forget:

I've tracked down part of the problem to how geographic aggregations are done in the two systems. Imagine there is one ZIP code that spans multiple counties within the same state.

In the old code, when we do state aggregations, we do it by mapping ZIP -> county -> state. At the last step, we group together by ZIP and state, so that a ZIP in multiple counties only has one entry in the ZIP -> state mapping.

In the new code, this step is not done. The result is that when there are observations in ZIPs that span counties, and we join with the ZIP -> state mapping, we end up with multiple rows for each observation. Those rows have the correct WeightInLocation (which accounts for ZIPs not being completely contained within counties), so those weights sum to the same value as in the old code.

But since we end up with rows with various weights -- unlike the old code, where the ZIPs are grouped together and tend to have WeightInLocation of 1 -- the mixing code can kick in and change the weighting, resulting in different estimates.

I believe the old code's geographic aggregation behavior is more desirable, since the fact that ZIPs can span counties is not relevant to how we should estimate states. I'll work on fixing this.

capnrefsmmat commented 4 years ago

ef8a042 indeed fixes it! We now match the reference pipeline for raw rates at states, MSAs, and HRRs. Megacounties are not implemented, so counties do not match (I added this to the list above).

The weighted pipelines still produce the wrong estimates and SEs, but since the raw pipelines are okay, I feel better about my prospects of tracking this down.

Status of the household question (still not checking on community):

Signal	County	MSA	HRR	State
Raw, no smoothing	needs megacounties	✓	✓	✓
Raw, smoothing	needs megacounties	WIP	WIP	✓
Weighted, no smoothing	needs megacounties	WIP	WIP	WIP
Weighted, smoothing	needs megacounties	WIP	WIP	WIP

I am going to ignore smoothing for now (as noted above, Taylor implemented it differently from the original, and I think Taylor's method is actually better insofar as it just applies the exact same logic to 7-day windows as it does to days, rather than grouping days to get windows) and focus on making weights work correctly.

krivard commented 4 years ago

Updated gold set of results on the synthetic data: http://rinkitink.ml.cmu.edu/delphi/misc/gold-5june.tgz

capnrefsmmat commented 4 years ago

My latest commits get us to here:

Signal	County	MSA	HRR	State
Raw, no smoothing	needs megacounties	✓	✓	✓
Raw, smoothing	needs megacounties	?	?	✓
Weighted, no smoothing	needs megacounties	✓	✓	✓
Weighted, smoothing	needs megacounties	?	?	?

This is for the cli and ili signals. Here ? is simply because we expect the smoothed output to be different.

Next I must do the same for all the binary community signals calculated in binary.R; I expect that now that I have everything worked out, this should be easier. (But I am prepared for the programming gods to smite me.)

capnrefsmmat commented 4 years ago

The binary community signals now match the old pipeline, except that geo_ids like HRRs are reported as numbers and are not left-padded to three digits. That should be easy to hack around.

This includes weighted community signals, though the old pipeline did not generate these, so we have nothing to test against.

Open questions about the community questions:

Should the same mixing logic be applied to them? Presently, no mixing is done, which means an estimate can be disproportionately influenced by one observation if survey weights and location weights should so demand. It's almost trivial to turn on the mixing here.
What should the weighted versions be named? The code will give them the name raw_wnohh_cmnty_cli and smoothed_wnohh_cmnty_cli (and same for hh_cmnty_cli), rather than raw_nohh_cmnty_wcli etc.

I will just tidy up the tests and documentation, since Amartya is working on megacounties, which are the last step to switching over to this pipeline for the API.

capnrefsmmat commented 4 years ago

With the latest commits to https://github.com/cmu-delphi/covidcast-indicators/pull/100, the Facebook pipeline now produces community signals correctly and produces all megacounties. The only obstacles to deployment are

[x] check that the individual response-level output files match the old ones, so users do not see a change
[x] make the code fast enough to be practical for daily use
[ ] hooking it up into Brian's automation framework

Separately, I should work on expanded documentation, e.g. of the difference between full weights and step 1 weights, and of the individual file output format and question coding.

krivard commented 4 years ago

[x] Complete the port of the validation system

Validation is a crucial part of the current Facebook pipeline; I fix validation errors every few days.

krivard commented 4 years ago

The validation system is nearly complete, and it's complete enough to show validation errors when we run the new pipeline on real data. The errors are more than one would expect from backfill or boundary effects from smoothing, so something is still up.

krivard commented 4 years ago

Remaining tasks are being tracked in #150 and #59

cmu-delphi / covidcast-indicators

[fb-survey] synthetic dataset and unit tests #20