Closed krivard closed 4 years ago
See https://github.com/cmu-delphi/covid-19/pull/73
I'll leave this open for @statsmaths to note any additions or changes that have to be done to the synthetic data, and to track the cleanup and testing of the Facebook code.
Collecting here a list of cases the code habitually needs to handle that won't necessarily get caught by the synthetic data:
- `--full` arguments, keeping the earliest record for a `token` when sorted by `StartDate`
- `--full` arguments with a savefile when `--load` is specified, keeping the earliest record for a `token` when sorted by `StartDate`, and excluding that record from `recoded.cleaned.recorded.partA` iff it was originally excluded (debugging this was no fun at all)

Integrity checks can't be run with any kind of synthetic data. They include:
We used to need to handle files that were encoded in Mountain time instead of Pacific, but with the Qualtrics API that's not necessary anymore. It should stay in, just in case anyone ever needs to load an old file (generated before 20 April), but it's not a testing priority.
I have had a chance now to start looking at the synthetic data and existing code. There is a lot of code in the current pipeline that has been copied and slightly changed (with little to no comments). This makes it nearly impossible to reorganise the way I did for JHU and GHT, namely refactoring the code into smaller functions and cleaning up as I went along. It seems like a better approach will be to start over with a new code base, borrowing existing code chunks where possible. In order to do this, I need to understand what the code is supposed to be doing. Note that I am (at least currently) only attacking what's covered in the R code inside the `covid-19/facebook` directory, and not considering the pull of data from the `covid-19/qualtrics-cron` directory.
Here is my current understanding of the code and data:
The field `token` is what we use to match with the weights from Facebook, and the field `A3` captures the self-reported zip code. Currently we are interested in the responses `A1_1` through `A1_5`, which are of the form "Does someone in your household have this symptom?"; the data are recorded as 1="No", 2="Yes", and NA="Did not answer".

Here are the outstanding questions I have and was not able to answer from the code after looking at it for the better part of the day:
Apologies for all the questions. I've done my best going through the code, but I am still having trouble with some of the details. If there is some other documentation not in the main GH that would answer some of these, please point me to them as well!
Working on creating a single script that replicates a cleaned-up version of main logic of the existing code. Currently committed on the fb branch with: 4fc709b649a1725096aa1ad6d34c5bcc45b18c16. See the file here for the current code: test_script.R
Monumentous!
Some additional weird logic: Currently I call run.R three times (once for each of three different output sets), and while we always start at 00:00:00 and end at 23:59:59, each output needs a different date range (it's gross, I know). Taking the run from 2020-05-21 as an example:
--output | --start | --end | --load? |
---|---|---|---|
cids | 2020-05-20 | 2020-05-20 | no |
individual | 2020-05-14 | 2020-05-20 | no |
covidalert; archive; community | 2020-04-06 | 2020-05-20 | yes |
Why this is necessary:
We can probably merge (1) and (2) together by just saying only output CIDs for the most recent day in the range. Then if we solve (3), we can put everything on the same 7-day range. If not, we can probably do a similar thing with "only output individual responses for the last seven days".
We have occasionally needed to do mass re-generations of cumulative CID lists for debugging purposes, so whatever we do should permit control through params.json even if there is a standard configuration for the normal automated behavior.
Small nit: you probably meant `name` instead of `"full"` here:
```r
write_cid <- function(data, name)
{
  fname <- sprintf(
    "cvid_cids_%s_response_%s_-_%s.csv",
    "full",
    format(start_time, "%H_%M_%Y_%m_%d", tz = tz_to),
    format(end_time, "%H_%M_%Y_%m_%d", tz = tz_to)
  )
```
Thanks for the additional information and catching the bug in `write_cid`. I was confused about where the past data was being merged in for the smoothing step; good to know that I was not missing anything but it simply has not been implemented. I think we can build in some logic to avoid having to run the entire thing three times, while still making it flexible for backfill. A brute-force solution would be to include three different start dates, but as you suggest, perhaps something more elegant can be done as well.
Likely I am misunderstanding something in the smoothing logic, but I would have thought there was only a need to go back 7 days, given that smoothing only seems to occur within a 7-day window.
I pushed a revised version of my single script that now handles the smoothing of the signal, grabbing the community data, and storing the archive. From there, it should be close to ready to wrap up into documented and tested functions. It will need some very close code review, as it is not as easy for me to directly verify that it produces output consistent with the existing code.
I pushed a commit today that has a packaged version of the refactored/rewritten facebook survey code. You can see the current structure here.
Generally, it's well covered with both unit and integration tests. For integration tests, I have no real data to compare against, so I made another very small set with only 20 responses in just 2 zip codes to sanity check that the results are reasonable. While it's possible that there are some edge-case bugs, I believe that the current code is mostly accurate for the `archive`, `weights`, and `individual` outputs. For the `community` and `covidcast` outputs, I believe that the raw and weighted versions are also correct.
There are now two blocking things that I am not entirely understanding. Namely:
The household CLI and ILI are computed using importance sampling; more information may be found here:
For smoothing, in theory we're taking the weighted average of all data collected in the last seven days. If there is not at least one day of raw data in the last 7, we don't report a smoothed value for that day.
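A minimal sketch of that pooling rule (my own illustration, not the pipeline's code; the data frame columns are made up):

```r
# Toy sketch of 7-day pooling: each day's smoothed value is the
# sample-size-weighted average of all daily means in the trailing window.
# If a window contains no raw data, no smoothed value is reported.
pool_7day <- function(daily, k = 7) {
  sapply(seq_len(nrow(daily)), function(i) {
    in_win <- daily$date > daily$date[i] - k & daily$date <= daily$date[i]
    window <- daily[in_win, ]
    if (nrow(window) == 0) return(NA_real_)
    sum(window$mean * window$n) / sum(window$n)
  })
}

# Daily means and sample sizes for three consecutive days
daily <- data.frame(
  date = as.Date("2020-05-08") + 0:2,
  mean = c(0, 0, 100 / 9),
  n    = c(3, 6, 3)
)
round(pool_7day(daily), 2)  # 0.00 0.00 2.78
```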
Thanks. Trying to read through these docs, but I'm still somewhat confused (sorry!). I actually find the documentation you sent fairly straightforward, but I'm not sure it completely explains the code.
Specifically, there is a lot of code that seems to be focused on something related to "mixing". For example, in aggregations-setup.R on lines 161-187. I can't figure out where this switch statement is described in the documentation or what I should expect it to do.
Searching through the three documents you linked did not surface anything with a similar name to "Mixing Coefficient".
Thanks again for dealing with all my questions!
Ah right, mixing -- sorry about that. The mixing code ensures that if some household is responsible for more than 1/100 of the score for the whole dayXregion bucket for whatever reason, then for small dayXregions (n<=100) we replace the normalized weight with 1/n, and for larger ones we use a number in between 1/n and the normalized weight, where "in between" depends on exactly how large the dayXregion is versus the maximum normalized weight in the dayXregion.
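My attempt to restate that as code (the blend formula in the large-n case is my guess at what "in between" means, chosen so the heaviest mixed weight is capped at 1/100; it is not copied from the pipeline):

```r
# Hypothetical sketch of the mixing rule, for one dayXregion bucket.
# w: normalized household weights summing to 1.
mix_weights_sketch <- function(w) {
  n <- length(w)
  if (max(w) <= 1 / 100) return(w)       # no household is too heavy
  if (n <= 100) return(rep(1 / n, n))    # small bucket: uniform weights
  # Larger bucket: blend toward uniform just enough to cap the maximum,
  # solving m/n + (1 - m) * max(w) = 1/100 for the mixing coefficient m.
  m <- (max(w) - 1 / 100) / (max(w) - 1 / n)
  m / n + (1 - m) * w
}

w <- c(0.5, rep(0.5 / 149, 149))   # one household carries half the bucket
max(mix_weights_sketch(w))         # capped at 0.01
sum(mix_weights_sketch(w))         # still sums to 1
```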
This should 100% be added to the signals explanation in the PDF -- @ryantibs @brookslogan I know documentation is no one's priority but this is the 3rd time I've had to explain this part of the facebook process (to umd twice, to lester, and now to taylor) and my stats knowledge is decidedly underpowered for the job. Can one of you either update the signals doc, or coordinate with me to schedule a call and I'll take dictation?
Great, glad I was not going crazy.
Currently this is not implemented for the Facebook Community surveys. That makes sense: we were previously not creating weighted versions and it's just a binary response, so no household can be more than 1/100 of the score if there are more than 100 households (the current reporting cut-off). I was asked to build in a weighted version of the community survey. Should mixing be implemented there as well?
Oof, probably yes, though I'm not sure if the change in statistic means any of the mixing thresholds need to change.
I wasn't even aware of this! @brookslogan It'd be ideal if Logan could just directly put a new part in signal_descriptions.tex to explain this.
I just pushed edits to the refactored facebook code. I believe that it now covers everything other than megacounties. Though as mentioned in my other notes, these will need to be carefully checked and possibly debugged by someone with access to the full dataset.
@ryantibs @brookslogan — Going through the documentation, there are actually a number of differences between / details missing in the current documentation for the facebook signal. For example:
Yes, the Jeffreys SE was applied to many of our survey estimators, not just Facebook, primarily because the Facebook household signal occasionally reports rates of exactly 0 with corresponding SEs of 0, which made for very misleading time series plots. We decided to use the Jeffreys approach to avoid this. Logan then did simulations and found that the Jeffreys estimate of proportion had worse properties for the household signal, but the SE was fine; I think there's a separate file somewhere containing his simulations. We changed all other proportion estimators to be consistent with this, even though most of them never report estimates of exactly 0.
I definitely agree this should be clearly documented, though.
Recapping message regarding FB HH signals: we wanted a unified approach to adjusting se's across different sources (to at least avoid se=0), but FB HH is a little different from others based on binomial proportions. Alex suggested the moral equivalent of the Jeffreys prior-based adjustment was to add a single pseudo-observation of 0.5. However, simulations suggested that this would introduce significant MSE into the mean estimates for "small" numbers of households, and that a significant portion (or most) of county estimates were based on "small" numbers of households. So we went with the following:
The (uncommented) simulations are at `cmu-delphi/covid-19/facebook/sensor/logan/se_adjustment.R`.
The good news: with one simple typo fix (b1f2281, which probably should get a unit test too), the pipeline can run and produce all the unweighted output files. The output is not obviously insane, but I have not checked it against the real pipeline.
The bad news: the weighted output pipeline does not run. Because I'm only using one weights file, most responses have no weight, and `mix_weights` does not account for the possibility of weights being `NA`. Also, it took the pipeline 20 minutes to get to this point, and it still hadn't gotten to any aggregation above counties.
I'm not sure why this failure doesn't occur with the synthetic data, since I deliberately included responses with no weights. I'll test it out more.
I was able to make the pipeline faster with the tiny tweak in 2cf180c; see details in the commit message. There's still plenty of room to improve: fully 50% of the code's time is spent inside `past_n_days`, mainly in `ymd` parsing dates. If there's some way to parse all the dates once, or use Dates and never parse, the pipeline could be much faster.
Before you go and optimize it, @krivard is considering using this as part of the INI hackathon, since it's a nice discrete problem for them to solve and could get them to do other interesting things for us.
The pipeline failure was a mistake in my synthetic data generator. I intended it to fail to provide weights for some tokens, but (I think by luck) it provided weights for every token. Hence the new pipeline failed whenever a weight was not available and it was producing weighted output. I fixed this in b6e08e1.
I'm still waiting for the pipeline to finish running, but it has successfully produced some of the weighted output files, and they are not obviously crazy.
@krivard, before you check if it produces matching output, you have to set `"debug"` to `false` in `params.json`, since debug mode seems to drop the output filtering threshold to a sample size of 2 instead of 100, as well as tweaking some other constants (see `read_params` in `utils.R`).
One other code comment for @statsmaths: `create_data_for_aggregatation` and `filter_data_for_aggregatation` are spelled incorrectly.
Thanks for taking a look at the code and seeing how it runs on the real data! Just to re-iterate what I said yesterday, I was handing this off not because I thought it was production-ready. Rather, I had hit a wall where I needed some help from someone with access to the actual data. Synthetic data only goes so far because I was not able to get the current code base to run over it. Not at all surprised there are some bugs and inefficiencies.
Just a few comments about the current code:
- Regarding `past_n_days`, that's an easy fix. It's only used to figure out which days in the past to grab for pooling. Right now it is called for each unique region (which is completely pointless). It could either be replaced with a look-up table, cached with memoisation, or just pulled out of the loop.
- There are integration tests (`test-run.R`) that verify reasonable results on a very small test dataset. I have not yet done unit tests on the intermediate-level functions.

Maybe we can discuss on the call today the best way forward. Happy to talk independently through the code as well, if that would be helpful.
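The look-up-table fix could be as simple as computing each day's pooling window once, keyed by date, outside the region loop. A sketch (this `past_n_days` is a stand-in with the behavior I'm assuming from the thread, not the pipeline's actual function):

```r
# Stand-in for the pipeline's past_n_days: the vector of dates in the
# trailing n-day pooling window ending on `day`.
past_n_days <- function(day, n = 7) seq(day - (n - 1), day, by = "day")

# Precompute the window once per day, before looping over regions.
days <- as.Date("2020-05-04") + 0:6
window_table <- lapply(seq_along(days), function(i) past_n_days(days[i]))
names(window_table) <- as.character(days)

# Inside the per-region loop, the repeated call becomes a cheap lookup:
window_table[["2020-05-10"]]  # the 7 dates 2020-05-04 through 2020-05-10
```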
I think the inefficiency is not just `past_n_days` but everything in the loops in `summarize_binary` and `summarize_hh_count`; all of the filtering to calculate `index` must similarly be done to every row in the table for every loop iteration.
I think a faster approach would skip `expand.grid`. Instead, create an empty data frame in advance, then manually loop over days and geo_ids. Once per day, calculate the matching dates; once per geo_id, subset the data frame to just the matching rows. This can bypass a lot of the filtering and indexing going on.
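That double loop might look something like this (a sketch with made-up column names and a plain mean standing in for the real summarize functions):

```r
# Sketch: compute window membership once per day, then subset once per
# geo_id, so no per-row filtering is repeated across iterations.
summarize_fast <- function(df, days, geo_ids, k = 7) {
  out <- vector("list", length(days) * length(geo_ids))
  i <- 0
  for (d in seq_along(days)) {
    day <- days[d]
    in_window <- df$day > day - k & df$day <= day   # once per day
    for (geo in geo_ids) {
      rows <- df[in_window & df$geo_id == geo, ]    # once per geo_id
      i <- i + 1
      out[[i]] <- data.frame(day = day, geo_id = geo,
                             val = mean(rows$val), n = nrow(rows))
    }
  }
  do.call(rbind, out)
}

df <- data.frame(day = as.Date("2020-05-01") + c(0, 0, 1),
                 geo_id = "a", val = c(1, 3, 5))
summarize_fast(df, as.Date("2020-05-02"), "a")  # val = 3, n = 3
```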
Now that it runs, @krivard could set `params.json` to just generate data for a single day and a single type (say, community), with `debug` set to false, and see how the output looks in comparison to the original pipeline.
That makes more sense, as I would not have expected `past_n_days` to be a particular bottleneck.
I don't think the `expand.grid` itself is the issue (it's just creating the empty data frame you are talking about). But I do see how it could be slightly faster to use a double loop over date/geo_id rather than over the rows of the empty data frame. That would allow moving some of the logic into the intermediate part of the loop.
The real challenge to me is the mixing coefficients. If it were not for these, we could compute summary statistics, both weighted and unweighted, for each combination of (zip code, day) and base everything on those. I was trying to think of a way around this, but couldn't come up with an obvious (and not overly involved) solution.
I've completed a run to compare output on data from 2020-04-06 to 2020-04-10, and the results are quite a bit off even just by what set of counties gets output:
```
$ wc -l cvc-indicators-version/20200406_county_raw_cli.csv covid-19-version/20200406_county_raw_cli.csv
  31 cvc-indicators-version/20200406_county_raw_cli.csv
 270 covid-19-version/20200406_county_raw_cli.csv
 301 total
```
Sample diff (another three screens of missing counties not included here)
diff -y cvc-indicators-version/20200406_county_raw_cli.csv covid-19-version/20200406_county_raw_cli.csv
geo_id,val,se,sample_size geo_id,val,se,sample_size
04013,44.3572906,1.4498736,548.8718743 | 01073,1.4132586,0.7604313,233.526
04019,52.6954747,2.7525858,164.9430944 | 01089,0.7408512,0.5982235,174.4847
06037,43.7372538,2.4792932,140.5401676 | 01097,1.09375,0.7034121,160
08005,39.0494591,2.7415579,110.4806125 | 02020,0.3683377,0.4558624,180.993
08031,55.3128546,2.530657,131.3631438 | 04013,1.0746494,0.2098514,1915.622
08041,40.251613999999996,2.7959407,120.0561602 | 04015,1.3071895,1.1199949,102
09009,46.6691439,3.2180663,101.999963 | 04019,1.2135946,0.4110941,537.7963
15003,41.5030172,3.1969783,110 | 04021,0.3576469,0.3563916,178.3127
17031,48.3062743,1.806412,357.2605875 | 04025,0,0.4246168,115.4545
18097,49.0224168,2.8644961,133.7436918 | 05007,0.8661774,0.6275794,118.8501
21111,47.9458318,3.2493972,116.1586522 | 05119,0.314969,0.399733,188.9697
25017,47.8523074,2.7616648,148.647317 | 05143,1.7282808,1.2799919,103.0884
26125,51.7855901,3.1446726,122.472028 | 06001,0,0.3691187,133.079
26163,42.9263965,2.3757308,186.0581247 | 06013,1.1243836,1.0415355,105.9102
27053,52.1378772,2.8961085,138.4090241 | 06037,0.8330712,0.3660343,623.3862
29095,47.1695129,3.2910707,104.0123169 | 06059,1.2986553,0.7705809,230.5052
29189,46.5558509,2.9522391,125.8928667 | 06065,0.6824326,0.5655071,203.1644
32003,44.7634749,2.0542837,278 | 06067,0.5718681,0.6326293,174.8062
35001,50.1359557,3.2817277,106.9704347 | 06071,0.5871644,0.6507125,167.1167
39049,53.2175648,3.0944069,112.1256353 | 06073,0.4973844,0.337318,318.0391
40109,45.5708293,3.0880728,114.8838432 | 06085,0.3223091,0.4476771,154.1283
40143,50.7868425,2.902578,121.8608058 | 08001,1.0730273,0.5142252,268.6464
41051,44.8330718,2.7092214,126.89848310000001 | 08005,1.9754036,0.6413415,319.8024
42003,49.28301,2.8914538,101.2077825 | 08013,0.7446767,0.5845977,168.4264
48201,38.9161671,3.0010727,109.6267507 | 08031,1.831484,0.5693519,409.6245
49035,46.9128873,2.5511804,175.9871201 | 08035,2.8148411,1.0407039,209.7201
53033,51.2549591,1.7877749,294.6656528 | 08041,1.3853901,0.5192886,404.7802
53053,44.7066819,2.9447276,112.08183820000001 | 08059,1.0846624,0.5252888,298.0171
53061,45.6034937,2.3997136,125.975825 | 08069,1.0261345,0.6658186,191.8521
55079,47.3288811,3.2039896,121.9999487 | 08101,1.5888499,1.213278,104.8403
> 08123,1.9584283,0.8917057,160.79
> 09001,1.1411065,0.474968,331.9404
> 09003,0.7673776,0.4099026,347.018
> 09009,1.5786981,0.6288789,342.9508
> 09011,2.9359863,1.3433356,138.2995
> 10003,1.4772609,0.76354,247.4745
> 10005,1.8899608,1.2920725,113.3599
> 11001,0.5722461,0.3934854,233
> 12011,1.4157706,0.7486386,186
> 12031,1.4362475,1.0040805,121.6342
> 12057,0,0.2696038,182.0031
> 12071,0,0.4157254,118.3756
> 12086,0.1650167,0.2951203,201.9998
> 12095,0.4707617,0.5201097,211.9404
To help reduce the back-and-forth on this, I've run the synthetic data through the current pipeline, using the covidcast-indicators debug settings for the various thresholds, to get you a set of gold results to compare with... it's obvious in retrospect we should have done this at the beginning. Here is a tarball of the gold results on the synthetic data.
Thank you very much; that's exactly what I needed. I've been playing around with the data now that I have something to test against and have identified several issues that were likely behind some of the largest differences between our estimators. For one thing, I had the definition of the `A1_` values reversed, which is why my estimates are much larger than yours (N.B.: these are not yet pushed; just locally fixed on my end at the moment).
And apologies, did not mean to close the issue. Will update with questions as I continue to work through the differences.
After going back and forth over the data all afternoon, I think I have a pipeline that is much closer to reproducing the original results. At the state and HRR levels, I can approximately reproduce the "raw_ili", "raw_cli", "smoothed_cli", and "smoothed_ili". The values and returned regions are exact; the standard errors and sample sizes are within a factor of 0.5 from one another (most are exact matches, but there are some small differences). Most of the county and MSA regions are exact matches for the raw data, but the values start diverging for the smoothed estimates.
I am starting to think that there may be some difference between the county weights that I am using and the ones that were used to produce the gold standard. Are the files in `covid-19/geographical_scope` the correct files? I am finding some differences between the sample sizes in my output and the gold results that I cannot wrap my head around given the data. Specifically, there are a few raw, unweighted gold-standard sample sizes that are non-integers, but manually looking at my geo data, all of the corresponding zip codes lie entirely in one state.
The files in `covid-19/geographical_scope` should indeed be the exact ones used by the existing pipeline.
Are you saying that when producing a state estimate, the unweighted code produces non-integer sample sizes, even though all the zip codes reporting for that state are entirely contained within the state? If so, can this be affected by the mixing logic?
Yes, exactly. Here's the gold version of the ILI state-level data for Virginia (sample size is 51.6715):
```
[local] $ cat receiving_gold/20200509_state_raw_ili.csv | grep "va"
geo_id,val,se,sample_size,effective_sample_size
va,0,0.9451159,51.6715,51.903563
```
But when I pull from the raw data, all of the zip codes in the dataset from that time period lie entirely within Virginia (at least to rounding error):
```r
library(readr); library(dplyr); library(stringi); library(delphiFacebook)

# create crosswalk file zip5 => state
cw <- produce_crosswalk_list("static")
state_zips <- cw$state[cw$state$geo_id == "va", ]

input_data <- read_csv(dir("input", full.names = TRUE))
input_data <- input_data[seq(3, nrow(input_data)), ]  # the first two rows are junk
input_data$day <- stri_sub(input_data$StartDate, 1, 10)  # TZ is already in Pacific

input_data <- arrange(input_data, StartDate)
input_data <- filter(input_data, !duplicated(token))

state_data <- input_data[!is.na(match(input_data$A3, state_zips$zip5)), ]
state_data <- filter(state_data, day == "2020-05-09")
state_data <- left_join(state_data, state_zips, by = c("A3" = "zip5"))

range(state_data$weight_in_location)
[1] 0.9999 1.0000
```
My understanding of the mixing logic is that it should only affect the unweighted responses. Any idea what might be going on here?
> My understanding of the mixing logic is that it should only affect the unweighted responses

False -- mixing is done on both weighted and unweighted signals. For unweighted signals, the importance weight for each survey response is set to 1, and if the zip is entirely within the region being aggregated the location weight is also 1, but then the HouseholdMixedWeight (= importance weight times location weight) is normalized before it is put through the mixing logic. Whether or not it's affected by mixing will depend on your `s_weight` setting.
Mixing should not affect `sample_size` though, since it's derived from the location weight, not the mixed weight.
I'll take a look at that example and see what I can pull out from the reference implementation.
This response appears to be the culprit:
```
  ResponseId        WeightInLocation ZIP5
  <chr>                        <dbl> <chr>
1 HbywejwldJxxKAoEC            0.672 24523

> state.membership %>>% dplyr::filter(ZIP5==24523)
# A tibble: 2 x 4
  ZIP5  LocationCode LocationName WeightInLocation
  <chr> <chr>        <chr>                   <dbl>
1 24523 statefips51  va                      0.672
2 24523 NA           NA                      0.328

$ grep "^24523" 02_20_uszips.csv
24523,51019,Bedford,428,49004,20085,Bedford,VA,Virginia,TRUE,37.34003,-79.5221,,32.5,"{'51019':67.17,'51515':32.83}",Bedford|Bedford,51019|51515,FALSE,FALSE,America/New_York,573,ROANOKE-LYNCHBURG,31340,"Lynchburg, VA (Metropolitan Statistical Area)"

$ grep "^[^,]*,51515" ../../geographical_scope/02_20_uszips.csv
$
```
So it looks like the zip code maps to two counties, one of which is in theory in Virginia (51 FIPS prefix) but is not listed in the mapping file. We know that our zip->county mappings are a little outdated (#35), but it's not clear what the actual right thing is here -- clearly the new code takes the state prefix to mean "it's 100% in Virginia so 🤷‍♀️" while the old code says "it's 32% somewhere that does not actually exist, so there's no way to know whether they live in Virginia or not 🤷‍♀️".
For the sake of finishing the debugging, I'd recommend matching the reference implementation for now; once we know we can match it we can either revert to believing the FIPS prefix and record the change in the release notes, or just wait until #35 fixes everybody at once.
(also worth asking what the other sensors are doing with bogus FIPS codes; consistency across data sources would be nice)
Thanks! Yes, making that change seems to fix the original issue. However, I would argue that my original code is correct and that the current implementation is a bug. The file `02_20_uszips.csv` is a lookup table for zip codes; when a zip is in more than one county/state, it's mapped to the county and state that occupy the largest proportion of the zip. While rare, there are some county-equivalents that are not the majority part of any zip code, and so you cannot just use the zip lookup table as a proxy crosswalk for county=>state. However, I do think it's best to try to match the current implementation in the interest of time.
Regarding the comment about the mixed weights, yes, 100% agree. It's just a problem with the term "weighted" being overloaded. I only meant that if all the weights (both inverse probability and geographic) are uniform, there shouldn't (and really can't) be any mixing.
Unfortunately, I've hit another discrepancy that I can't figure out how to fix. It's a bit easier to show, though, because you only need to look at the output. Here is one row in the gold standard output for HRR=270:
```
[local] $ grep "270," receiving_gold/20200510_hrr_smoothed_cli.csv
270,2.7777778,4.1926104,12,12
```
My code agrees on the value and sample sizes (yay!), but has a slightly different SE. I am fairly certain that the raw data for this HRR has 12 households where one household of 3 people has 1 person with CLI. Here's the calculation I have for the standard error:
```r
x <- c(1, rep(0, 11))
n <- c(3, rep(1, 11))  # not sure of n_i for i > 1, but it shouldn't affect the calculation
m <- 12

val <- 100 / m * sum(x / n)
se_raw <- 100 * 1 / m * sqrt(sum((x / n - val / 100)^2))
se_jeffrey <- sqrt((50 - val)^2 + m^2 * se_raw^2) / (1 + m)

cat(sprintf("val = %f, se_raw = %f, se_jeffrey = %f\n", val, se_raw, se_jeffrey))

val = 2.777778, se_raw = 2.659520, se_jeffrey = 4.384249
```
That's my understanding of what the computation is supposed to be according to the documentation. You'll see it matches the `val` exactly, and the SE is close but off by enough that it can't be rounding error. This is raw HRR, so there should not be any pesky weights messing something up (plus, the reported sample size is an integer). I'm worried that I am missing something about the "correct" definition of the SE.
It looks like your `se_raw` is off? I get 2.27 in the reference implementation. Note that we take the raw daily aggregations first, then use skiproll to pool the results, including the standard deviation:

```r
SDPercentCLI = sqrt(pmax(0, zoo::rollapplyr(SDPercentCLI^2 * RowWeight^2, k, sum, partial = TRUE, na.rm = TRUE) / WindowWeight^2))
```

where `RowWeight` is the daily sample size and `WindowWeight` is the 7-day sample size.
I think what that means is that instead of doing `(x_i / n_i - val / 100)` for the i-th response in the window, whose average CLI is `val`, we're doing `(x_ij / n_ij - val_j / 100)` for the i-th response on day j, whose average CLI is `val_j`. In this instance, you're subtracting (0.333 - 0.278) where we were subtracting (0.333 - 0.111), since the household with 1/3 CLI occurred on a day with 2 other households, both 0. This does sound like a bug according to the declared smoothing method, but I also don't know what the correct description of this implementation would be, or whether there's a way to re-center an existing SE computation around a new mean after discarding the individual sample data, as we do in the reference impl.
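To make the difference concrete, here is a quick check (my own, unweighted) that daily SEs pooled by sample size reproduce the reference value of 2.27 for this HRR, where the single-window formula gave 2.66:

```r
# Daily val/SE are computed against each day's own mean, then pooled with
# RowWeight = daily n and WindowWeight = 7-day n, per the rollapplyr line.
daily_stats <- function(x, n) {
  m   <- length(x)
  val <- 100 / m * sum(x / n)
  se  <- 100 / m * sqrt(sum((x / n - val / 100)^2))
  c(val = val, se = se, m = m)
}
d1 <- daily_stats(rep(0, 3), rep(1, 3))    # 2020-05-08: 3 households, all 0
d2 <- daily_stats(rep(0, 6), rep(1, 6))    # 2020-05-09: 6 households, all 0
d3 <- daily_stats(c(1, 0, 0), c(3, 1, 1))  # 2020-05-10: SE about 9.07

row_w  <- c(d1["m"], d2["m"], d3["m"])
ses    <- c(d1["se"], d2["se"], d3["se"])
pooled <- sqrt(sum(ses^2 * row_w^2)) / sum(row_w)
round(pooled, 2)  # 2.27
```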
Here are the intermediate values for this HRR according to the reference implementation, starting from individual responses:
? individual records
# A tibble: 12 x 5
# Groups: TimeInterval, LocationCode, LocationName [3]
LocationCode LocationName TimeInterval HouseholdPercentCLI HouseholdMixedWeight
<chr> <chr> <date> <dbl> <dbl>
1 hrr270 HRR 270 2020-05-08 0 0.333
2 hrr270 HRR 270 2020-05-08 0 0.333
3 hrr270 HRR 270 2020-05-08 0 0.333
4 hrr270 HRR 270 2020-05-09 0 0.167
5 hrr270 HRR 270 2020-05-09 0 0.167
6 hrr270 HRR 270 2020-05-09 0 0.167
7 hrr270 HRR 270 2020-05-09 0 0.167
8 hrr270 HRR 270 2020-05-09 0 0.167
9 hrr270 HRR 270 2020-05-09 0 0.167
10 hrr270 HRR 270 2020-05-10 33.3 0.333
11 hrr270 HRR 270 2020-05-10 0 0.333
12 hrr270 HRR 270 2020-05-10 0 0.333
? raw aggregations
# A tibble: 3 x 6
Date HRRnum PercentCLI SDPercentCLI NumberResponses EffectiveNumberResponses
<date> <int> <dbl> <dbl> <dbl> <dbl>
1 2020-05-08 270 0 0 3 3
2 2020-05-09 270 0 0 6 6
3 2020-05-10 270 11.1 9.07 3 3
? smoothing setup
# A tibble: 3 x 9
# Groups: HRRnum [1]
Date HRRnum PercentCLI SDPercentCLI NumberResponses EffectiveNumberResponses Valid RowWeight WindowWeight
<date> <int> <dbl> <dbl> <dbl> <dbl> <lgl> <dbl> <dbl>
1 2020-05-08 270 0 0 3 3 TRUE 3 3
2 2020-05-09 270 0 0 6 6 TRUE 6 9
3 2020-05-10 270 11.1 9.07 3 3 TRUE 3 12
? after 7-day smoothing
# A tibble: 3 x 6
Date HRRnum PercentCLI SDPercentCLI NumberResponses EffectiveNumberResponses
<date> <int> <dbl> <dbl> <dbl> <dbl>
1 2020-05-08 270 0 0 3 3
2 2020-05-09 270 0 0 9 9
3 2020-05-10 270 2.78 2.27 12 12
? after jeffreys
# A tibble: 3 x 6
Date HRRnum PercentCLI SDPercentCLI NumberResponses EffectiveNumberResponses
<date> <int> <dbl> <dbl> <dbl> <dbl>
1 2020-05-08 270 0 12.5 3 3
2 2020-05-09 270 0 5 9 9
3 2020-05-10 270 2.78 4.19 12 12
Gotcha. Yes, that does seem like a bug in either the implementation or the documentation (your pick). The latter explains that it is simply doing 7-day pooling of all the data. And yes, it is possible to "redo" the standard errors around a new mean, though I'm not 100% sure that holds when there are sample weights.
This leads to a larger question: is the value `MixingCoefficient` recomputed for each 7-day window, or are the `HouseholdMixedWeight`s just computed for each day and used in the rollup? According to the documentation, it would seem as though it should be recomputed. If not, that really changes my approach. It also explains why my implementation is so much slower, as I am doing a lot more work than the reference implementation (I recompute the mixing coefficient for every single estimate).
`HouseholdMixedWeight` is computed once for each timeInterval X region, then used in the rollup. TimeInterval is either day or epiweek; day is used in the smoothed figures. This makes the mixing conservative -- no household light enough for its day X region can be too heavy for any smoothing window, but some households light enough for one or more of their smoothing windows will nevertheless use the down-adjusted weight from being too heavy for their day X region.
Thanks, and understood. Yes, I was following the documentation and implemented the exact, less conservative mixing method. That is likely a non-trivial part of why my code was running slower.
At this point, I would like to suggest that either (1) before I continue, someone updates the reference documentation to actually match all the edge cases and nuances** implemented in the code, or (2) the reimplementation task be transitioned over to someone who both has access to actual Facebook data and was involved in producing the original code. It does not seem very productive for me to spend hours hunting down all of these edge cases only to discover that the only mistake is that the reference implementation is not following its own spec.

** For example: how to handle the different types of weights, mixing, the selective use of Bayesian SE corrections, the selective use of Bayesian point estimators, the sample size and effective sample size computations, and explaining how pooling is actually supposed to work.
Agreed. At this point we want to use the limited time we have with you on more traditional code reviews, not debugging. @capnrefsmmat and @amartyabasu can take over. Thanks!
Great. Just committed my final set of changes in ea1c980079f496050165e43c6e9292ee801922d0. If there are any questions about what I've done or where I was continuing to run into divergent results, feel free to ping me as needed.
I'll plan to pick this up this week, first by expanding the method documentation so it's clearer what I'm implementing; once the code works, we can pass it to @amartyabasu to generalize to other signals from the Facebook survey.
Update: 05d7b1e changes the mixing weight logic to match (I think?) the original version, as well as adding a lot of comments explaining it. I have not run the code yet, though; I'll instead look through other functions for weight-related topics, update their documentation, and make sure the logic appears to match the original.
With the weight changes, all of `test-gold.R` (which tests the output on synthetic data against the old pipeline's output) passes except the standard errors, so I will investigate SEs next.
To-do list for me and @amartyabasu:

- Decide whether the `sample_size` column in the output files should contain sample size or effective sample size (which adjusts for the weighting). The old pipeline provides both columns, but the API only reports sample size; would it be more appropriate to report effective sample size? (@brookslogan, any opinion?)
- Write a `README.md` to document the entire pipeline, the meaning of the weights, and so on. I'm expanding the function documentation, but it'd be useful to have a high-level overview.

Brain dump before I forget:
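For reference, the standard (Kish) effective sample size under unequal weights is `(sum w)^2 / sum(w^2)`. This is the usual adjustment, though not necessarily exactly what the old pipeline computed:

```r
# Kish's effective sample size: penalizes unequal weights. With equal
# weights it equals the raw sample size; skewed weights pull it toward 1.
# Illustrative -- not necessarily the old pipeline's exact formula.
effective_sample_size <- function(w) sum(w)^2 / sum(w^2)
```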
I've tracked down part of the problem to how geographic aggregations are done in the two systems. Imagine there is one ZIP code that spans multiple counties within the same state.
In the old code, when we do state aggregations, we do it by mapping ZIP -> county -> state. At the last step, we group together by ZIP and state, so that a ZIP in multiple counties only has one entry in the ZIP -> state mapping.
In the new code, this step is not done. The result is that when there are observations in ZIPs that span counties, and we join with the ZIP -> state mapping, we end up with multiple rows for each observation. Those rows have the correct WeightInLocation (which accounts for ZIPs not being completely contained within counties), so those weights sum to the same value as in the old code.
But since we end up with rows with various weights -- unlike the old code, where the ZIPs are grouped together and tend to have WeightInLocation of 1 -- the mixing code can kick in and change the weighting, resulting in different estimates.
I believe the old code's geographic aggregation behavior is more desirable, since the fact that ZIPs can span counties is not relevant to how we should estimate states. I'll work on fixing this.
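A minimal sketch of the grouping step described above, using base R and an illustrative crosswalk (column names are hypothetical, not the pipeline's):

```r
# Collapse the ZIP -> county -> state crosswalk to one row per (zip, state),
# summing the county-level weights, before joining with observations.
# The crosswalk data here is made up for illustration.
crosswalk <- data.frame(
  zip                = c("00001", "00001", "00002"),
  county             = c("A", "B", "C"),
  state              = c("XX", "XX", "XX"),
  weight_in_location = c(0.6, 0.4, 1.0)
)
zip_state <- aggregate(weight_in_location ~ zip + state, crosswalk, sum)
# ZIP 00001 now maps to state XX exactly once, with weight 0.6 + 0.4 = 1.0,
# so the mixing logic sees a single row per observation.
```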
ef8a042 indeed fixes it! We now match the reference pipeline for raw rates at states, MSAs, and HRRs. Megacounties are not implemented, so counties do not match (I added this to the list above).
The weighted pipelines still produce the wrong estimates and SEs, but since the raw pipelines are okay, I feel better about my prospects of tracking this down.
Status of the household question (still not checking on community):
Signal | County | MSA | HRR | State |
---|---|---|---|---|
Raw, no smoothing | needs megacounties | ✓ | ✓ | ✓ |
Raw, smoothing | needs megacounties | WIP | WIP | ✓ |
Weighted, no smoothing | needs megacounties | WIP | WIP | WIP |
Weighted, smoothing | needs megacounties | WIP | WIP | WIP |
I am going to ignore smoothing for now (as noted above, Taylor implemented it differently from the original, and I think Taylor's method is actually better insofar as it just applies the exact same logic to 7-day windows as it does to days, rather than grouping days to get windows) and focus on making weights work correctly.
Updated gold set of results on the synthetic data: http://rinkitink.ml.cmu.edu/delphi/misc/gold-5june.tgz
My latest commits get us to here:
Signal | County | MSA | HRR | State |
---|---|---|---|---|
Raw, no smoothing | needs megacounties | ✓ | ✓ | ✓ |
Raw, smoothing | needs megacounties | ? | ? | ✓ |
Weighted, no smoothing | needs megacounties | ✓ | ✓ | ✓ |
Weighted, smoothing | needs megacounties | ? | ? | ? |
This is for the cli and ili signals. Here "?" simply means we expect the smoothed output to be different.
Next I must do the same for all the binary community signals calculated in `binary.R`; I expect that now that I have everything worked out, this should be easier. (But I am prepared for the programming gods to smite me.)
The binary community signals now match the old pipeline, except that geo_ids like HRRs are reported as numbers and are not left-padded to three digits. That should be easy to hack around.
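The padding itself is a one-liner in base R; `pad_geo_id` is a hypothetical helper name:

```r
# Left-pad numeric geo_ids (e.g. HRRs) to three digits.
# `pad_geo_id` is a hypothetical helper, not a pipeline function.
pad_geo_id <- function(x) formatC(as.integer(x), width = 3, flag = "0")
pad_geo_id(c(5, 42, 301))  # "005" "042" "301"
```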
This includes weighted community signals, though the old pipeline did not generate these, so we have nothing to test against.
Open questions about the community questions:

- Whether the weighted signals should be named `raw_wnohh_cmnty_cli` and `smoothed_wnohh_cmnty_cli` (and the same for `hh_cmnty_cli`), rather than `raw_nohh_cmnty_wcli` etc.

I will just tidy up the tests and documentation, since Amartya is working on megacounties, which are the last step to switching over to this pipeline for the API.
With the latest commits to https://github.com/cmu-delphi/covidcast-indicators/pull/100, the Facebook pipeline now produces community signals correctly and produces all megacounties. The only obstacles to deployment are
Separately, I should work on expanded documentation, e.g. of the difference between full weights and step 1 weights, and of the individual file output format and question coding.
Validation is a crucial part of the current Facebook pipeline; I fix validation errors every few days.
The validation system is nearly complete, and it's complete enough to show validation errors when we run the new pipeline on real data. The errors are more than one would expect from backfill or boundary effects from smoothing, so something is still up.
Remaining tasks are being tracked in #150 and #59.