cmu-delphi / covidcast-indicators

Back end for producing indicators and loading them into the COVIDcast API.
https://cmu-delphi.github.io/delphi-epidata/api/covidcast.html

JHU changed Puerto Rico death reporting, causing us to fail to report it #179

Closed: capnrefsmmat closed this issue 3 years ago

capnrefsmmat commented 4 years ago

No Puerto Rico cases or deaths data has been available in the API since July 17:

> covidcast_signal("jhu-csse", "deaths_incidence_num", geo_type="state", geo_values="pr", start_day="2020-07-15")
A `covidcast_signal` data frame with 3 rows and 10 columns.

signals     : jhu-csse:deaths_incidence_num
geo_type    : state

  geo_value time_value direction      issue lag value stderr sample_size
1        pr 2020-07-15        NA 2020-07-18   3     2     NA          NA
2        pr 2020-07-16        NA 2020-07-18   2     1     NA          NA
3        pr 2020-07-17        NA 2020-07-18   1     5     NA          NA

> covidcast_signal("jhu-csse", "confirmed_incidence_num", geo_type="state", geo_values="pr", start_day="2020-07-15")
A `covidcast_signal` data frame with 3 rows and 10 columns.

signals     : jhu-csse:confirmed_incidence_num
geo_type    : state

  geo_value time_value direction      issue lag value stderr sample_size
1        pr 2020-07-15        NA 2020-07-18   3   256     NA          NA
2        pr 2020-07-16        NA 2020-07-18   2   195     NA          NA
3        pr 2020-07-17        NA 2020-07-18   1   546     NA          NA

The JHU time series of deaths seems to support this, showing 0 deaths for all time in every county in Puerto Rico -- but that's because the deaths are listed under "Unassigned, Puerto Rico". We should be ingesting these deaths.

Meanwhile, their time series of confirmed cases shows plenty of cases, but for some reason we are not reporting them.

This is preventing the forecasting team from issuing death forecasts for Puerto Rico, and it will block case forecasts as well.
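
For reference, one way to confirm that the deaths really are filed under "Unassigned, Puerto Rico" upstream is to sum the JHU US deaths time series by Admin2. A minimal sketch (the CSV URL and column names follow the usual JHU CSSE layout, but treat them as assumptions):

```python
# Sketch: check where PR deaths sit in the JHU CSSE US deaths time series.
# URL and column names assume the standard JHU CSSE layout.
import pandas as pd

URL = ("https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/"
       "csse_covid_19_data/csse_covid_19_time_series/"
       "time_series_covid19_deaths_US.csv")

df = pd.read_csv(URL)
pr = df[df["Province_State"] == "Puerto Rico"]
date_cols = [c for c in pr.columns if "/" in c]       # date columns look like "7/17/20"
latest = pr.groupby("Admin2")[date_cols].sum().iloc[:, -1]
print(latest.sort_values(ascending=False).head())     # "Unassigned" should carry the deaths
```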

capnrefsmmat commented 4 years ago

See also https://github.com/CSSEGISandData/COVID-19/issues/2889

dshemetov commented 4 years ago

I'm looking into tackling this with #215. I'm thinking of splitting the deaths across the FIPS codes based on population data, so that we don't mix state-level data with FIPS-level data. Would this cause issues downstream in the pipeline? It would help keep the geocoding consistent. We can reaggregate the deaths back to the commonwealth level and serve only that in the API.
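
To make the idea concrete, the split I have in mind is roughly the following (a minimal sketch; `pr_county_pops` is a hypothetical `{fips: population}` mapping, not the actual geo utility):

```python
# Sketch of splitting "Unassigned, Puerto Rico" deaths across PR county FIPS codes
# in proportion to population. pr_county_pops is a hypothetical {fips: population}
# mapping; the real pipeline would use the shared geo crosswalk instead.
def split_unassigned(unassigned_count, pr_county_pops):
    total_pop = sum(pr_county_pops.values())
    return {
        fips: unassigned_count * pop / total_pop
        for fips, pop in pr_county_pops.items()
    }

# Re-aggregating the shares recovers the commonwealth-level total:
# sum(split_unassigned(42, pops).values()) == 42 (up to floating point).
```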

ajgreen93 commented 4 years ago

@krivard , @capnrefsmmat asked me to ping you. The issue with "Unassigned" counts for JHU data seems to be affecting states like Wyoming and Rhode Island (and presumably all other states as well). As a result, it is affecting our state-level forecasts for all states.

(This also seems related to this closed issue.)

krivard commented 4 years ago

In theory, a behavior where we map unassigned cases/deaths to a megacounty that then gets aggregated into the state figures was added in commit 5ff04c0ee487. This commit is present in the commit log for the version of the JHU indicator in production, so if it's not doing the right thing now, it was likely overridden later when we switched to the new geo aggregator. @dshemetov, can you investigate? It may be worth boosting the priority on merging #215; what would we need to get that done in the next two weeks?

In the meantime, @ajgreen93: IIRC we added USAFacts to avoid this exact issue, so you might try switching indicators.

dshemetov commented 4 years ago

Just looked: yes, in deploy-jhu, UIDs of the form 840900XX are being mapped to 900XX without being converted to XX000 (the megacounty fix). This is already handled in #217.
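
For reference, the intended chain of conversions is roughly this (a sketch of the megacounty fix, not the actual geo-aggregator code):

```python
# Sketch: JHU "Unassigned"/"Out of" UIDs of the form 840900XX should end up in the
# state megacounty XX000 rather than the non-FIPS code 900XX.
def uid_to_megacounty(uid: str) -> str:
    if uid.startswith("840900"):           # 840 prefix + 900XX pseudo-FIPS
        state = uid[-2:]                   # e.g. "72" for Puerto Rico
        return state + "000"               # megacounty FIPS, e.g. "72000"
    return uid[-5:]                        # ordinary county: drop the 840 prefix

assert uid_to_megacounty("84090072") == "72000"
assert uid_to_megacounty("84072001") == "72001"
```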

dshemetov commented 4 years ago

@krivard Merging #215 within two weeks is likely doable. What's left are the HHS, National, and DMA-level geocodes; DMA may take some work to track down the crosswalks.

dshemetov commented 4 years ago

Just looked into the Puerto Rico issues. As capnrefsmmat reports, we still don't have Puerto Rico cases in the API after July 17th, despite the data being present in the JHU time_series files we pull. Here is the strange thing: the Puerto Rico cases after July 17th do show up in the receiving folder on the deploy-jhu branch. So the issue must be in the ingestion step after we pull.

The Puerto Rico deaths issue has been fixed with the megaFIPS fix.

krivard commented 4 years ago

We are successfully ingesting data from Puerto Rico for the following combinations: (state, county) X (cases, deaths) X (incidence, cumulative) X (num):

$ zgrep -il -e "^72" -e "^pr" /common/covidcast/archive/successful/jhu-csse/20200908*
/common/covidcast/archive/successful/jhu-csse/20200908_county_confirmed_7dav_cumulative_num.csv.gz
/common/covidcast/archive/successful/jhu-csse/20200908_county_confirmed_7dav_incidence_num.csv.gz
/common/covidcast/archive/successful/jhu-csse/20200908_county_confirmed_cumulative_num.csv.gz
/common/covidcast/archive/successful/jhu-csse/20200908_county_confirmed_incidence_num.csv.gz
/common/covidcast/archive/successful/jhu-csse/20200908_county_deaths_7dav_cumulative_num.csv.gz
/common/covidcast/archive/successful/jhu-csse/20200908_county_deaths_7dav_incidence_num.csv.gz
/common/covidcast/archive/successful/jhu-csse/20200908_county_deaths_cumulative_num.csv.gz
/common/covidcast/archive/successful/jhu-csse/20200908_county_deaths_incidence_num.csv.gz
/common/covidcast/archive/successful/jhu-csse/20200908_state_confirmed_7dav_cumulative_num.csv.gz
/common/covidcast/archive/successful/jhu-csse/20200908_state_confirmed_7dav_incidence_num.csv.gz
/common/covidcast/archive/successful/jhu-csse/20200908_state_confirmed_cumulative_num.csv.gz
/common/covidcast/archive/successful/jhu-csse/20200908_state_confirmed_incidence_num.csv.gz
/common/covidcast/archive/successful/jhu-csse/20200908_state_deaths_7dav_cumulative_num.csv.gz
/common/covidcast/archive/successful/jhu-csse/20200908_state_deaths_7dav_incidence_num.csv.gz
/common/covidcast/archive/successful/jhu-csse/20200908_state_deaths_cumulative_num.csv.gz
/common/covidcast/archive/successful/jhu-csse/20200908_state_deaths_incidence_num.csv.gz

For (state) X (cases, deaths) X (incidence, cumulative) X (prop), the deploy-jhu pipeline is generating PR data, but it puts inf in the value column, which is not permitted:

$ grep -i "^pr" bad-jhu/20200908*
bad-jhu/20200908_state_confirmed_7dav_cumulative_prop.csv:PR,inf,NA,NA
bad-jhu/20200908_state_confirmed_7dav_incidence_prop.csv:PR,inf,NA,NA
bad-jhu/20200908_state_confirmed_cumulative_prop.csv:PR,inf,NA,NA
bad-jhu/20200908_state_confirmed_incidence_prop.csv:PR,inf,NA,NA
bad-jhu/20200908_state_deaths_7dav_cumulative_prop.csv:PR,inf,NA,NA
bad-jhu/20200908_state_deaths_cumulative_prop.csv:PR,inf,NA,NA
bad-jhu/20200908_state_deaths_incidence_prop.csv:PR,inf,NA,NA

(this is probably related to #227)
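
Those inf values are exactly what you would expect if the per-100,000 computation behind the *_prop signals hit a missing or zero PR population entry (a sketch of the arithmetic only; the population figure below is illustrative):

```python
# Sketch: the *_prop signals are counts per 100,000 population, so a missing or
# zero population for PR turns every prop value into inf.
import numpy as np

def to_prop(count, population):
    """Counts per 100,000 population, as the *_prop signals are defined."""
    return count * 100_000 / population

print(to_prop(546, np.float64(3_193_694)))  # ~17.1 with a plausible PR population figure
print(to_prop(546, np.float64(0.0)))        # inf (plus a NumPy warning): a zero population yields exactly this
```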

Since this would cause the ingestion mechanism to reject the whole file, while we wait for fixes on #227 and #254 I have a cron job running that picks up the JHU files from receiving on the server and strips out lines with illegal values and geo identifiers before they reach ingestion. This will probably wreak havoc with the diff-based archive utility once the fixes are in place, but it was a better option than having state case and death ratios be completely unavailable.
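
The filtering step of that cron job amounts to something like this (a simplified sketch; the `geo_id`/`val` column names follow the receiving CSV format, and the exact validity rules here are illustrative, not the job's actual logic):

```python
# Sketch of the stopgap filter: drop rows whose value is not a finite number or
# whose geo_id is malformed, before the file reaches ingestion.
import csv
import math

def filter_csv(src_path, dst_path):
    with open(src_path, newline="") as src, open(dst_path, "w", newline="") as dst:
        reader = csv.DictReader(src)
        writer = csv.DictWriter(dst, fieldnames=reader.fieldnames)
        writer.writeheader()
        for row in reader:
            try:
                value = float(row["val"])
            except ValueError:
                continue
            if not math.isfinite(value):    # drops inf/nan values
                continue
            if "." in row["geo_id"]:        # drops malformed geo ids (illustrative check)
                continue
            writer.writerow(row)
```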

For (county) X (cases, deaths) X (incidence, cumulative) X (prop), the deploy-jhu pipeline does not appear to be generating any entries for Puerto Rico counties at all.

dshemetov commented 4 years ago

@krivard can you check the same thing but for 20200826? For some reason, our API provides Puerto Rico cases since 20200827, but not before.

import covidcast
from datetime import date

cc_df = covidcast.signal("jhu-csse", "confirmed_incidence_num",
                         date(2020, 7, 14), date(2020, 9, 11),
                         geo_type="county")
# earliest date with any PR county (FIPS 72xxx) data:
cc_df[cc_df["geo_value"].isin([str(x) for x in range(72001, 72999)])]["time_value"].min()
# Timestamp('2020-08-27 00:00:00')

It is likely a population divide-by-zero issue, but I'm not sure why it would be day-dependent.

krivard commented 4 years ago

There are no mentions of 72XXX counties before 31 August in the success files. We only keep a backup of the most recently submitted CSV for each day, so any day with PR data in an earlier issue (before the inf bug) will have been overwritten by a more recent issue (after the inf bug) that has had all the invalid PR lines filtered out.

Recall however that the issue definition for JHU has changed multiple times, including:

  1. all days going back to 2 February
  2. only the last 7 days of raw data and the last 1 day of 7dav data
  3. only the new or updated lines in any new or updated file going back to 2 February

so any apparent day-dependence of a zero-population effect may instead reflect which days of data were in the issue when the population data first went awry.


krivard commented 4 years ago

Here's a plot of the number of days in each issue from 2 July to 10 Sept where county data for 72000 (the PR megacounty) or 72001 are available:

library(covidcast)
library(dplyr)
library(ggplot2)

df <- suppressMessages(covidcast_signal("jhu-csse", "confirmed_incidence_num", "2020-07-01", "2020-09-09",
                                        "county", c("72000", "72001"), issues = c("2020-07-02", "2020-09-10")))
dfn <- group_by(df, geo_value, issue) %>% summarise(n = n())
ggplot(dfn, aes(x = issue, y = n, group = geo_value, color = geo_value)) +
  geom_line() +
  ggtitle("number of dates of available data")

[Plot: number of dates of available data per issue, for geo_values 72000 and 72001]

So it looks like we supported PR megacounties through mid-July, then picked up individual county data on 27 August.

In theory the diff-based issue generator should reissue PR county data back to 2 Feb as soon as it becomes available and valid; in practice we may have to babysit it a bit.

dshemetov commented 4 years ago

I'm having trouble grokking this. My guess is it's because I don't know what issue means. Is that the date when the data was released? And is this definition something set by us or by JHU or both?

krivard commented 4 years ago

Ah sorry, that's data versioning terminology. "Issue" as in a magazine issue: a collection of data that was uploaded to receiving and published together. For daily signals, the issue date is a day. For weekly signals, the issue date is an epidemiological week ("epiweek"). More info is in the API docs (https://cmu-delphi.github.io/delphi-epidata/api/covidcast.html#optional) and the Engineering onboarding documentation (https://docs.google.com/document/d/17WMyQQ-zGtVtB8GLaACxLOkbbUPscMyfweqc1FxW-74/edit?usp=sharing).

Eventually we want all indicators to abide by a diff-based issue definition that includes only the rows that changed during the time period covered by the issue. Rows that stayed the same are not explicitly confirmed, and rows that were removed are not currently distinguished from rows that stayed the same; the latter will be addressed in a missingness encoding scheme TBD.
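
To make the diff-based definition concrete: an issue produced this way contains only rows that are new or changed relative to the previously published snapshot, along the lines of this toy sketch (not the actual archive differ):

```python
# Toy sketch of a diff-based issue: keep only rows that are new or whose values
# changed since the last published snapshot. Deleted rows are not represented,
# matching the current (pre-missingness-encoding) behavior described above.
def diff_issue(previous: dict, current: dict) -> dict:
    return {
        key: value
        for key, value in current.items()
        if previous.get(key) != value
    }

prev = {("pr", "2020-07-15"): 2.0, ("pr", "2020-07-16"): 1.0}
curr = {("pr", "2020-07-15"): 2.0, ("pr", "2020-07-16"): 3.0, ("pr", "2020-07-17"): 5.0}
print(diff_issue(prev, curr))  # only the revised 07-16 row and the new 07-17 row
```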


krivard commented 3 years ago

The deploy-jhu branch generates Puerto Rico (PR) data correctly, which means there are at least four possibilities: (1) the differ has a bug; (2) the differ is working correctly but the AWS cache is dirty and causing it to fail; (3) the patch job I added to drop the erroneous .00000 and 8xxxx counties has a bug; (4) something is causing the July-August files to fail validity checks.

Dmitry checked (1) ~and (2)~, and all seems well there. That leaves going through the success/failed archive on the server to see what was actually ingested, and cross-referencing with the ingestion log files to see which of those files was overwritten and when.

We are looking for:

  1. successful July/August files that mention PR regions
  2. failed July/August files that mention PR regions

krivard commented 3 years ago

Successful July/August files that mention PR regions: As expected, no results, except for 31 August -- fair enough. We really just wanted to confirm the gap.

$ find archive/successful/jhu-csse/ -name "20200[78]*state*" -exec zgrep "^pr" {} + | grep -v "_wip_"
$ find archive/successful/jhu-csse/ -name "20200[78]*county*" -exec zgrep -m1 "^72" {} + | grep -v "_wip_"
archive/successful/jhu-csse/20200831_county_confirmed_7dav_cumulative_num.csv.gz:72001,119.0,NA,NA
archive/successful/jhu-csse/20200831_county_confirmed_7dav_incidence_num.csv.gz:72001,0.7142857142857143,NA,NA
archive/successful/jhu-csse/20200831_county_confirmed_cumulative_num.csv.gz:72001,121.0,NA,NA
archive/successful/jhu-csse/20200831_county_confirmed_incidence_num.csv.gz:72001,0.0,NA,NA
archive/successful/jhu-csse/20200831_county_deaths_7dav_cumulative_num.csv.gz:72001,0.0,NA,NA
archive/successful/jhu-csse/20200831_county_deaths_7dav_incidence_num.csv.gz:72001,0.0,NA,NA
archive/successful/jhu-csse/20200831_county_deaths_cumulative_num.csv.gz:72001,0.0,NA,NA
archive/successful/jhu-csse/20200831_county_deaths_incidence_num.csv.gz:72001,0.0,NA,NA

Failed July/August files that mention PR regions: No state files, but county files for all dates (weird?)

$ find archive/failed/jhu-csse/ -name "20200[78]*state*" -exec grep -m1 "^pr" {} + | grep -v "_wip_" | sed 's/_.*//' | sort -u
$ find archive/failed/jhu-csse/ -name "20200[78]*county*" -exec grep -m1 "^72" {} + | grep -v "_wip_" | sed 's/_.*//' | sort -u
archive/failed/jhu-csse/20200701
archive/failed/jhu-csse/20200702
archive/failed/jhu-csse/20200703
archive/failed/jhu-csse/20200704
archive/failed/jhu-csse/20200705
archive/failed/jhu-csse/20200706
archive/failed/jhu-csse/20200707
archive/failed/jhu-csse/20200708
archive/failed/jhu-csse/20200709
archive/failed/jhu-csse/20200710
archive/failed/jhu-csse/20200711
archive/failed/jhu-csse/20200712
archive/failed/jhu-csse/20200713
archive/failed/jhu-csse/20200714
archive/failed/jhu-csse/20200715
archive/failed/jhu-csse/20200716
archive/failed/jhu-csse/20200717
archive/failed/jhu-csse/20200718
archive/failed/jhu-csse/20200719
archive/failed/jhu-csse/20200720
archive/failed/jhu-csse/20200721
archive/failed/jhu-csse/20200722
archive/failed/jhu-csse/20200723
archive/failed/jhu-csse/20200724
archive/failed/jhu-csse/20200725
archive/failed/jhu-csse/20200726
archive/failed/jhu-csse/20200727
archive/failed/jhu-csse/20200728
archive/failed/jhu-csse/20200729
archive/failed/jhu-csse/20200730
archive/failed/jhu-csse/20200731
archive/failed/jhu-csse/20200801
archive/failed/jhu-csse/20200802
archive/failed/jhu-csse/20200803
archive/failed/jhu-csse/20200804
archive/failed/jhu-csse/20200805
archive/failed/jhu-csse/20200806
archive/failed/jhu-csse/20200807
archive/failed/jhu-csse/20200808
archive/failed/jhu-csse/20200809
archive/failed/jhu-csse/20200810
archive/failed/jhu-csse/20200811
archive/failed/jhu-csse/20200812
archive/failed/jhu-csse/20200813
archive/failed/jhu-csse/20200814
archive/failed/jhu-csse/20200815
archive/failed/jhu-csse/20200816
archive/failed/jhu-csse/20200817
archive/failed/jhu-csse/20200818
archive/failed/jhu-csse/20200819
archive/failed/jhu-csse/20200820
archive/failed/jhu-csse/20200821
archive/failed/jhu-csse/20200822
archive/failed/jhu-csse/20200823
archive/failed/jhu-csse/20200824
archive/failed/jhu-csse/20200825
archive/failed/jhu-csse/20200826
archive/failed/jhu-csse/20200827
archive/failed/jhu-csse/20200828
archive/failed/jhu-csse/20200829

These were uploaded on August 28, and include the invalid ".0000" region from #254.

krivard commented 3 years ago

It seems Dmitry checked that his output matches the production cache, but not whether the production cache was dirty.

The cache contains:

The presence of the invalid county codes suggests a dirty cache. There are a couple of ways forward from here:

dshemetov commented 3 years ago

Ah I saw the .0000 counties, but didn't realize they weren't supposed to be there! Feels good to have it narrowed down!

krivard commented 3 years ago

@eujing @korlaxxalrok do you have thoughts on the best way to reset the S3 cache as above?

eujing commented 3 years ago

I feel like the cleanest way would be to regenerate the entire jhu cache: delete the jhu S3 cache and manually run the indicator today, before tomorrow's scheduled run, so it uploads its complete output to S3. Two problems with this would be: 1) the upload will take a while, since it happens serially; 2) we might have to note this event down somewhere if we ever want to reconstruct anything from the S3 object versioning history.
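
For reference, if we do go the full-regeneration route, the reset itself could look roughly like this (a sketch using boto3, with placeholder bucket and prefix names; the real cache layout may differ). If the bucket is versioned, deleting only adds delete markers, so the version history would still be reconstructable:

```python
# Sketch: clear the JHU diffing cache under an S3 prefix so the next indicator run
# re-uploads its complete output. Bucket and prefix names are placeholders.
import boto3

def clear_prefix(bucket: str, prefix: str) -> None:
    s3 = boto3.client("s3")
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        objects = [{"Key": obj["Key"]} for obj in page.get("Contents", [])]
        if objects:
            s3.delete_objects(Bucket=bucket, Delete={"Objects": objects})

# clear_prefix("covidcast-indicator-cache", "jhu/")  # placeholder names
```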