cmu-delphi / delphi-epidata

An open API for epidemiological data.
https://cmu-delphi.github.io/delphi-epidata/
MIT License

`flusurv` data is stale #1247

Open brookslogan opened 1 year ago

brookslogan commented 1 year ago

The version of FluSurv-NET data available appears to be from 2021-05-28, containing data through the epiweek labeled with date 2020-04-19, while the upstream source has data for the 2022/2023 flu season.

library(epidatr)
dat = flusurv(locations = "network_all", epiweeks = epirange(201701, 202301)) %>% fetch()
max(dat$release_date)
#> [1] "2021-05-28"
max(dat$epiweek)
#> [1] "2020-04-19"
names(dat)
#>  [1] "release_date" "location"     "issue"        "epiweek"      "lag"         
#>  [6] "rate_age_0"   "rate_age_1"   "rate_age_2"   "rate_age_3"   "rate_age_4"  
#> [11] "rate_overall"

Created on 2023-07-26 with reprex v2.0.2

FluSurv-NET acquisition broke circa 2020-10-09 but was patched to ignore age groups that were introduced then. From the above sample, it looks like these and possibly other age groups are still being ignored: upstream has 2 top-level age groups, 5 subgroups, and 8 subsubgroups, while the API returns only 5 age groups. I believe age group changes may have broken flusurv acquisition at some other point in time as well, so they are a top suspect for the current breakage.

The fluview* outage might be too late to be related. FluSurv-NET reporting is not year-round; it typically starts at some point during the flu season when activity levels / influenza hospitalization numbers are deemed high enough (I don't remember the precise rule) and ends at/after the end of the flu season, with some break in issues and/or gap in measurements before the next season.

brookslogan commented 1 year ago

Some other notes: upstream source:

nmdefries commented 1 year ago

Looking at the oldest and newest logs available for flusurv acquisition (can't find any before April 2023):

we always see "current issue: 202111" and

rows before: 212669
rows after: 212669 (+0)

So issues newer than March 2021 are either not available or not being fetched, and thus no new data is added to the DB. Although we don't have older logs, I'd assume that this has been going on since May 2021, which is the latest release_date available in our flusurv data.
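For reference, the issue label 202111 is an epiweek (year 2021, week 11). A minimal sketch of mapping such a label to a calendar date, assuming the standard MMWR convention (weeks run Sunday to Saturday, and week 1 is the week containing January 4). `epiweek_start` is a hypothetical helper for illustration, not part of the acquisition code:

```python
from datetime import date, timedelta


def epiweek_start(year, week):
    """Return the Sunday that starts the given MMWR epiweek (assumed convention)."""
    jan4 = date(year, 1, 4)
    # date.weekday() is Monday=0 .. Sunday=6, so this is the number of days since Sunday.
    days_since_sunday = (jan4.weekday() + 1) % 7
    week1_start = jan4 - timedelta(days=days_since_sunday)
    return week1_start + timedelta(weeks=week - 1)


print(epiweek_start(2021, 11))  # 2021-03-14, i.e. mid-March 2021
print(epiweek_start(2020, 17))  # 2020-04-19, the latest epiweek date in the reprex above
```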

This matches my local testing. get_current_issue() returns 202111 using a magic URL. I wonder if we were supposed to switch to a different magic URL.

There are no errors; the pipeline appears to have been running successfully this whole time.

nmdefries commented 1 year ago

The cdcfluview package successfully gets up-to-date data (loaddatetime is Aug 12, 2023; data is available through 2023w17), so it looks like the Flu3 endpoints changed. The old ones still work but aren't returning new data.

The CDC GIS GRASP API/AMF server appears to be meant solely for internal use -- I can't find any documentation ~and can't find the new https://gis.cdc.gov/GRASP/Flu3/PostPhase03DataTool endpoint myself. I assume it is used somewhere in the dashboard source.~ The new https://gis.cdc.gov/GRASP/Flu3/PostPhase03DataTool endpoint can be found by loading (in Chrome) the source dashboard, turning on the inspector, going to the Network tab, and reloading or otherwise interacting with the page. API queries will show up as network requests.

The returned data has also changed format, so we'll need to update flusurv.extract_from_object() to account for that. ~It doesn't appear possible to request specific locations from the new endpoint.~ Edit: You can request locations with a payload like

{
  "appversion": "Public",
  "key": "getdata",
  "injson": [
    {
      "seasonid": 62,
      "networkid": 2,
      "catchmentid": 22
    }
  ]
}

Not sure if seasonid is required.
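A rough sketch of building and sending that payload from Python, using only the standard library. The assumptions here: the payload goes to the PostPhase03DataTool endpoint mentioned above, the server accepts a JSON body with a `Content-Type: application/json` header, and `seasonid` can be left out. None of this is documented, and `build_payload` and `post_payload` are hypothetical names:

```python
import json
import urllib.request

# Endpoint named earlier in this thread; it is undocumented and may change.
URL = "https://gis.cdc.gov/GRASP/Flu3/PostPhase03DataTool"


def build_payload(networkid, catchmentid, seasonid=None):
    """Build a request body shaped like the example payload above."""
    selector = {"networkid": networkid, "catchmentid": catchmentid}
    if seasonid is not None:
        # Whether the server requires seasonid is unknown, so it is optional here.
        selector["seasonid"] = seasonid
    return {"appversion": "Public", "key": "getdata", "injson": [selector]}


def post_payload(payload, url=URL, timeout=60):
    """POST the payload as JSON and return the decoded response (not called here)."""
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)


# Same values as the example payload above.
payload = build_payload(networkid=2, catchmentid=22, seasonid=62)
```

Calling `post_payload(build_payload(2, 22))` without a `seasonid` would be one way to find out whether that field is required.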

To avoid fetching the same (large) JSON multiple times, my recommendation is to fetch once upfront in main(), and pass the resulting JSON to get_current_issue and get_data functions to extract data of interest. This also avoids different endpoints potentially returning data from different loaddatetime. (Currently we use GetPhase03InitApp for the loaddatetime and PostPhase03GetData for the location data. With no documentation, it's hard to guarantee their behavior.)
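A minimal sketch of that fetch-once structure. The response layout used here (an ISO-formatted `loaddatetime` string plus a `rows` list) and the dummy values are placeholders, since the real format is undocumented; the function bodies are illustrative and not the existing implementations:

```python
from datetime import date


def fetch_flusurv_json():
    """Stand-in for the single network request; returns made-up placeholder data."""
    return {
        "loaddatetime": "2023-08-12",  # assumed format
        "rows": [
            {"location": "network_all", "epiweek": 202317, "rate_overall": 0.0},
            {"location": "CA", "epiweek": 202317, "rate_overall": 0.0},
        ],
    }


def get_current_issue(data):
    """Derive the issue date from the already-fetched object."""
    return date.fromisoformat(data["loaddatetime"])


def get_data(data, location):
    """Extract one location's rows from the already-fetched object."""
    return [row for row in data["rows"] if row["location"] == location]


def main(fetch=fetch_flusurv_json):
    data = fetch()  # the only request; both helpers read this same object
    issue = get_current_issue(data)
    rows = get_data(data, "network_all")
    return issue, rows
```

Because both helpers read the same object, the issue date and the location data cannot come from different loads.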

We should also consider how to avoid this type of issue in the future. Based on our mirror of the historical data, the source is updated infrequently so it's possible that we'd see fairly long periods without any data updates. This means that we can't just error out if no new data is returned.
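One possible direction, sketched under assumptions: rather than failing when a run adds no rows, compare the latest release date against a quiet-period threshold and only raise a warning once that threshold is exceeded. The 270-day default and the name `staleness_status` are arbitrary placeholders, not values taken from the pipeline:

```python
from datetime import date, timedelta


def staleness_status(latest_release, today, max_quiet=timedelta(days=270)):
    """Return "warn" if no new release has appeared for longer than max_quiet, else "ok"."""
    quiet_period = today - latest_release
    return "warn" if quiet_period > max_quiet else "ok"


# Dates from this issue: latest release_date 2021-05-28, reprex created 2023-07-26.
print(staleness_status(date(2021, 5, 28), date(2023, 7, 26)))  # warn
```

The threshold would need to be longer than the usual off-season gap in FluSurv-NET reporting to avoid false alarms.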

nmdefries commented 1 year ago

On recovering versioned data: again because of the lack of documentation, it's unclear to me whether it's possible to request data from a particular loaddatetime. Edit: According to our CDC contacts, this is not possible.

nmdefries commented 1 year ago

Since fluview* signals pull from another CDC dashboard, we should double-check that those signals are updating correctly. Their CDC API endpoints may also have changed.

brookslogan commented 1 year ago

The fluview outage also involved reporting the wrong "current" epiweek, but apparently was due to writing/querying from the wrong server/table. Doesn't seem like that is the case here, but I'm not 100% certain.

The fluview ones use Phase02 rather than Phase03, but still sounds like a good idea to check!

nmdefries commented 1 year ago

In the motivating example,

library(epidatr)
dat = flusurv(locations = "network_all", epiweeks = epirange(201701, 202301)) %>% fetch()
...
names(dat)
#>  [1] "release_date" "location"     "issue"        "epiweek"      "lag"         
#>  [6] "rate_age_0"   "rate_age_1"   "rate_age_2"   "rate_age_3"   "rate_age_4"  
#> [11] "rate_overall"

the result doesn't have all the value columns we'd expect. Acquisition attempts to update these columns, plus rate_age_5, rate_age_6, and rate_age_7. Float fields need to be updated.
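A quick way to make the gap explicit, using the column names from the output above and assuming the expected set is rate_age_0 through rate_age_7 plus rate_overall:

```python
# Column names returned by the API, copied from the names(dat) output above.
returned = [
    "release_date", "location", "issue", "epiweek", "lag",
    "rate_age_0", "rate_age_1", "rate_age_2", "rate_age_3", "rate_age_4",
    "rate_overall",
]

# Rate columns that acquisition attempts to update (assumed complete list).
expected_rates = [f"rate_age_{i}" for i in range(8)] + ["rate_overall"]

missing = [column for column in expected_rates if column not in returned]
print(missing)  # ['rate_age_5', 'rate_age_6', 'rate_age_7']
```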