cmu-delphi / covidcast-indicators

Back end for producing indicators and loading them into the COVIDcast API.
https://cmu-delphi.github.io/delphi-epidata/api/covidcast.html
MIT License

Make NCHS data available at HHS, nation level #1041

Open capnrefsmmat opened 3 years ago

capnrefsmmat commented 3 years ago

The NCHS mortality data is currently only available at the state level. It seems like it should be possible to aggregate it to the nation and HHS levels. (If it's not possible for some reason, we should document that so nobody tries to use state and aggregate themselves.) Having all our signals consistently available at HHS and nation when possible would make it easy to compare things.

There's no pressing need for this that I know of; I just noticed the inconsistency and think it'd be nice to fix.

krivard commented 3 years ago

Hey @alexcoda want to take a look at this? Cheryl has a good start in #1213 but it's missing the critical plumbing to actually do the geographic aggregations (it currently just outputs extra copies of the state df under different names).

I've attached a csv file which is the result of recently running pull.pull_nchs_mortality_data with our Socrata key for you to use for testing.

socrata_df.csv

alexcoda commented 3 years ago

@krivard yep! I'll take a crack at it sometime this weekend. I'll let you know if I have any more questions about it

chinandrew commented 3 years ago

Looking at the csv columns, there's a mix of counts and percent values (from the source), so presumably we'll have to do something like the following to do a weighted average of the percentages and an unweighted sum for the counts? And NCHS should cover all states that are included in HHS regions, so we don't need to worry about weird denominator handling, right?

df["weight"] = df["population"]
proportion_vals = gmpr.replace_geocode(df, "state_id", new_geo, ... , date_col="timestamp", data_cols=[<all columns that are a percentage>])
# weight column gets removed
count_vals = gmpr.replace_geocode(df, "state_id", new_geo, ..., date_col="timestamp", data_cols=[<all columns that are counts>])
# combine the two dataframes back

We ended up going down a huge rabbit hole after realizing geomapper doesn't do state_id to fips, only state_code, but will continue to finish this next week.
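
For reference, the weighted-average-versus-sum idea above can be sketched in plain pandas, without geomapper. This is only a rough illustration under stated assumptions: the column names (`deaths`, `pct_covid`), the toy populations and values, the hand-written `state_to_hhs` crosswalk, and the `aggregate` helper are all hypothetical and not taken from the NCHS indicator code. It also ignores missing or suppressed values, which the real data would need to handle.

```python
import pandas as pd

# Hypothetical toy data; column names and numbers are illustrative only.
df = pd.DataFrame({
    "state_id": ["ct", "ma", "ny", "nj"],
    "timestamp": pd.to_datetime(["2021-01-02"] * 4),
    "population": [1000, 3000, 2000, 2000],
    "deaths": [100, 200, 600, 300],          # a count column
    "pct_covid": [10.0, 12.0, 15.0, 11.0],   # a percentage column
})

# Hand-written crosswalk for the toy rows (an assumption standing in for
# whatever mapping the real pipeline uses).
state_to_hhs = {"ct": "1", "ma": "1", "ny": "2", "nj": "2"}


def aggregate(frame, geo_col, count_cols, pct_cols, weight_col="population"):
    """Sum counts and population-weight percentages within each geo/date."""
    work = frame.copy()
    # Scale each percentage by its weight so that sum / total weight
    # yields the weighted mean.
    for col in pct_cols:
        work[col] = work[col] * work[weight_col]
    keys = [geo_col, "timestamp"]
    cols = count_cols + pct_cols + [weight_col]
    out = work.groupby(keys, as_index=False)[cols].sum()
    for col in pct_cols:
        out[col] = out[col] / out[weight_col]
    # The weight column is only needed during aggregation.
    return out.drop(columns=[weight_col])


df["hhs"] = df["state_id"].map(state_to_hhs)
hhs_df = aggregate(df, "hhs", ["deaths"], ["pct_covid"])

df["nation"] = "us"
nation_df = aggregate(df, "nation", ["deaths"], ["pct_covid"])
```

Doing both kinds of column in one groupby avoids having to merge two dataframes back together afterwards, at the cost of scaling and unscaling the percentage columns by hand.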

krivard commented 3 years ago

> presumably we'll have to do something like the following to do

that looks right, yes.

> NCHS should cover all states that are included in HHS regions

That's correct; here's the NCHS coverage map for reference

(and the like five different ways of specifying states is indeed a pain)