NYCPlanning / db-factfinder

data ETL for population fact finder (decennial + acs)
https://nycplanning.github.io/db-factfinder/factfinder/
MIT License
2 stars 3 forks source link

Update variable mappings for 2019 ACS #97

Closed mgraber closed 3 years ago

mgraber commented 3 years ago

Using reference table in db-acs-2020 repo.

mgraber commented 3 years ago

meta2019.csv

Draft updated 2019 metadata using reference table provided by population. Currently, all fields except for census_variable are pulled from a join with the previous metadata json file. We need to fill in base_variable, domain, rounding, source for the 84 new variables. We also need to know which variables are no longer relevant to include (for example, many of the new language variables replace old language variables that we no longer have data for).

Created by:

import pandas as pd 

df = pd.read_csv("ACS1519meta.csv")
meta_prev = pd.read_json("metadata.json")

df1 = df[df["schema"].notnull()][["schema","Name"]].rename(columns={"schema": "pff_variable", "Name":"census_variable"})
df2 = df[df["schema2"].notnull()][["schema2","Name"]].rename(columns={"schema2": "pff_variable", "Name":"census_variable"})
df3 = df[df["schema3"].notnull()][["schema3","Name"]].rename(columns={"schema3": "pff_variable", "Name":"census_variable"})
df4 = df[df["schema4"].notnull()][["schema4","Name"]].rename(columns={"schema4": "pff_variable", "Name":"census_variable"})

df_stacked = pd.concat([df1, df2, df3, df4])
df_stacked["pff_variable"] = df_stacked["pff_variable"].str.lower()
df_stacked["census_variable"] = df_stacked["census_variable"].str.replace("E", "")
df_stacked = df_stacked.groupby("pff_variable")["census_variable"].apply(list)

df_join = meta_prev.merge(df_stacked, how="outer", on="pff_variable")
df_join["census_variable"] = df_join["census_variable_y"].fillna(df_join["census_variable_x"])

meta_cur = df_join[["pff_variable","base_variable","census_variable","domain","rounding","source"]]

meta_cur.to_csv("meta2019.csv")
meta_json = meta_cur.set_index("pff_variable").to_json("metadata_2019.json", orient="index")
mgraber commented 3 years ago

Metadata with base variables, domains, and notes: meta2019.csv To convert:

import pandas as pd 

df = pd.read_csv("meta2019.csv")

# Remove capitalization
df["pff_variable"] = df.pff_variable.str.lower()
df["base_variable"] = df.base_variable.str.lower()
df["domain"] = df.domain.str.lower()
df["source"] = df.source.str.lower()
df["census_variable"] = df.census_variable.apply(eval)

# Added record is an exact duplicate of an existing record
df = df[(df.Notes != "remove") & (df.Notes != "added to list, variable change")]
df = df[["pff_variable","base_variable","census_variable","domain","rounding","source"]]

df.to_json("metadata_2019.json", orient="records")
#meta_json = df.set_index("pff_variable").to_json("metadata_2019.json", orient="index")