Closed mgraber closed 3 years ago
Draft updated 2019 metadata using reference table provided by population. Currently, all fields except for census_variable
are pulled from a join with the previous metadata json file. We need to fill in base_variable
, domain
, rounding
, source
for the 84 new variables. We also need to know which variables are no longer relevant to include (for example, many of the new language variables replace old language variables that we no longer have data for).
Created by:
import pandas as pd
df = pd.read_csv("ACS1519meta.csv")
meta_prev = pd.read_json("metadata.json")
df1 = df[df["schema"].notnull()][["schema","Name"]].rename(columns={"schema": "pff_variable", "Name":"census_variable"})
df2 = df[df["schema2"].notnull()][["schema2","Name"]].rename(columns={"schema2": "pff_variable", "Name":"census_variable"})
df3 = df[df["schema3"].notnull()][["schema3","Name"]].rename(columns={"schema3": "pff_variable", "Name":"census_variable"})
df4 = df[df["schema4"].notnull()][["schema4","Name"]].rename(columns={"schema4": "pff_variable", "Name":"census_variable"})
df_stacked = pd.concat([df1, df2, df3, df4])
df_stacked["pff_variable"] = df_stacked["pff_variable"].str.lower()
df_stacked["census_variable"] = df_stacked["census_variable"].str.replace("E", "")
df_stacked = df_stacked.groupby("pff_variable")["census_variable"].apply(list)
df_join = meta_prev.merge(df_stacked, how="outer", on="pff_variable")
df_join["census_variable"] = df_join["census_variable_y"].fillna(df_join["census_variable_x"])
meta_cur = df_join[["pff_variable","base_variable","census_variable","domain","rounding","source"]]
meta_cur.to_csv("meta2019.csv")
meta_json = meta_cur.set_index("pff_variable").to_json("metadata_2019.json", orient="index")
Metadata with base variables, domains, and notes: meta2019.csv To convert:
import pandas as pd
df = pd.read_csv("meta2019.csv")
# Remove capitalization
df["pff_variable"] = df.pff_variable.str.lower()
df["base_variable"] = df.base_variable.str.lower()
df["domain"] = df.domain.str.lower()
df["source"] = df.source.str.lower()
df["census_variable"] = df.census_variable.apply(eval)
# Added record is an exact duplicate of an existing record
df = df[(df.Notes != "remove") & (df.Notes != "added to list, variable change")]
df = df[["pff_variable","base_variable","census_variable","domain","rounding","source"]]
df.to_json("metadata_2019.json", orient="records")
#meta_json = df.set_index("pff_variable").to_json("metadata_2019.json", orient="index")
Using reference table in db-acs-2020 repo.