CenterOnBudget / getcensus

Load American Community Survey data from the U.S. Census Bureau API into Stata
https://centeronbudget.github.io/getcensus/
MIT License

Census GEOIDs incorrectly formatted #57

Closed martinjbraun closed 2 years ago

martinjbraun commented 2 years ago

Is your feature request related to a problem? Please describe.

When importing block group data using getcensus, the variables tract and blockgroup are not stored with the correct types. The variable tract is stored as long and should be str6. In practice this is a problem because many Census tract codes have leading zeros, which are lost when tract is stored as long instead of as a string. Census block group codes are always 1 digit, but blockgroup is stored as str3 after running getcensus. No information is lost for block groups, but there is no reason blockgroup shouldn't be str1.

Storing these variables as correctly sized strings would make it easy to generate, for example, a 12-digit GEOID for each block group with gen geo12=state+county+tract+blockgroup, without errors caused by tract or blockgroup being stored incorrectly.

Describe the solution you'd like

It would be great if geographic variables were stored as strings after running getcensus, so that their formats match the Census Bureau's GEOID conventions: https://www.census.gov/programs-surveys/geography/guidance/geo-identifiers.html

For example, it would be great if the following geographies always had these storage types after running getcensus:

state       str2
county      str3
tract       str6
blockgroup  str1

I haven't checked other geographic levels, like ZIP code, but it would be great if these also had the correct storage types after import. Standardized formatting would help ensure users don't make mistakes later.

Describe alternatives you've considered

I have written my own code to reformat the variables after the fact, but I don't know if it works in all cases.
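For readers who need a stopgap, an after-the-fact cleanup along these lines might look like the sketch below. It is not the code mentioned above and not part of getcensus. The toy rows are made-up illustration values standing in for a getcensus result, and the sketch assumes state, county, and blockgroup arrive as strings while tract arrives as long. strtrim() needs Stata 14 or newer.

```stata
* Toy data standing in for a getcensus result (made-up illustration values)
clear
input str2 state str3 county long tract str3 blockgroup
"37" "183"  50100 "1"
"37" "183" 980100 "2"
end

* Convert tract to a zero-padded 6-character string, only if it is numeric
capture confirm numeric variable tract
if _rc == 0 {
    tostring tract, replace format(%06.0f)
}

* Drop stray whitespace, then let compress shrink blockgroup to the shortest str#
replace blockgroup = strtrim(blockgroup)
compress blockgroup

* Build the 12-digit block group GEOID and check its length
generate str12 geo12 = state + county + tract + blockgroup
assert strlen(geo12) == 12
list
```

The capture confirm guard means the sketch should also run unchanged once tract is imported as a string.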

Additional context

Consider the following example:

getcensus B25075_001, sample(5) years(2015) geography(bg) statefips(37) countyfips(183) clear

If you then run tab tract, you'll see that some tract codes are 5 digits and some are 6 digits. Running describe tract shows that this is because tract is stored as long instead of str6.
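The effect can be reproduced without calling the API. The following is a minimal sketch using two made-up tract codes (not values taken from the getcensus output); it shows how destringing drops a leading zero and leaves codes of different lengths.

```stata
* Two hypothetical 6-character tract codes, one with a leading zero
clear
input str6 tract_raw
"050100"
"980100"
end

* Destringing converts the codes to numbers, which cannot keep the leading zero
destring tract_raw, generate(tract_num)

* Count the digits that survive in the numeric version
generate ndigits = strlen(string(tract_num))
list tract_raw tract_num ndigits
```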

If you run describe blockgroup, you'll see that blockgroup is stored as str3 even though Census block group codes are always 1 digit.

c-zippel commented 2 years ago

Thanks for flagging. I will fix in the next release.

Here's what's causing the problem, in case you're interested: Under the hood, getcensus has a list of the geography variables that are returned with a given geography(), and it avoids destringing those variables. I see that tract was omitted from that list for geography(bg). Once it's added, getcensus will no longer destring tract.

The str3 formatting of blockgroup is residue from how getcensus imports and cleans up the API response. I think the best way for me to fix it is to add a simple compress at the end of that process.
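As an illustration of what compress does to an over-wide string variable, here is a small standalone sketch. It uses toy data and is not the getcensus source; the variable is declared str3 by hand to mimic the situation described above.

```stata
* Toy variable declared wider than its contents require
clear
input str3 blockgroup
"1"
"2"
end
describe blockgroup

* compress shortens str# variables to the longest value actually stored
compress blockgroup
describe blockgroup

* Confirm the storage type is now str1
local bgtype : type blockgroup
assert "`bgtype'" == "str1"
```

Because compress never changes stored values, running it at the end of an import should be safe for the other variables as well.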