DS4PS / cpp-528-spr-2020

Course shell for CPP 528 Foundations of Data Science III for Spring 2020.
http://ds4ps.org/cpp-528-spr-2020/

creating stable FIPS ids #14

Open lecy opened 4 years ago

lecy commented 4 years ago

@sunaynagoel

Fixing FIPS codes to avoid leading zero problems.

> # EXAMPLE NUMERIC IDs
> # eg. FIPS state-county-tract
> #
> x <- c( 001, 010, 100 )
> y <- c( 002, 020, 200 )
> z <- c( 000300, 030000, 300000 )
> 
> # leading zeros problem
> x
[1]   1  10 100
> 
> # manual solution:
> # add a power of ten larger than any ID,
> # then keep only the last 3 characters (zeros included)
> x2 <- x + 100000
> x2
[1] 100001 100010 100100
> substr( x2, nchar(x2)-2, nchar(x2) )
[1] "001" "010" "100"
> 
> # use of formatC function
> # width is the total number of digits in the result
> # format = "d" treats the input as an integer; formatC always returns a character vector
> # flag = "0" pads the left side with zeros up to width
> 
> formatC( x, width = 3, format = "d", flag = "0" )
[1] "001" "010" "100"
> 
> xf <- formatC( x, width = 3, format = "d", flag = "0" )
> yf <- formatC( y, width = 3, format = "d", flag = "0" )
> zf <- formatC( z, width = 6, format = "d", flag = "0" )
> 
> # create new ID that starts with a character
> # and explicitly separates levels of the FIPS
> 
> paste( "fips", xf, yf, zf, sep="-" )
[1] "fips-001-002-000300" "fips-010-020-030000" "fips-100-200-300000"

You can easily parse this ID to extract state, county, and tract when needed.
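
Parsing can be sketched roughly like this. It assumes every ID has the same four hyphen-separated pieces as the example output above; the names `id`, `parts`, and `m` are just for illustration.

```r
# IDs in the same style as the paste() output above
id <- c( "fips-001-002-000300", "fips-010-020-030000", "fips-100-200-300000" )

# split each ID on the literal hyphen; each element yields 4 pieces
parts <- strsplit( id, "-", fixed = TRUE )

# stack the pieces into a character matrix, one row per ID
m <- do.call( rbind, parts )

# column 1 is the "fips" label; columns 2-4 are the components
state  <- m[ , 2 ]
county <- m[ , 3 ]
tract  <- m[ , 4 ]
```

Because the pieces stay as character vectors, the leading zeros survive the split.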

Note that state FIPS is actually 2 digits, not 3.

You could create other levels by adding a leading character. This format perhaps?

s-## c-### t-######

If you convert a number to a character to solve the problem, then write the dataset to a CSV and re-load it, the vector will be converted back to a number by default and the leading zeros are lost again.
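
A small sketch of that round trip, using a made-up two-row data frame and a temporary file. The `colClasses` argument is one way to keep the columns as character when reading. It is not part of the original instructions, just an option worth knowing about.

```r
# toy data: IDs already stored as character with leading zeros
d <- data.frame( state  = c( "01", "17" ),
                 county = c( "030", "031" ),
                 stringsAsFactors = FALSE )

f <- tempfile( fileext = ".csv" )
write.csv( d, f, row.names = FALSE )

# default read: column types are guessed, so the IDs come back as numbers
d1 <- read.csv( f )

# forcing every column to character keeps the zeros
d2 <- read.csv( f, colClasses = "character" )

unlink( f )   # remove the temporary file
```

Comparing `d1$state` with `d2$state` shows the difference: the first is numeric, the second still has "01".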

If there is a leading character you will never have the leading zeros problem. But you might need to remove the leading codes before combining into a unified FIPS.
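
Here is a rough sketch of the prefixed format and of stripping the prefixes again before building a unified FIPS. The `strip` helper and the example values are hypothetical, and the regular expression assumes each prefix is a single lowercase letter followed by a hyphen.

```r
# numeric components with the leading zeros already lost
state  <- c( 1, 17 )
county <- c( 30, 31 )
tract  <- c( 999911, 10100 )

# pad to fixed width, then add a leading character for each level
s.id <- paste0( "s-", formatC( state,  width = 2, format = "d", flag = "0" ) )
c.id <- paste0( "c-", formatC( county, width = 3, format = "d", flag = "0" ) )
t.id <- paste0( "t-", formatC( tract,  width = 6, format = "d", flag = "0" ) )

# hypothetical helper: drop the one-letter prefix and hyphen
strip <- function( x ) sub( "^[a-z]-", "", x )

# unified 11-digit FIPS as a character vector
fips <- paste0( strip( s.id ), strip( c.id ), strip( t.id ) )
```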

The biggest issue is when someone creates a unified FIPS without resolving leading zeros:

fips <- paste0( state.id, county.id, tract.id )

Now if a FIPS code is fewer than 11 digits long, you cannot tell which component lost its zeros, and the integrity of your data is broken for a subset of observations. That's the type of problem you can't undo if you don't have the original sub-components of the IDs.
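
To make the ambiguity concrete, here is a small made-up example where two different tracts collapse into the same string. The `pad` helper is hypothetical, just a wrapper around `formatC`.

```r
# two different places, stored as numbers (leading zeros already gone)
a <- paste0( 1, 11, 1 )    # meant to be 01-011-000001
b <- paste0( 11, 1, 1 )    # meant to be 11-001-000001

same <- a == b             # both collapse to "1111"

# hypothetical helper: pad each component to its fixed width first
pad <- function( x, w ) formatC( x, width = w, format = "d", flag = "0" )

a.ok <- paste0( pad( 1, 2 ),  pad( 11, 3 ), pad( 1, 6 ) )
b.ok <- paste0( pad( 11, 2 ), pad( 1, 3 ),  pad( 1, 6 ) )
```

Once the components are padded before pasting, the two IDs stay distinct and both are 11 characters long.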

JaesaR commented 4 years ago

I'm a bit confused by these instructions:

st.fips <- state + 10000
st.fips <- substr( st.fips, 4, 5 )  # extract last two numbers 
ct.fips <- county + 10000
ct.fips <- substr( ct.fips, 3, 5 )  # extract last three numbers 
county.fips <- paste0( st.fips, ct.fips )

If my FIPS is: 17-031-010100, would the following code be correct?

st.fips <- 17 + 10000
st.fips <- substr( st.fips, 1, 7 )  # extract last two numbers 
ct.fips <- 031 + 10000
ct.fips <- substr( ct.fips, 3, 1 )  # extract last three numbers 
fips <- paste0( st.fips, ct.fips )

lecy commented 4 years ago

> # this doesn't work
> st.fips <- 17 + 10000
> substr( st.fips, 1, 7 )
[1] "10017"
> 
> # substring extracts part of a string
> # start = first position in the string
> # stop = last position in the string
> args( substr )
function (x, start, stop) 
NULL
> 
> x <- "aloysius snuffleupagus"
> substr( x, 10, 16 )
[1] "snuffle"
> 
> # leading zeros problem
> state <- 01
> county <- 030
> tract <- 999911
> paste( state, county, tract, sep="-" )
[1] "1-30-999911"
> 
> # should be 01-030-999911
> 
> # convert numeric vectors to character
> # with the leading zeros intact
> 
> substr( state+10000, start=4, stop=5 )
[1] "01"
> substr( county+10000, start=3, stop=5 )
[1] "030"
> 
> s.fixed <- substr( state+10000, start=4, stop=5 )
> c.fixed <- substr( county+10000, start=3, stop=5 )
> t.fixed <- substr( tract+100000000, start=4, stop=9 )
> 
> paste( s.fixed, c.fixed, t.fixed, sep="-" )
[1] "01-030-999911"
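
Applied to the FIPS from the question (17-031-010100), the same steps would look roughly like this. The numeric values are what R stores once the leading zeros are dropped; the `formatC` lines are an alternative that should give the same strings.

```r
# 031 and 010100 typed as numbers are stored as 31 and 10100
state  <- 17
county <- 31
tract  <- 10100

# offset method: add a large power of ten, keep the trailing digits
s.fixed <- substr( state  + 10000,     start = 4, stop = 5 )
c.fixed <- substr( county + 10000,     start = 3, stop = 5 )
t.fixed <- substr( tract  + 100000000, start = 4, stop = 9 )

fips <- paste( s.fixed, c.fixed, t.fixed, sep = "-" )

# alternative: pad with formatC instead of counting positions
fips2 <- paste( formatC( state,  width = 2, format = "d", flag = "0" ),
                formatC( county, width = 3, format = "d", flag = "0" ),
                formatC( tract,  width = 6, format = "d", flag = "0" ),
                sep = "-" )
```

The start and stop positions depend on the size of the number you add, so the `formatC` version may be easier to get right.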