degauss-org / census_block_group

A docker container for assigning census block group id to geocoded addresses.
https://degauss.org/census_block_group
GNU General Public License v3.0

Some LAT/LON Pairs Appear to Return Multiple fips_block_group_id_2020 values #13

Closed: camachop-dbhi closed this issue 2 years ago

camachop-dbhi commented 2 years ago

Hi @cole-brokamp @erikarasnick @andrew-vancil

Using this package I've noticed that there are some (albeit rare) instances where a given LAT/LON pair will return more than one fips_block_group_id_2020 value (i.e., one row of input will return more than one row of output).

This seems to be a pretty rare occurrence: in my file of around 3.4 million addresses, only around 4,000 had multiple different fips_block_group_id_2020 values assigned. However, I would have expected each distinct LAT/LON to be associated with no more than one fips_block_group_id_2020 value.

Is this something that any of you have seen before (i.e., is it expected)? And/or is this something that you'd be able to recommend a fix for?

Best, -Pete

cole-brokamp commented 2 years ago

Hi Pete,

I believe this might happen when a point lands exactly on the boundary between two census tracts. I have seen the R code we use for this operation (https://github.com/degauss-org/census_block_group/blob/c85514a38b9ba724b66973c57f8332c5f4818674/census_block_group.R#L54) do this before. Is there a specific example of a lat/lon coordinate that produces two census tracts that you could share with us so we can troubleshoot?

This could be considered "expected" from our point of view, but I think it would be better to either alert the user or return only one identifier. It seems like a problem that the container can return more than one row of output per row of input -- I would not expect this as a user.

cole-brokamp commented 2 years ago

We've run into the same issue in the geocoder, where a geocoding query itself can result in a tie; there, we always return the first result:

https://github.com/degauss-org/geocoder/blob/9837c8543cf42d7f3bb5f77da009422d60c6a236/geocode.R#L70-L91

I'm wondering if it wouldn't be a good idea to do the same thing here.
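
As a rough sketch of that tie-breaking idea (not the geocoder's actual code; the .row column and FIPS values below are made up), keeping only the first matched result per input row could look like:

library(dplyr)

# hypothetical join output where input row 2 matched two block groups
joined <- tibble::tibble(
  .row = c(1, 2, 2, 3),
  fips_block_group_id_2020 = c("390610001001", "390610002011",
                               "390610002012", "390610003001")
)

# keep only the first matched block group per input row
joined |>
  group_by(.row) |>
  slice(1) |>
  ungroup()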

camachop-dbhi commented 2 years ago

Hey @cole-brokamp ,

Thanks for the quick follow-up!

I'm actually using this process with EMR/EHR data, so in this situation I think the lat/lon values I have would count as PHI, which prevents me from sharing any explicit examples that produce this behavior (which unfortunately does make this troubleshooting a bit difficult ☚ī¸ lol sorry!).

However, I can attempt to come up with some artificial (non-PHI) examples on my own that reproduce this behavior if you think that would be helpful (though this might be a bit tricky). Let me know if this is something you would need for this troubleshooting.

All that said, if the approach you mentioned in the geocoder package could be applied here, to systematically resolve the occurrence of multiple "block group" and/or "tract" ids for one line of input (and resolve the tie by returning the first record), that would be awesome 😄

cole-brokamp commented 2 years ago

Ah, yes, the curse of PHI data and not being able to reproduce any problems publicly... I know this problem well 😄

I do think we will implement this fix, but I would love to find a reproducible example to map/show and make sure this is the issue. I will try to create one myself and look more into it.

Thanks again for reporting this.

camachop-dbhi commented 2 years ago

Okay sure thing, thanks @cole-brokamp !! I'll try to find a reproducible example that I am able to share as well and will send along if I am able to find anything 👍 ✨

cole-brokamp commented 2 years ago

I've come up with a helpful example: two square polygons and some points to use for example joins:

library(sf)

# three points: one inside the first square, one on the shared edge,
# and one on the shared vertex
pts <-
  st_sfc(
    st_point(c(.5,.5)),
    st_point(c(1,.5)),
    st_point(c(1,0))) |>
  st_as_sf() |>
  dplyr::mutate(id = 1:3)

# two adjacent unit squares that share the edge from (1,0) to (1,1)
pol <- st_sfc(
  st_polygon(list(rbind(c(0,0), c(1,0), c(1,1), c(0,1), c(0,0)))),
  st_polygon(list(rbind(c(1,0), c(2,0), c(2,1), c(1,1), c(1,0))))
  )
pol <- st_as_sf(pol)
pol$id <- c("A", "B")

library(ggplot2)

ggplot(pol) +
  geom_sf() +
  geom_sf(data = pts)
[plot: the two adjacent squares A and B, with one point inside A, one point on the shared edge, and one point on the shared vertex]

Running st_join(pts, pol) (as is done in census_block_group and st_census_tract) returns duplicates for the points that are either on a shared polygon vertex or on a shared boundary:

st_join(pts, pol)

## Simple feature collection with 5 features and 2 fields
## Geometry type: POINT
## Dimension:     XY
## Bounding box:  xmin: 0.5 ymin: 0 xmax: 1 ymax: 0.5
## CRS:           NA
##     id.x id.y               x
## 1      1    A POINT (0.5 0.5)
## 2      2    A   POINT (1 0.5)
## 2.1    2    B   POINT (1 0.5)
## 3      3    A     POINT (1 0)
## 3.1    3    B     POINT (1 0)

Specifying the argument largest = TRUE leads to only one result being returned per point:

st_join(pts, pol, largest = TRUE)

## Simple feature collection with 3 features and 2 fields
## Geometry type: POINT
## Dimension:     XY
## Bounding box:  xmin: 0.5 ymin: 0 xmax: 1 ymax: 0.5
## CRS:           NA
##   id.x id.y               x
## 1    1    A POINT (0.5 0.5)
## 2    2    A   POINT (1 0.5)
## 3    3    A     POINT (1 0)
## Warning message:
## attribute variables are assumed to be spatially constant throughout all geometries 

It calculates the largest intersection area (hence the accompanying warning message) in order to join only one polygon feature to each point... I'm not sure how it makes this choice when a point intersects two polygons, but it can't be worse than picking the "first" one.
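
As a quick sanity check with the toy pts / pol objects above, comparing row counts confirms the join now returns exactly one output row per input point:

nrow(st_join(pts, pol, largest = TRUE)) == nrow(pts)

## [1] TRUE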

@erikarasnick -- what do you think of this? any downsides to just using largest = TRUE for any st_join in all containers?

erikarasnick commented 2 years ago

@cole-brokamp using largest = TRUE was my first thought when I read this issue. The only downside is that I think it adds some computation time. With such a small example the difference is too small to measure, and I'm not sure how/if it would matter for a larger dataset.
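
One rough way to gauge that cost (just a sketch, using synthetic points and the toy pol squares from the example above rather than real block group polygons):

library(sf)

# two adjacent unit squares, as in the example above
pol <- st_as_sf(st_sfc(
  st_polygon(list(rbind(c(0,0), c(1,0), c(1,1), c(0,1), c(0,0)))),
  st_polygon(list(rbind(c(1,0), c(2,0), c(2,1), c(1,1), c(1,0))))
))
pol$id <- c("A", "B")

# 10,000 random points scattered over both squares
set.seed(1)
big_pts <- do.call(st_sfc, lapply(1:10000, \(i) st_point(runif(2, 0, 2)))) |>
  st_as_sf()

system.time(st_join(big_pts, pol))                  # plain join
system.time(st_join(big_pts, pol, largest = TRUE))  # join with tie-breaking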

cole-brokamp commented 2 years ago

Hi @camachop-dbhi, can you please try to rerun your large address file using a development version of the container? (We are trying this in #14)

You can call the updated container by replacing degauss/census_block_group:0.4.0 with ghcr.io/census_block_group:largest. This should pull and run the updated container.

Please let us know if it fixes the "ties" issue and also if there is a noticeable difference in amount of time it takes to run.

camachop-dbhi commented 2 years ago

Hey @cole-brokamp ,

Sure thing! I'll try running this later today and will let you know by tomorrow if it resolves the "tie" issue I had been seeing (also if there is a "noticeable" difference in run time).

camachop-dbhi commented 2 years ago

Hey @cole-brokamp ,

Trying to pull this updated container but getting the below error:

[IN]

docker pull ghcr.io/census_block_group:largest

[OUT]


Trying to pull repository ghcr.io/census_block_group ... 
unexpected http code: 400, URL: https://ghcr.io/token?scope=repository%3Acensus_block_group%3Apull&service=ghcr.io

Any thoughts on how I might be able to resolve this?

cole-brokamp commented 2 years ago

I made a mistake and the pull should be docker pull ghcr.io/degauss-org/census_block_group:largest. I forgot the degauss-org part to specify the organization!
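
For reference, once the pull works, the full run should look something like this (following the usual DeGAUSS pattern; my_address_file_geocoded.csv is a placeholder for your input file name):

docker run --rm -v $PWD:/tmp ghcr.io/degauss-org/census_block_group:largest my_address_file_geocoded.csv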

camachop-dbhi commented 2 years ago

Cool thanks! That worked 👍 Testing now, should be able to communicate the results of this testing tomorrow

camachop-dbhi commented 2 years ago

Hey @cole-brokamp

Just had a chance to check in on the ghcr.io/degauss-org/census_block_group:largest process I set to run yesterday and it looks like the resulting output file was free of duplicates 👍 ✨

Testing Run Time Details

3,210,310 addresses were provided as input and 3,210,310 records were returned as output.

The run time for this test was around 6 hours and 29 minutes, which isn't significantly longer than any prior run times.

All that said, it looks like this update had the desired effect of breaking any ties for census block group values in the data.

Thanks again for putting this together so fast! And feel free to reach out if you have any additional requests for me before pushing this out 😄

erikarasnick commented 2 years ago

@camachop-dbhi Thanks for testing! We have released a new version of the container that officially incorporates these changes. It can be called using

ghcr.io/degauss-org/census_block_group:0.4.1

The README has been updated with this version as well.