cmu-delphi / covidcast

R and Python packages supporting Delphi's COVIDcast effort.
https://delphi.cmu.edu/covidcast/

County out of place in R bubble plots #76

Closed: capnrefsmmat closed this issue 4 years ago

capnrefsmmat commented 4 years ago

There's a county being placed somewhere in Canada in the bubble plots (just under "confirmed"):

[Screenshot: Screen Shot 2020-09-11 at 7 29 31 PM]

ryantibs commented 4 years ago

That was an Easter egg to acknowledge my Canadian heritage!

Just kidding. I think I know why this is happening. I'll try to get to it sometime this weekend, as well as #73 and Andrew's comments in #43.

capnrefsmmat commented 4 years ago

@ryantibs FYI I have half an attempt to fix #73 in progress, except in my fix, the legend scales by area and the actual plot still seems to scale by radius... not sure what's going on. So I guess feel free to make a separate attempt to fix it, because I'm currently stumped.

ryantibs commented 4 years ago

So this turns out to be a bit different from what I originally thought. It looks like a bug with usmap::usmap_transform().

That misplaced bubble looks to be Aleutians West:

     STATE CWA     COUNTYNAME  FIPS FE_AREA       LON     LAT
3297    AK AFC Aleutians West 02016      NA -111.1392 52.8883

The usmap package tucks Alaska in nicely in the bottom left (next to where it places Hawaii) but somehow gets this wrong. I also think that the extra counties floating around the coast of Hawaii could be Alaska errors too. Many of the Alaska counties look kind of misaligned.

capnrefsmmat commented 4 years ago

It looks like the usmap package specifies a bounding box for Alaska and gives special treatment to everything in that box: https://github.com/pdil/usmap/blob/master/R/transform.R#L87

Is it possible that the Aleutians don't entirely fit inside that bounding box? If so, we could report a bug in their repo and give the example of the island that's not in the box.

JedGrabman commented 4 years ago

I think the issue is actually in our county_geo dataset:

> covidcast::county_geo[covidcast::county_geo$FIPS == "02016",]
     STATE CWA     COUNTYNAME  FIPS FE_AREA       LON     LAT
3297    AK AFC Aleutians West 02016    <NA> -111.1392 52.8883

The dot is placed at that latitude and longitude on the map. My theory is that the centroid is miscalculated because the Aleutians are long enough that they extend into the Eastern hemisphere, causing the numeric representation to change from negative to positive.
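To illustrate the theory above, here is a minimal Python sketch of how a centroid can go wrong when a region straddles the 180° meridian. The longitudes are made-up illustrative values, not the actual Aleutians West geometry, and this does not reproduce how the shapefile's centroid was actually computed; it only shows the general failure mode of averaging signed longitudes and one common alternative (a circular mean).

```python
import math

def naive_mean_lon(lons):
    """Arithmetic mean of signed longitudes; breaks across the antimeridian."""
    return sum(lons) / len(lons)

def circular_mean_lon(lons):
    """Mean longitude computed on the unit circle, so +179 and -179 are neighbors."""
    x = sum(math.cos(math.radians(lon)) for lon in lons)
    y = sum(math.sin(math.radians(lon)) for lon in lons)
    return math.degrees(math.atan2(y, x))

# Hypothetical points on both sides of the 180-degree meridian.
lons = [-170.0, -176.0, 178.0, 172.0]

naive = naive_mean_lon(lons)        # lands near the prime meridian, far from every point
circular = circular_mean_lon(lons)  # stays close to the antimeridian

print(f"naive: {naive:.1f}, circular: {circular:.1f}")
```

The naive average ends up nowhere near any of the input points, which is the kind of displacement that would put a bubble in the wrong place on the map.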

capnrefsmmat commented 4 years ago

Huh. So the process that produces our county geographies produces a bad one. That data is produced here: https://github.com/cmu-delphi/covidcast/blob/main/R-packages/data-raw/make.R#L35-L43

Perhaps our data source (the shapefile in that directory) got it wrong, and we need to correct the data frame?

JedGrabman commented 4 years ago

I checked and the error is present in our data source.

Since we're using usmap for our mapping purposes, it might fix some of our alignment issues if we just used their county locations as well. I see they have a csv with centroids here: https://github.com/pdil/usmap/blob/e716aa3e7191dd576b198956ca769b012b8af37d/data-raw/maps/us_counties_centroids.csv

capnrefsmmat commented 4 years ago

That sounds like a good option. Can you check if their data file covers the same FIPS codes as ours, and if so, switch us to use it?

JedGrabman commented 4 years ago

Can do. Can you assign this issue to me? I don't think I have permission.

JedGrabman commented 4 years ago

It looks like both datasets cover the same FIPS codes.
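For reference, a coverage check like this can be sketched as a set comparison. This is a hypothetical Python illustration with a handful of sample codes, not the check that was actually run; the function name `compare_fips` is made up. FIPS codes are kept as strings so leading zeros (as in 02016) are preserved.

```python
def compare_fips(ours, theirs):
    """Return the FIPS codes that appear in only one of the two collections."""
    ours, theirs = set(ours), set(theirs)
    return {
        "only_ours": sorted(ours - theirs),
        "only_theirs": sorted(theirs - ours),
    }

# Small illustrative samples; the real datasets have thousands of rows.
ours = ["02016", "36085", "42003"]
theirs = ["42003", "02016", "36085"]

diff = compare_fips(ours, theirs)
print(diff)
```

Two empty lists mean the two sources cover exactly the same set of codes.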

There are some differences in how counties are named, mostly things like City of Foo vs. Foo City or St. Foo vs. Saint Foo. The more systematic difference is that we store names as Foo and usmap stores them as Foo County. This is meaningful for locations that are not counties. For example, on our public-facing maps we incorrectly display the names of most locations in Alaska and Louisiana, which are not counties but boroughs and parishes respectively. usmap is also more accurate in this regard.
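If the two name columns ever need to be compared directly, a normalization step along these lines could reconcile the variants listed above. This is a rough Python sketch with a made-up helper, `normalize_name`, and an assumed (incomplete) suffix list; it is meant for matching only, since dropping the suffix throws away exactly the county/borough/parish distinction that usmap gets right.

```python
import re

def normalize_name(name):
    """Reduce a county-like name to a comparison key (not for display)."""
    s = name.strip().lower()
    # "St. Foo" / "St Foo" -> "saint foo"
    s = re.sub(r"\bst\.?\s+", "saint ", s)
    # "City of Foo" -> "foo city"
    m = re.match(r"city of (.+)", s)
    if m:
        s = m.group(1) + " city"
    # Drop a trailing unit suffix; this list is an assumption, not exhaustive.
    s = re.sub(r"\s+(county|parish|borough|census area)$", "", s)
    return s

for a, b in [("St. Foo", "Saint Foo County"),
             ("City of Foo", "Foo City"),
             ("Foo", "Foo Parish")]:
    print(a, "|", b, "->", normalize_name(a) == normalize_name(b))
```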

Also, we sometimes have multiple locations for the same FIPS code. FIPS code 36085 (Richmond, NY) is listed twice at slightly different latitudes and longitudes. So, usmap definitely seems like the more reliable source. We'll just need to be careful not to cause any bugs with name parsing.
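Duplicates like this are easy to surface by grouping rows on the FIPS code. The Python sketch below is illustrative only: `duplicate_fips` is a made-up helper, and the two coordinate pairs for 36085 are invented placeholders rather than the values in our data.

```python
from collections import defaultdict

def duplicate_fips(rows):
    """Map each FIPS code that occurs more than once to its list of (lon, lat) pairs."""
    seen = defaultdict(list)
    for fips, lon, lat in rows:
        seen[fips].append((lon, lat))
    return {fips: pts for fips, pts in seen.items() if len(pts) > 1}

rows = [
    ("36085", -74.15, 40.58),      # placeholder coordinates
    ("36085", -74.14, 40.56),      # placeholder coordinates
    ("02016", -111.1392, 52.8883), # the row quoted earlier in this thread
]

dups = duplicate_fips(rows)
print(dups)
```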

capnrefsmmat commented 4 years ago

That sounds good. Note that the COVIDcast website doesn't depend on the data in the R package; it has its own data files listing counties and their locations. So you need not worry about breaking the website.

I think it makes sense to go ahead and switch the R package completely to usmap, provided it has everything we need, so we can make geography into somebody else's problem.

ryantibs commented 4 years ago

Thanks Jed for all the progress here! So now I take it that we're not using the old county_geo and state_geo at all, for anything?

I peeked at the details of your PR #122, and made a comment about how you could just overwrite the county_geo and state_geo data frames with the new geo info from the usmap package. That way you don't have to read from csv each time you do plotting.

JedGrabman commented 4 years ago

Agreed. I left more comments on the pull request. If usmap made the centroid data available through another means, I might suggest deleting the county_geo and state_geo files entirely to simplify things. However, I think your suggestion to use them essentially to cache data improves our stability since we're relying on internal details of usmap now.

ryantibs commented 4 years ago

We are all good here; I just merged PR #122. Thanks Jed.