Closed capnrefsmmat closed 4 years ago
That was an Easter egg to acknowledge my Canadian heritage!
Just kidding. I think I know why this is happening. I'll try to get to this sometime this weekend, as well as #73 and Andrew's comments in #43.
@ryantibs FYI I have half an attempt to fix #73 in progress, except in my fix, the legend scales by area and the actual plot still seems to scale by radius... not sure what's going on. So I guess feel free to make a separate attempt to fix it, because I'm currently stumped.
So this turns out to be a bit different from what I thought originally. It looks like a bug with usmap::usmap_transform().
That misplaced bubble looks to be Aleutians West:
| | STATE | CWA | COUNTYNAME | FIPS | FE_AREA | LON | LAT |
|---|---|---|---|---|---|---|---|
| 3297 | AK | AFC | Aleutians West | 02016 | NA | -111.1392 | 52.8883 |
The `usmap` package tucks Alaska in nicely in the bottom left (next to where it places Hawaii) but somehow gets this wrong. I also think that the extra counties floating around the coast of Hawaii could be Alaska errors. Many of the Alaska counties look kind of misaligned.
It looks like the usmap package specifies a bounding box for Alaska and gives special treatment to everything in that box: https://github.com/pdil/usmap/blob/master/R/transform.R#L87
Is it possible that the Aleutians don't entirely fit inside that bounding box? If so, we could report a bug in their repo and give the example of the island that's not in the box.
I think the issue is actually in our county_geo dataset:

    > covidcast::county_geo[covidcast::county_geo$FIPS == "02016",]
         STATE CWA     COUNTYNAME  FIPS FE_AREA       LON     LAT
    3297    AK AFC Aleutians West 02016    <NA> -111.1392 52.8883

The dot is placed at that latitude and longitude on the map. My theory is that the centroid is miscalculated because the Aleutians are long enough that they extend into the Eastern hemisphere, causing the numeric representation to change from negative to positive.
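To illustrate the theory (not to reproduce the actual shapefile computation), here is a minimal Python sketch. The longitudes are made-up values for a region that straddles the 180° line, and both helper functions are hypothetical. It averages points rather than computing a true polygon centroid, but it shows how a plain arithmetic mean of longitudes can land far from every input once the sign flips.

```python
import math

def naive_mean_lon(lons):
    """Arithmetic mean of longitudes in degrees; unreliable across the 180° line."""
    return sum(lons) / len(lons)

def circular_mean_lon(lons):
    """Mean longitude taken on the unit circle, so 179 and -179 average near 180."""
    x = sum(math.cos(math.radians(lon)) for lon in lons)
    y = sum(math.sin(math.radians(lon)) for lon in lons)
    return math.degrees(math.atan2(y, x))

# Hypothetical longitudes for a region crossing the antimeridian
# (illustrative values only, not taken from our data source).
lons = [-170.0, -175.0, 179.0, 175.0, 172.0]

print(naive_mean_lon(lons))     # far away from all of the input points
print(circular_mean_lon(lons))  # stays close to the 180° line
```

If the theory is right, a centroid computed with sign-naive arithmetic would be pulled eastward in a similar way, which would be consistent with the bubble ending up over Canada.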
Huh. So the process that produces our county geographies produces a bad one. That data is produced here: https://github.com/cmu-delphi/covidcast/blob/main/R-packages/data-raw/make.R#L35-L43
Perhaps our data source (the shapefile in that directory) got it wrong, and we need to correct the data frame?
I checked and the error is present in our data source.
Since we're using `usmap` for our mapping purposes, it might fix some of our alignment issues if we just used their county locations as well. I see they have a csv with centroids here:
https://github.com/pdil/usmap/blob/e716aa3e7191dd576b198956ca769b012b8af37d/data-raw/maps/us_counties_centroids.csv
That sounds like a good option. Can you check if their data file covers the same FIPS codes as ours, and if so, switch us to use it?
Can do. Can you assign this issue to me? I don't think I have permission.
It looks like the two files cover the same FIPS codes.
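For anyone who wants to repeat the comparison, the check amounts to a set difference in both directions. This is a rough Python sketch with hypothetical sample codes; `compare_fips` is an illustrative helper, not something in either package. FIPS codes are kept as strings so leading zeros (as in 02016) survive.

```python
def compare_fips(ours, theirs):
    """Return (codes only in ours, codes only in theirs) as sorted lists."""
    ours_set, theirs_set = set(ours), set(theirs)
    return sorted(ours_set - theirs_set), sorted(theirs_set - ours_set)

# Hypothetical samples; the real inputs would be the FIPS columns of the two files.
ours = ["02016", "36085", "36085", "42003"]
theirs = ["02016", "36085", "42003"]

only_ours, only_theirs = compare_fips(ours, theirs)
print(only_ours, only_theirs)  # both lists are empty when coverage matches
```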
There are some differences in how counties are named, mostly things like `City of Foo` vs. `Foo City` or `St. Foo` vs. `Saint Foo`.
The more systematic difference is that we store names as `Foo` and usmap stores them as `Foo County`. This is meaningful for locations that are not counties. For example, on our public-facing maps we incorrectly display the names of most locations in Alaska and Louisiana, which are not counties but boroughs and parishes respectively. usmap is more accurate in this regard.
Also, we sometimes have multiple locations for the same FIPS code. FIPS code 36085 (Richmond, NY) is listed twice at slightly different latitudes and longitudes. So, usmap definitely seems like the more reliable source. We'll just need to be careful not to cause any bugs with name parsing.
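A quick way to list repeated codes like this one is to count occurrences. The following is only a sketch in Python with a hypothetical sample column; `duplicated_fips` is an illustrative name, not an existing function.

```python
from collections import Counter

def duplicated_fips(fips_codes):
    """Return the sorted FIPS codes that appear more than once."""
    counts = Counter(fips_codes)
    return sorted(code for code, n in counts.items() if n > 1)

# Hypothetical sample of a FIPS column, with 36085 repeated as described above.
rows = ["02016", "36085", "36085", "42003"]
print(duplicated_fips(rows))
```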
That sounds good. Note that the COVIDcast website doesn't depend on the data in the R package; it has its own data files listing counties and their locations. So you need not worry about breaking the website.
I think it makes sense to go ahead and switch the R package completely to `usmap`, provided it has everything we need, so we can make geography into somebody else's problem.
Thanks Jed for all the progress here! So now I take it that we're not using the old `county_geo` and `state_geo` at all, for anything?
I peeked at the details of your PR #122, and made a comment about how you could just overwrite the `county_geo` and `state_geo` data frames with the new geo info from the `usmap` package. That way you don't have to read from csv each time you do plotting.
Agreed. I left more comments on the pull request. If usmap made the centroid data available through another means, I might suggest deleting the `county_geo` and `state_geo` files entirely to simplify things. However, I think your suggestion to use them essentially as a cache improves our stability, since we're relying on internal details of usmap now.
We are all good here; I just merged PR #122. Thanks Jed.
There's a county being placed somewhere in Canada in the bubble plots (just under "confirmed"):