cmu-delphi / covidcast

R and Python packages supporting Delphi's COVIDcast effort.
https://delphi.cmu.edu/covidcast/

Provide examples or integration to produce state and county maps in Python package #18

Closed capnrefsmmat closed 4 years ago

capnrefsmmat commented 4 years ago

The R package provides mapping; the Python package does not. Admittedly, it's less common for Python packages to provide their own plotting than it is for R packages.

We should look at Cartopy and other relevant Python packages and see what it'd take to make mapping easy in the Python package. That need not mean implementing maps ourselves -- we could, for example, discover that by rearranging our data in a certain way, it's easy to provide to Cartopy. Then we can just provide the right adapter and give some examples of usage in the documentation.

This probably depends on #13.

chinandrew commented 4 years ago

I think something like geopandas should provide some of the choropleth functionality, and also seems to support bubble plots. I haven't used it before but will poke around.

Looks like it just needs the shapefiles, which we could pull from the R data like you mentioned in #13, or grab from the Census and other sources.

Any priority on what types of plots/geo boundaries are most important to have implemented first?

capnrefsmmat commented 4 years ago

I think ordinary choropleth plots at county and state levels are the first priority; metropolitan statistical areas can come later, and HRRs (healthcare referral regions) are optional.

Note one goofy feature of our counties. Counties have a unique 5-digit identifying FIPS code; the first two digits identify the state. Suppose we have data for most of the counties in a state, but for some counties the sample size is too small to report. We group those counties into a "megacounty" with code XX000, representing the "rest of" the state not covered by the counties that are reported. You can see this on the COVIDcast map for signals like Doctor's Visits and the symptom surveys. We'd want our county plots to support this.

Bubble maps are a second priority, I think.

chinandrew commented 4 years ago

Got it, thanks for the priorities and also for pointing out the megacounty thing.

Ended up spending most of my time today wrestling with getting the package installed and running. I had R 3.4, upgraded to 3.6, and then got an error on the foreign package, which requires 4.0, so I had to upgrade again. Also hit some DNS weirdness that led to timeouts, but I think that's just my setup. Linking in case it ever comes up for anyone else.

Below is my first pass at this with geopandas (top) compared to the R version (bottom) for the Aug 4 data, with the colors set to the closest default matplotlib colormap I could find. Need to polish that + styling, as well as figure out Alaska and Hawaii since they're both in their correct positions from the Census shapefiles and the projection is a bit wonky.

Based on the code/comments and running a few examples, looks like the current megacounty implementation in R sets all the "small counties" to the megacounty value, which I've replicated. I also see the TODO note to do this differently so you actually get one megacounty shape which I'll play around with a bit.
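For reference, the "set every small county to the megacounty value" behaviour described above can be sketched in a few lines of plain Python. This is only an illustration, not the package's code: the function name `fill_megacounties`, the dict-based input, and the sample values are all made up here, and it assumes FIPS codes are 5-character strings with megacounties coded as the 2-digit state prefix plus "000".

```python
def fill_megacounties(reported, all_fips):
    """Give every county a value, falling back to its state's megacounty.

    reported: dict mapping 5-digit FIPS strings to values; may include
              "XX000" megacounty entries (hypothetical input format).
    all_fips: iterable of every county FIPS code that will be drawn.
    """
    filled = {}
    for fips in all_fips:
        if fips in reported:
            # County is reported directly, so use its own value.
            filled[fips] = reported[fips]
        else:
            # First two digits identify the state; "XX000" is its megacounty.
            megacounty = fips[:2] + "000"
            # None when the state has no megacounty entry either.
            filled[fips] = reported.get(megacounty)
    return filled


# Made-up values: one reported Pennsylvania county plus the PA megacounty.
reported = {"42003": 1.5, "42000": 0.7}
all_counties = ["42001", "42003", "06037"]
result = fill_megacounties(reported, all_counties)
```

Here `42001` is unreported, so it picks up the `42000` megacounty value, while `06037` stays empty because no California megacounty was supplied.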

[Image: python_choropleth]

[Image: R_choropleth]

capnrefsmmat commented 4 years ago

Nice, that looks quite good.

I seem to recall it being possible to produce animations with matplotlib. I wonder if we could produce animations of the maps that show how the signals change over time...

A couple other points:

chinandrew commented 4 years ago

On shape data files, is it preferable to package with the releases? It's all pulled from the Census, so shouldn't be any license issues. Don't think it gets updated too often (yearly?) in which case I'm not sure it's worth having the client grab at runtime

chinandrew commented 4 years ago

> I seem to recall it being possible to produce animations with matplotlib. I wonder if we could produce animations of the maps that show how the signals change over time...

That'd be cool, I can take a look after core functionality is done. In the meantime, a slightly techy person should be able to just loop over each day and combine the saved plots into a gif or something.

> Some of our data sources provide values for Puerto Rico, so we should try to support that along with Alaska and Hawaii. Not sure if the R package does that currently. (If it doesn't, we should file an issue so we remember to add it.)

Got it, looks like there's no Puerto Rico data for doctor visits, but there is for some of the FB survey data. I'm not sure the R package is plotting it (see screenshot). When I run fips_info(72) from the usmap package I get an invalid FIPS error, even though Puerto Rico is 72. Added as issue #24.

[Image: Screenshot from 2020-08-09 22-53-58]

> While Ryan tried to make the color scale in the R package match the COVIDcast website perfectly, my feeling for the Python package is that we should make it integrate well with matplotlib/geopandas/etc. so the users can set color scales in the ways they're most familiar with. Then perhaps we can provide a custom default if we want.

Makes sense. My plan at the moment is to separate the methods that return the geoDF from the ones that actually plot the data in the geoDF, so users can use the former if they want a bit more control. We can set some defaults for the latter for the less technical users.

I've got Alaska and Hawaii (and Puerto Rico) working now. There's a bit of projection discrepancy between the website and the R package, and I'm not sure what exact projections are being used by either. All the usmap docs say is that it's Albers Equal Area, so I'm using the following projections, which seem to match the website but not the R Hawaii shape:

- ESRI:102003 (USA Contiguous Albers Equal Area Conic) for the contiguous US and Puerto Rico
- ESRI:102006 (Alaska Albers Equal Area Conic) for Alaska
- ESRI:102007 (Hawaii Albers Equal Area Conic) for Hawaii

I've also tried ESRI:102008 (North America Albers Equal Area Conic) which doesn't change much for Hawaii.
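As a rough sketch of how that region-to-projection choice could be keyed off the state part of the FIPS code (the names `CONTIGUOUS_CRS`, `STATE_CRS`, and `crs_for_fips` are hypothetical, not from the package; state FIPS 02 is Alaska and 15 is Hawaii, and Puerto Rico, 72, falls through to the contiguous projection as described above):

```python
# Projection used for the contiguous US and, per the comment above, Puerto Rico.
CONTIGUOUS_CRS = "ESRI:102003"

# State FIPS prefixes that get their own Albers projection.
STATE_CRS = {
    "02": "ESRI:102006",  # Alaska
    "15": "ESRI:102007",  # Hawaii
}


def crs_for_fips(fips):
    """Return the CRS string for a state or county FIPS code (as a string)."""
    # The first two digits of a county FIPS code identify the state.
    return STATE_CRS.get(fips[:2], CONTIGUOUS_CRS)


alaska_crs = crs_for_fips("02013")
hawaii_crs = crs_for_fips("15001")
puerto_rico_crs = crs_for_fips("72001")
```

The resulting string is the kind of value that could be handed to geopandas' `GeoDataFrame.to_crs` for each region's subset of shapes before repositioning Alaska, Hawaii, and Puerto Rico on the final figure.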

Current plots w/ positions + sizes approximately near those on the website:

Hawaii and Alaska:

[Image: Screenshot from 2020-08-09 23-53-34]

+ Puerto Rico (different signal):

[Image: Screenshot from 2020-08-10 00-24-22]

I still have to get the color scale right, since I haven't replicated the R/website color breaks yet. I'll work on that and state plots next, which I'm hoping I can also reuse to do the "proper" megacounties.

capnrefsmmat commented 4 years ago

> On shape data files, is it preferable to package with the releases? It's all pulled from the Census, so shouldn't be any license issues. Don't think it gets updated too often (yearly?) in which case I'm not sure it's worth having the client grab at runtime

Agreed; I think it's fine to include them in the package and use pkg_resources to load them out of the package when needed.

The maps are looking good. I think your approach of using the three different Albers projections is reasonable, since it prevents distortions of Alaska and Hawaii. It's unfortunate that usmap doesn't specify its projections further; that seems to be a common problem with packages that do mapping but don't realize how complicated it is to deal with projections. We should make a point of documenting our projections clearly.

ryantibs commented 4 years ago

Hi @chinandrew, this looks like some great progress so far: nice work, and thank you! I'm just quickly popping in here to say that doing "megacounties the right way" is only going to lead to a tiny visual change that most people won't notice, so I don't think it should be a high priority. If you have interest + free cycles, then I think we could find several other issues that would be a higher priority.

chinandrew commented 4 years ago

> Hi @chinandrew, this looks like some great progress so far: nice work, and thank you! I'm just quickly popping in here to say that doing "megacounties the right way" is only going to lead to a tiny visual change that most people won't notice, so I don't think it should be a high priority. If you have interest + free cycles, then I think we could find several other issues that would be a higher priority.

Makes sense, will punt doing that for now.