IDEMSInternational / R-Instat

A statistics software package powered by R
http://r-instat.org/
GNU General Public License v3.0

Simple gg mapping #3937

Open dannyparsons opened 7 years ago

dannyparsons commented 7 years ago

We need to include some very simple mapping in the procurement menu. We need to be able to plot a value, such as the mean corruption risk index or more simply, the number of contracts, for a country or regions in a country.

Firstly, we just need to know how to do this in R, with ggplot, probably using the ggmap package, but I'm not sure if that's the only ggplot option. Attached is a sample dataset which is just a count of contracts for a set of countries which I would like to be displayed as a typical "heat" map, like in the image below:

image

So the first task is to produce a simple map like this using ggplot for the sample data. Once we know the R code we can then think about a dialog.

dannyparsons commented 7 years ago

country_counts.zip

dannyparsons commented 7 years ago

Here's my current understanding of how to do the sort of mapping we would like using ggplot2.

ggmap is a nice package but not what we need at the moment. This is for pulling down maps from e.g. Google Maps. It requires internet access and is most useful for doing detailed maps like street level maps.

The maps package has a definition of country boundaries which can be reshaped ready for ggplot2 using ggplot2::map_data e.g.

world <- ggplot2::map_data("world")
head(world)
       long      lat group order region subregion
1 -69.89912 12.45200     1     1  Aruba      <NA>
2 -69.89571 12.42300     1     2  Aruba      <NA>
3 -69.94219 12.43853     1     3  Aruba      <NA>
4 -70.00415 12.50049     1     4  Aruba      <NA>
5 -70.06612 12.54697     1     5  Aruba      <NA>
6 -70.05088 12.59707     1     6  Aruba      <NA>

Then this can be plotted by:

ggplot(world, aes(x = long, y = lat, group = group)) + geom_polygon()

image

If you then have your own data frame with long and lat values for points of interest this is just adding a geom_point layer on top.
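As a minimal sketch of that point layer, assuming a hypothetical data frame called sites with long and lat columns (the names and approximate coordinates below are made up for illustration and are not part of the sample data):

```r
library(ggplot2)

# Country outlines from the maps package, reshaped for ggplot2
world <- map_data("world")

# Hypothetical points of interest (approximate coordinates, illustration only)
sites <- data.frame(
  name = c("Nairobi", "Accra", "Cairo"),
  long = c(36.82, -0.19, 31.24),
  lat  = c(-1.29, 5.60, 30.04),
  stringsAsFactors = FALSE
)

# group is only needed by the polygon layer, so it is set there;
# the point layer inherits x = long and y = lat from the top-level aes()
p <- ggplot(world, aes(x = long, y = lat)) +
  geom_polygon(aes(group = group), fill = "grey85", colour = "white") +
  geom_point(data = sites, colour = "red", size = 2) +
  coord_quickmap()

print(p)
```

coord_quickmap() is optional here; it just keeps the aspect ratio sensible without a full projection.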

A heat map is more difficult, because this is done by adding fill to the polygon layer, which means merging with the world data.

So take the country counts data above

country_counts <- rio::import("C:/Users/Danny/Downloads/country_counts.csv")
head(country_counts)
       country count
1      Algeria    40
2       Angola     3
3        Benin     9
4     Botswana     2
5 Burkina Faso    24
6      Burundi    13

I can do a right_join to merge the counts and drop all the countries without a count:

library(dplyr)
mer <- right_join(world, country_counts, by = c(region = "country"))
head(mer)
      long      lat group order  region subregion count
1 8.576563 36.93721   486 35622 Algeria      <NA>    40
2 8.597656 36.88388   486 35623 Algeria      <NA>    40
3 8.601269 36.83393   486 35624 Algeria      <NA>    40
4 8.506739 36.78750   486 35625 Algeria      <NA>    40
5 8.444238 36.76074   486 35626 Algeria      <NA>    40
6 8.369629 36.63252   486 35627 Algeria      <NA>    40

Then plot this:

ggplot(mer, aes(x = long, y = lat, group = group, fill = count)) + geom_polygon()

image

If I do a left_join I keep all the outlines of the other countries:

mer <- left_join(world, country_counts, by = c(region = "country"))
ggplot(mer, aes(x = long, y = lat, group = group, fill = count)) + geom_polygon()

image

There are a few problems with this:

  1. Country names might not match when merging. The world data only has names, no ISO codes to match on. It could be very difficult to get the names consistent in your own data to match world.
  2. Often you would want something in between the two plots above, like the second plot but only showing Africa. The world data doesn't have categories like this so we can't easily subset for a continent.
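For the first problem, one way to see which names fail to match before merging is an anti_join. The sketch below uses small toy data frames so that it runs without the map data: world_names, the lookup table, and the third row of the counts are invented for illustration. With the real data, world_names could be built with dplyr::distinct(world, region).

```r
library(dplyr)

# Toy stand-in for the region names in the boundary data (illustrative only)
world_names <- data.frame(
  region = c("Algeria", "Angola", "Ivory Coast"),
  stringsAsFactors = FALSE
)

# Toy counts; the third row is invented and deliberately spelled differently
country_counts <- data.frame(
  country = c("Algeria", "Angola", "Cote d'Ivoire"),
  count   = c(40, 3, 7),
  stringsAsFactors = FALSE
)

# Rows of country_counts that have no matching region name
unmatched <- anti_join(country_counts, world_names, by = c(country = "region"))
print(unmatched$country)

# Hypothetical lookup from the user's spelling to the boundary data's spelling
lookup <- c("Cote d'Ivoire" = "Ivory Coast")
needs_fix <- country_counts$country %in% names(lookup)
country_counts$country[needs_fix] <- unname(lookup[country_counts$country[needs_fix]])

# After recoding, nothing should be left unmatched
still_unmatched <- anti_join(country_counts, world_names, by = c(country = "region"))
print(nrow(still_unmatched))
```

Running the anti_join first makes the mismatches visible, rather than having them silently drop out of (or appear as NA in) the merged data.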

There are other datasets that could help with these issues.

rworldmap::countryRegions lists countries and different region categories, which could be merged with world to then subset for a continent. However, this isn't straightforward because the names in each don't match perfectly. There is another dataset rworldmap::countrySynonyms which has different common names for countries which could be useful for merging.
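A cruder alternative for the second problem, which sidesteps the merge with rworldmap::countryRegions altogether, is to keep the left_join result and crop the plotting window. This is only a sketch: the bounding box values are rough assumptions for Africa, and the counts are just the first six rows of the sample shown earlier.

```r
library(ggplot2)
library(dplyr)

world <- map_data("world")

# First six rows of the sample counts shown earlier
country_counts <- data.frame(
  country = c("Algeria", "Angola", "Benin", "Botswana", "Burkina Faso", "Burundi"),
  count   = c(40, 3, 9, 2, 24, 13),
  stringsAsFactors = FALSE
)

# left_join keeps every outline and preserves the row order of world,
# which matters because geom_polygon draws points in row order
mer <- left_join(world, country_counts, by = c(region = "country"))

# Rough bounding box for Africa (assumed values); this zooms, it does not subset
p <- ggplot(mer, aes(x = long, y = lat, group = group, fill = count)) +
  geom_polygon(colour = "white") +
  coord_quickmap(xlim = c(-20, 55), ylim = c(-37, 40))

print(p)
```

Because this only zooms, neighbouring areas inside the box (for example parts of southern Europe and the Middle East) are still drawn, as unfilled outlines.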

To make these maps look nice we would want options like labels for the countries, adding capital cities with labels etc. Surprisingly, this isn't all straightforward in ggplot2. There is maps::world.cities which has details about cities which can easily be plotted. Again, the issue is subsetting because this may have to be merged to only plot cities in certain countries/regions. Adding labels to polygons is easy, but getting the positioning right might not be. There are complicated methods of finding the centroid of the polygon etc. to give good positions.
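A rough sketch of both ideas, under some assumptions: the three countries are an arbitrary selection, and the label position is simply the midpoint of each country's bounding box, which is not a true centroid and can land badly for irregular shapes. The capital and country.etc columns are those of maps::world.cities.

```r
library(ggplot2)
library(dplyr)

world <- map_data("world")

# Arbitrary example selection of countries
keep <- c("Algeria", "Angola", "Botswana")
sel <- filter(world, region %in% keep)

# Crude label position: midpoint of each country's bounding box
label_pos <- sel %>%
  group_by(region) %>%
  summarise(long = mean(range(long)), lat = mean(range(lat)))

# Capital cities for the same countries (capital == 1 marks national capitals)
capitals <- maps::world.cities %>%
  filter(capital == 1, country.etc %in% keep)

p <- ggplot(sel, aes(x = long, y = lat)) +
  geom_polygon(aes(group = group), fill = "grey85", colour = "white") +
  geom_text(data = label_pos, aes(label = region)) +
  geom_point(data = capitals, colour = "red") +
  geom_text(data = capitals, aes(label = name), vjust = -1, size = 3) +
  coord_quickmap()

print(p)
```

Note that the cities filter relies on the same kind of name matching as the merge, so it has the same weakness when names differ between datasets.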

Interestingly, outside ggplot2 there are nice solutions to these. rworldmap::joinCountryData2Map can merge country data which uses rworldmap::countrySynonyms but can also use ISO codes. maps::map.cities() can add cities with labels to a plot and maps also has a nice option of adding country labels and does this in nice positions.

My impression is that these things are not yet standard and easy in ggplot2. I think it's still worth sticking with ggplot2 because of the advantages it has, but I was a bit surprised that there aren't yet standard and easy ways to do even simple maps.

Given that we only need a limited amount of mapping facilities for the procurement menu enhancements at this stage, I think the focus should be on delivering that, in a way that works, and not worry too much if it doesn't work completely generally at this stage. Although in the longer term, if we could do this well then I think we could be helping to fill a 'gap' by making some aspects of mapping in ggplot2 really easy.

rdstern commented 7 years ago

Very useful start to this topic. I also agree with the conclusion, namely that we stick with ggplot and try to get something working. This may omit some important aspects for now, but they can come later. Also, if some work has to be done in matching names for the maps to work then that has to be done - by us if general, but often by the user, because their data will be involved, and that is where the names may be inconsistent.

Some more comments in no particular order.

  1. Danny has been using geom_polygon above and that is clearly important. Some maps will have just that, for example counties in Kenya together with information (a number) on each county (polygon) that can be mapped to a colour. There is also geom_map, and I wonder how they compare. The author of one article said that he had compared the methods and now uses geom_polygon all the time. That's consistent with what Danny has done above.
  2. There seems to be quite a lot of useful information here https://eriqande.github.io/rep-res-eeb-2017/map-making-in-R.html#map-making-intro. This follows the R for Data Science book and has 3 sections on mapping, using ggplot2.
  3. There do seem to be some other useful packages we may wish to install. We may not need ggmap yet for other things, but apparently it has a blank theme (theme_nothing) that could be useful. Apparently the raster package is more general than just for raster-type data. And the relatively new sf (simple features) package looks important too. It also suggests that mapping in R may well be all that many users need, including those who currently feel they need a GIS system.

dannyparsons commented 7 years ago

  1. geom_map seems to be a wrapper for geom_polygon, but I couldn't get a clear understanding of the difference. I didn't see anyone having strong opinions about either, so I didn't look further, but it would be good to know.

  2. That's a great link, and a good overview and it actually goes through the types of map that we need initially, useful for anyone who wants to know about ggplot mapping. I had seen these notes but not in this form, it looks very interesting and quite broad like the section on GitHub.

rdstern commented 7 years ago

I wonder if the difference between the map and polygon might just be useful for us?

I notice that in the ggplot2 documentation for geom_map the examples section says: " When using geom_polygon, you will typically need two data frames: one contains the coordinates of each polygon (positions), and the other the values associated with each polygon (values). An id variable links the two together."

So he has just copied the example text from polygon for map. But when you look at the example itself I just wonder whether we might have a use for both? It is possible that the geom_map might save the merge. Or perhaps we could avoid doing the merging anyway, because of our linking? That seems to be the one difference, and it might be important for us. Hadley Wickham says geom_map is faster too.
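To make the comparison concrete, here is a minimal sketch of the geom_map form, following the pattern used in the ggplot2 documentation examples. The values stay in their own data frame and are linked to the boundaries through map_id, so no explicit merge is written. The counts are the first six rows of the sample data shown earlier.

```r
library(ggplot2)

# Boundary data: geom_map expects long/lat (or x/y) and region (or id) columns
world <- map_data("world")

# First six rows of the sample counts shown earlier
country_counts <- data.frame(
  country = c("Algeria", "Angola", "Benin", "Botswana", "Burkina Faso", "Burundi"),
  count   = c(40, 3, 9, 2, 24, 13),
  stringsAsFactors = FALSE
)

# map_id links each row of country_counts to the polygons with that region name
p <- ggplot(country_counts, aes(map_id = country, fill = count)) +
  geom_map(map = world) +
  # the data has no x/y columns, so take the axis limits from the map itself
  expand_limits(x = world$long, y = world$lat) +
  coord_quickmap()

print(p)
```

Only countries that appear in country_counts are drawn this way, so it behaves like the right_join plot. The name matching problem is unchanged: a country whose name differs from the region value is simply not drawn.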

This is very speculative, but let me also try to consider how a dialogue (or set of dialogues) might look:

  a) I suggest it is likely to become more than one dialogue, and it is an important topic. So Describe > Mapping (or Maps, or Spatial) comes after Describe > Specific.
  b) I wonder if we should have a Define Spatial Data dialogue. This might not be needed if your data is climatic or procurement, because it is defined there.
  c) Then (usually?) there will be (at least) 2 data frames, and that is relatively new for us. One is "ours" with our data, that we want to map. And (usually) there is also a more general data frame with country boundaries etc. The Define Spatial Data dialogue could specify the lat and long columns (and others if needed) based on our data, and also specify the general data frame associated with our data.
  d) Then I wonder if it would be useful to have a relatively simple dialogue to sort out the base for the map. This uses the general data frame(s). It could be quite simple for now, but might later include more data frames from ggmap or other sources. This would not use our own data, except that it would include the name of our data frame. So it could produce a base map object for the general data frame and also for our data frame, because of the links.
  e) Then we have the dialogue to add our own data. Here we know the lat and long columns from our definitions. We could have a single receiver for polygons and perhaps another for points (maybe even a third for contours?).
  f) Then we have our (usual?) sub-dialogues, i.e. perhaps up to 4, namely Polygon Options, Point Options, Contour Options and an overall one for Options, as now!
  g) I am conscious I don't have labels!