GSA / sdg-indicators-usa

U.S. National Reporting Platform for the Sustainable Development Goals
https://sdg.data.gov
MIT License

Open-source options for subnational requirements #858

Open brockfanning opened 6 years ago

brockfanning commented 6 years ago

@Kali2017SDG @philipashlock This is a write-up after some investigation into open-source options for adding subnational features to this platform. No action needed, just putting this up to help start the conversation, in case you'd like to consider an open-source vendor-less solution.

The requirements

Any subnational solution (as far as I can conceptualize it) needs to minimally satisfy 3 requirements:

  1. Data collection - i.e., how will we get data on the indicators at the subnational level?
  2. Data management - i.e., how will the subnational data be entered, maintained, and queried?
  3. Data visualization - i.e., how will the data be displayed and interacted with?

1. Data collection

Requirement 1 above is the area I have the least insight into, so for the purposes of this write-up, I’m going to naively assume that we have access to all the data we need (hooray!).

2. Data management

Meeting requirement 2, a data management solution, should start with the basic question: Can we continue to use Github/Prose for data management, or does the introduction of subnational data require another approach?

This is a tough question, because it touches on administrative concerns like data provider workflows and user access and permissions; and it also touches on technical considerations like client-side performance.

Because “if it ain’t broke don’t fix it” I’m going to proceed with the assumption that we will try to continue using Github/Prose for data management, with the caveat that future requirements related to workflows, access, permissions, and client-side performance may necessitate a switch to a separate data management system.

As for the nitty-gritty of the data management in Github/Prose, I’m assuming that it will be done using a subfolder-based approach. For example, right now this platform uses a “data” folder, with one CSV file for each indicator. With a subfolder approach, there would be a subfolder for each subnational region (U.S. state), e.g.: data/state/alabama, data/state/alaska, etc. Each of these subfolders would similarly have one CSV file for each indicator.
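To make the subfolder idea concrete, here is a minimal sketch of how client-side code might resolve the CSV path for an indicator. The helper names, the "indicator_<id>.csv" file naming, and the hyphenated slugs for multi-word states are assumptions for illustration only; the write-up only specifies the data/state/<state> folder layout.

```javascript
// Hypothetical helpers, not part of the platform.
// Only the "data/state/<state>" folder layout comes from the write-up;
// the file naming and slug rules below are assumptions.

// Turn a display name like "North Dakota" into a folder-friendly slug.
function slugify(name) {
  return name
    .trim()
    .toLowerCase()
    .replace(/[^a-z0-9]+/g, "-") // collapse spaces/punctuation into hyphens
    .replace(/^-+|-+$/g, "");    // drop leading/trailing hyphens
}

// Build the path to an indicator's CSV, nationally or for one state.
function indicatorCsvPath(indicatorId, stateName) {
  const file = `indicator_${indicatorId}.csv`; // assumed file naming
  if (!stateName) {
    return `data/${file}`; // national-level data stays in the top "data" folder
  }
  return `data/state/${slugify(stateName)}/${file}`;
}

console.log(indicatorCsvPath("8-1-1"));
console.log(indicatorCsvPath("8-1-1", "Alabama"));
```

Keeping the same file name in every state folder would let the same loading code serve both national and subnational data, with only the folder prefix changing.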

3. Data visualization

That leaves requirement 3, data visualization. For map visualizations, there seem to be 2 main types of open-source solutions out there: “vector-based”, and “tile-based”.

Tile-based

Tile-based solutions have the advantage of more choices of imagery (such as streets, terrain, satellite, etc.), but they carry an additional “moving part” by requiring the use of a tile server. There are free tile servers for light use, like OpenStreetMap’s, but relying on one might be a complication.
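For context on what that “moving part” involves, here is a rough sketch of the z/x/y tile addressing commonly used by OpenStreetMap-style tile servers: the client converts a longitude/latitude and zoom level into tile indices and requests each tile image over HTTP. The function names and the example.org URL template are hypothetical; mapping libraries normally do this internally.

```javascript
// Sketch of the common "slippy map" tile addressing scheme (Web Mercator).
// Function names and the URL template used below are illustrative assumptions.

// Convert a longitude/latitude (in degrees) to tile indices at a zoom level.
function lonLatToTile(lon, lat, zoom) {
  const n = 2 ** zoom; // number of tiles along each axis at this zoom
  const latRad = (lat * Math.PI) / 180;
  const x = Math.floor(((lon + 180) / 360) * n);
  const y = Math.floor(
    ((1 - Math.log(Math.tan(latRad) + 1 / Math.cos(latRad)) / Math.PI) / 2) * n
  );
  return { x, y, z: zoom };
}

// Fill a "{z}/{x}/{y}" style template with the tile indices.
function tileUrl(template, tile) {
  return template
    .replace("{z}", String(tile.z))
    .replace("{x}", String(tile.x))
    .replace("{y}", String(tile.y));
}

const tile = lonLatToTile(0, 0, 1);
console.log(tileUrl("https://tile.example.org/{z}/{x}/{y}.png", tile));
```

Every pan or zoom triggers more of these requests, which is why a tile-based map depends on a tile server being available.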

Vector-based

Vector-based solutions outline the map and fill the regions with color. Since we will presumably be exclusively displaying “choropleth” maps, this is probably all that we need. So, between the 2, I lean towards using a vector-based solution.
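As a rough illustration of the “fill the regions with color” step, a choropleth comes down to mapping each region’s value onto a small set of color buckets. The sketch below is plain JavaScript with a hypothetical function name and arbitrary example colors; libraries such as D3 provide their own scale utilities for this.

```javascript
// Minimal sketch of a choropleth color scale: split [min, max] into
// equal-width buckets, one per color. Names and colors are illustrative.
function makeQuantizeScale(min, max, colors) {
  const span = max - min;
  return function (value) {
    if (span === 0) {
      return colors[0]; // all regions share one value, so use a single color
    }
    const t = (value - min) / span; // position of the value within the range
    const bucket = Math.floor(t * colors.length);
    // Clamp so values at or beyond the ends still get the first/last color.
    const index = Math.min(colors.length - 1, Math.max(0, bucket));
    return colors[index];
  };
}

const blues = ["#eff3ff", "#bdd7e7", "#6baed6", "#2171b5"];
const colorFor = makeQuantizeScale(0, 100, blues);
console.log(colorFor(50));
```

Each region’s outline would then be drawn with the color returned for its value.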

Open-source mapping libraries

Here are a few well-maintained javascript libraries for both approaches:

  1. Leaflet is a tile-based, lightweight mapping library. Here is a choropleth example. Worth noting: the Tanzania and Mexico NRPs both use Leaflet.
  2. D3.js is a vector-based, general-purpose data visualization library. Here is a choropleth example. Worth noting: D3 is a very popular data-vis library with a strong following, and is the subject of many courses/tutorials/books.
  3. OpenLayers is tile-based, with lots of features. However, I could not find a choropleth example. It may be more geared towards other uses.
  4. jQuery Mapael is vector-based, lightweight, and depends on another library, Raphael. Here’s a choropleth example.

Recommendations and next steps

The next step would probably be to try a proof-of-concept with one or more of these libraries. As for recommendations, I would personally go with either Leaflet or D3, to start. Between those two I lean towards D3, because it doesn’t need a tile server. But ultimately I would recommend whichever one downloads the smallest amount of assets needed to get the job done, with the most easily maintainable integration code. That wouldn’t be clear until trying them out.

As always, any feedback is welcome.

brockfanning commented 6 years ago

@Kali2017SDG @SmithersA @philipashlock I've put up an ongoing proof-of-concept using D3 on my fork, using the test data that Kali provided for 8-1-1. You can see it by clicking on the "Map" tab after going here.

A few notes:

Let me know if you run into any trouble testing it. As always, feedback is welcome.

AnnCorp commented 6 years ago

Hi @brockfanning just wondering what did you use to produce the test map for 8.1.1.?

brockfanning commented 6 years ago

@AnnCorp That was done with D3. I have more work to do to make it truly nation-agnostic, but eventually it should be theoretically usable in the UK platform. It relies on the same data format (tidy) that you use. If you'd like a sneak peek at the code, the relevant files would be:

AnnCorp commented 6 years ago

Hi @brockfanning just wondering how things are going with this? Hoping the UK NRP developers will be exploring this kind of thing very soon and so planned to point them at this ticket and specifically your proof of concept http://brock.tips/sdg-indicators/8-1-1/. Anything else you think should be flagged up at the moment? Any further advice or information would be very welcome - thank you!

brockfanning commented 6 years ago

Hi @AnnCorp it's going well. I've got the proof-of-concept more abstracted now, so that the country-specific code is decoupled from the general mapping code. At this point I've stopped working on the functionality and have turned to getting more subnational data for the US. Here are the indicators done so far:

One statistical/tech note I wanted to mention about these - on some I noticed that extreme outliers could skew the coloring system in a way that made the choropleth map less useful. One example was North Dakota in 2012 on 8-1-1, and another example was District of Columbia for most of the years on 3-3-1. Because these outliers were so much higher than all the other regions, the map pretty much turned into only 2 colors: one for the outlier, and one for everything else.

The approach I've got in place on these proofs of concept is, for the purposes of the coloring system, to ignore any regions whose values are outside of 3 standard deviations from the mean. Here is the code that does that. The outliers are still displayed accurately on the map, but their values aren't included in the color legend.
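For readers without access to the linked code, the idea can be sketched as below. This is not the proof-of-concept's implementation: the function name is hypothetical, and the use of the population standard deviation is an assumption, since the thread does not say which variant is used.

```javascript
// Sketch only: compute the [min, max] used for the color legend while
// ignoring values more than `k` standard deviations from the mean.
// Assumes the population standard deviation; the function name is hypothetical.
function colorDomainWithoutOutliers(values, k = 3) {
  if (values.length === 0) {
    return null; // nothing to scale
  }
  const mean = values.reduce((sum, v) => sum + v, 0) / values.length;
  const variance =
    values.reduce((sum, v) => sum + (v - mean) ** 2, 0) / values.length;
  const sd = Math.sqrt(variance);

  // Keep only values within k standard deviations of the mean.
  const kept = values.filter((v) => Math.abs(v - mean) <= k * sd);
  return [Math.min(...kept), Math.max(...kept)];
}

// Nineteen ordinary values plus one extreme outlier.
const sample = [];
for (let i = 1; i <= 19; i++) sample.push(i);
sample.push(1000);
console.log(colorDomainWithoutOutliers(sample)); // the 1000 is left out of the domain
```

A region whose value falls outside the returned domain would still be drawn; with a clamped color scale it would simply take the color at the nearest end of the range.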

Kali2017SDG commented 6 years ago

Explanation of North Dakota outlier (Survey of Current Business, July 2013. p.116... note: statistics have been revised since this initial release via incorporating available source data): Mining has increased in importance in North Dakota’s economy as a result of the oil boom due to the recovery of oil from the Bakken region’s shale formation; in 2009, mining accounted for 3.5 percent of North Dakota’s current-dollar GDP, and in 2012, mining’s share had nearly tripled, accounting for 9.6 percent of the state’s current-dollar GDP.

brockfanning commented 6 years ago

@Kali2017SDG Ah, good to know!