gbif / pipelines

Pipelines for data processing (GBIF and LivingAtlases)
Apache License 2.0

Add GADM interpretation to records with coordinates #324

Closed MattBlissett closed 4 years ago

MattBlissett commented 4 years ago

To enable search and analysis by administrative region, add GADM at levels 0, 1, 2 and 3 to occurrences.

This should be an additional field, and not change or use dwc:stateProvince etc.

Depends on https://github.com/gbif/geocode/issues/6

timrobertson100 commented 4 years ago

Please bear in mind that we expect GADM to be used heavily (e.g. creating a choropleth map from an aggregation by code, or metrics for country pages aggregated by county), which may influence the ES structure we need.

MattBlissett commented 4 years ago

GADM will give us this sort of data:

 type  │     id      │      source      │ title  │ isocountrycode2digit 
───────┼─────────────┼──────────────────┼────────┼──────────────────────
 GADM0 │ JPN         │ http://gadm.org/ │ Japan  │ JP
 GADM1 │ JPN.26_1    │ http://gadm.org/ │ Nagano │ JP
 GADM2 │ JPN.26.40_1 │ http://gadm.org/ │ Nagawa │ JP

The ids are clearly structured. @fmendezh, is there something ES can do with this, or should it be four fields (some countries will have GADM3)?


Do we want to index the title, or just the id?
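
To illustrate what "clearly structured" could mean in practice, here is a minimal sketch that derives the id at every level from the most specific one. The id shape (three-letter code, dot-separated numbers, optional `_<version>` suffix shared by all levels) is an assumption inferred only from the three example rows above, and `gadm_ancestors` is a hypothetical helper, not part of GADM or the pipelines code.

```python
import re

# Assumed id shape, inferred from the example rows only:
#   <3-letter code>[.<n>[.<n>...]][_<version>]
_GADM_ID = re.compile(r"^([A-Z]{3})((?:\.\d+)*)(?:_(\d+))?$")


def gadm_ancestors(gadm_id):
    """Return the ids from level 0 down to the level of gadm_id itself."""
    match = _GADM_ID.match(gadm_id)
    if match is None:
        raise ValueError(f"unrecognised GADM id: {gadm_id!r}")
    country, path, version = match.groups()
    parts = [p for p in path.split(".") if p]

    ids = [country]  # level 0 carries no version suffix in the examples
    for depth in range(1, len(parts) + 1):
        prefix = ".".join([country] + parts[:depth])
        # Assumption: ancestors share the version suffix of the given id.
        ids.append(f"{prefix}_{version}" if version else prefix)
    return ids
```

For the Nagawa row, `gadm_ancestors("JPN.26.40_1")` yields the three ids in the table. If the assumption holds, a single stored id would be enough to recover the others, although separate fields may still be simpler to search and aggregate on.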

timrobertson100 commented 4 years ago

I think we need code and title since we'll want to include this in the occurrence JSON response.

I could imagine all these being useful:

"GADM": {
  "isoCountryCode": "JP",      // to help spot possible data errors
  "level0": {
    "code": "JPN",
    "title": "Japan"
  },
  "level1": {
    "code": "JPN.26_1",
    "title": "Nagano"
  },
  "level2": {
    "code": "JPN.26.40_1",
    "title": "Nagawa"
  }
}

We will definitely want to be able to search and aggregate counts by code, and I don't know if flattening data for ES will help with that (e.g. holding gadm0Code and gadm0Title).
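
For comparison, a rough sketch of what the flattened form could look like, starting from the nested structure above. The `gadm0Code` / `gadm0Title` naming follows the example in this comment; `flatten_gadm` itself is a hypothetical helper, and whether the flat layout actually helps ES searching or aggregation remains the open question.

```python
def flatten_gadm(gadm):
    """Turn {"level0": {"code": ..., "title": ...}, ...} into flat fields."""
    flat = {}
    for level in range(4):  # GADM levels 0 to 3
        entry = gadm.get(f"level{level}")
        if entry is None:
            continue  # not every record has every level
        flat[f"gadm{level}Code"] = entry["code"]
        flat[f"gadm{level}Title"] = entry["title"]
    return flat


nested = {
    "isoCountryCode": "JP",
    "level0": {"code": "JPN", "title": "Japan"},
    "level1": {"code": "JPN.26_1", "title": "Nagano"},
    "level2": {"code": "JPN.26.40_1", "title": "Nagawa"},
}
flat = flatten_gadm(nested)
```

Here `flat` holds six keys, `gadm0Code` through `gadm2Title`, and none for level 3, which this record lacks.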

MattBlissett commented 4 years ago

How important is the isoCountryCode?

GADM uses three letter codes (which we could map to two letters using Country), but it also includes an additional 7 custom codes:

 XAD   │ Akrotiri and Dhekelia
 XCA   │ Caspian Sea          
 XCL   │ Clipperton Island    
 XKO   │ Kosovo                 (NB we already have XKX from other sources)
 XNC   │ Northern Cyprus      
 XPI   │ Paracel Islands      
 XSP   │ Spratly Islands      

So these won't match up anyway (XCL would be FR from our Natural Earth interpretation). I could invent two-letter codes, or leave three-letter codes (but then there's little advantage over the level0 code).
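
A minimal sketch of the "leave them unmapped" option: translate the level 0 code to two letters where an ISO mapping exists, and return nothing for the seven custom codes. The custom code set is taken from the list above; `ISO3_TO_ISO2` is a three-entry stand-in for a full lookup, and `iso2_for_gadm0` is a hypothetical name.

```python
# The seven GADM-specific codes listed above; they have no ISO equivalent.
GADM_CUSTOM_CODES = {"XAD", "XCA", "XCL", "XKO", "XNC", "XPI", "XSP"}

# Illustrative excerpt only; a real lookup would cover every ISO 3166-1 code.
ISO3_TO_ISO2 = {"JPN": "JP", "FRA": "FR", "DNK": "DK"}


def iso2_for_gadm0(code):
    """Return the two-letter country code for a GADM level 0 code, or None."""
    if code in GADM_CUSTOM_CODES:
        return None  # deliberately unmapped rather than inventing a code
    return ISO3_TO_ISO2.get(code)
```

With this approach `iso2_for_gadm0("XCL")` returns `None` rather than `FR`, so a consumer comparing against the Natural Earth country would have to treat a missing value as "not comparable" instead of as a mismatch.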

MattBlissett commented 4 years ago

We have mixed-up GADM results on records: http://api.gbif-uat.org/v1/occurrence/1249992702 (contains Midtjylland and Wellington).

This blocks deployment to production.
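
One way such mix-ups could be detected, sketched under the assumption that each level's id extends its parent's id by one dot-separated component (as in the Japan example earlier in this thread). This is not the fix that was applied, only an illustrative sanity check; `gadm_levels_consistent` is a hypothetical helper and the Danish and New Zealand ids in the usage lines are made up for illustration.

```python
def gadm_levels_consistent(ids):
    """Check that ids, ordered from level 0 downwards, form one hierarchy."""

    def stem(gadm_id):
        # Drop the assumed "_<version>" suffix, e.g. "JPN.26_1" -> "JPN.26".
        return gadm_id.split("_", 1)[0]

    stems = [stem(i) for i in ids]
    return all(
        child.startswith(parent + ".")
        and child.count(".") == parent.count(".") + 1
        for parent, child in zip(stems, stems[1:])
    )


ok = gadm_levels_consistent(["JPN", "JPN.26_1", "JPN.26.40_1"])
# Hypothetical ids standing in for a Danish level 0 with a New Zealand level 1.
mixed = gadm_levels_consistent(["DNK", "NZL.18_1"])
```

The first call passes and the second fails, since the level 1 id does not start with the level 0 code.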

muttcg commented 4 years ago

Deployed to PROD