IFRCGo / go-api

Admin2 Import #1492

Open geohacker opened 2 years ago

geohacker commented 2 years ago

After importing admin1 data and building a workflow to update geometries, attributes, and Mapbox vector tilesets, we will now look at doing much the same for admin2. https://github.com/IFRCGo/go-api/issues/470

Workflow

The workflow will remain largely the same. We'll write management commands that can read a shapefile to create new admin2 geometries or update existing ones based on a new dataset. This ensures there's an easy way to fix incorrect or disputed geoms. The geometries will be stored in a separate table (similar to admin1) and not in the districts table, so they won't impact the performance of existing GO API endpoints.

We'll also add a query param to the API for fetching geometries, and write a script to update Mapbox tiles when needed.

Base data source

We think the admin boundaries from FEWS are a good baseline. FEWS combines the best of FAO-GAUL, GADM, and the Humanitarian Data Exchange (HDX), which uses the UN OCHA datasets. FEWS also incorporates standard names from the GEONet Names database.

It's not perfect, but with a workflow that makes updates easy, we should be able to fix issues as they are reported. We have had good experience using FEWS in a few different projects.

@tovari In our dev catchup call the other day, you mentioned a few cases where FEWS wasn't reliable. Do you mind outlining them here? We can probably catch these early on and look for alternatives for those countries.

cc @batpad @LukeCaley @frozenhelium @szabozoltan69

tovari commented 2 years ago

The issue with FEWS is that it doesn't contain local names. E.g. in the case of Ukraine no Cyrillic names are available, only the English transliterations. The other problem is the coverage. It has good coverage in Africa, but not on other continents. A list of admin2 layers per country is shared here

geohacker commented 2 years ago

At the GO Sprint in Kathmandu we decided to go ahead with the OCHA CODs for admin2 that are published on geoboundaries.org. Relying on CODs will allow us to import progressively, without drastic, sudden changes to the data. We decided to start with an inspection of the data and how it lines up with the existing admin0 and admin1 data in GO. We also decided to consider importing countries in the Caribbean to start with.

I started looking at OCHA COD admin2 data for importing to GO. Here are some findings for Haiti and Kosovo:

geohacker commented 2 years ago

The issues illustrated above are not particularly surprising, but they are something we needed to look at with good examples. I think we should work towards getting reliable admin2 boundaries into the database without removing the admin1 and admin0 data that came from ICRC. Some thoughts:

  1. Changing the ICRC admin0 and admin1 data means we'll have to start from scratch in terms of the boundaries, disputed and overseas territories managed in GO right now
  2. This will have an impact on existing field reports as we change the admin1 dataset. We may have to manually remap or deprecate some of the old admin1s
  3. From what I can see, CODs aren't available consistently for all countries (see stats here https://cod.unocha.org/). This means we'll be mixing admin1s and admin2s, which will lead to a lot of issues similar to Kosovo above and will be misleading.

For the GO API and Risk Module use cases, I think we can do the following:

  1. Stick to admin2 from OCHA CODs
  2. Import admin2s country by country without replacing existing admin1s
  3. Ensure there's a proper mapping from admin2s to admin1s. This could be done manually through workflows using QGIS or tools used around DEEP
  4. Support admin2 map based selection tool on GO
  5. Visualize admin2 and admin1 in a mutually exclusive way in the style. This will still have some edge cases but will be largely ok
  6. Add a disclaimer about mixed data sources and documentation users can read to understand why this is the case.

cc @batpad @tovari @LukeCaley @justinginnetti

geohacker commented 2 years ago

Thanks for the productive discussion today @tovari @LukeCaley @justinginnetti @batpad. We are in agreement to move forward with the above approach: we won't replace all admin1s, only those where it's absolutely necessary due to reasons like:

In terms of next steps:

Over the next couple days, I'll update this ticket with progress.

geohacker commented 2 years ago

I'm continuing this work in PR #1557.

Haiti

image

This lines up pretty well with the admin1 data that's already in GO, so we don't need to replace it.

Now to get the admin1 ID from GO into the Haiti admin2 shapefile, this is my workflow:

image

geohacker commented 2 years ago

Colombia

There are CODs available for Colombia. This is the workflow I used. The goal is to have an admin2 shapefile for Colombia that has the following attributes: shapeName, pcode, and admin1_id (which needs to be derived, as above, from the GO admin1 data).

Inspect the admin2 and admin1 data against existing GO admin1

image image

The territories all look good. There are some minor issues, likely due to different geometry simplifications, so we don't need to change the admin1 data.

Match admin1 id to admin2 COD

Create centroids

Centroids won't work well for this matching due to geometries like the one below.

image

For this admin2 polygon, the centroid is actually outside the geometry. One could use geometric center instead of centroid but it might be better to prepare random points inside the geometry for the matching.
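To illustrate the point about centroids, here is a minimal, standard-library-only sketch. It is not the QGIS workflow itself: the C-shaped polygon and the function names are made up for illustration. It computes the area-weighted centroid of a concave polygon, shows that it falls outside the shape, and then picks a random point inside by rejection sampling, which is roughly the idea behind generating random points inside polygons.

```python
import random

# Hypothetical C-shaped polygon standing in for a concave admin2 geometry.
C_SHAPE = [(0, 0), (4, 0), (4, 1), (1, 1), (1, 3), (4, 3), (4, 4), (0, 4)]


def edges(ring):
    """Return consecutive vertex pairs, closing the ring back to the start."""
    return zip(ring, ring[1:] + ring[:1])


def centroid(ring):
    """Area-weighted centroid of a simple polygon (shoelace formula)."""
    area2 = cx = cy = 0.0
    for (x0, y0), (x1, y1) in edges(ring):
        cross = x0 * y1 - x1 * y0
        area2 += cross
        cx += (x0 + x1) * cross
        cy += (y0 + y1) * cross
    return cx / (3 * area2), cy / (3 * area2)


def contains(ring, point):
    """Ray-casting point-in-polygon test."""
    x, y = point
    inside = False
    for (x0, y0), (x1, y1) in edges(ring):
        if (y0 > y) != (y1 > y):  # edge straddles the horizontal ray
            x_cross = x0 + (y - y0) * (x1 - x0) / (y1 - y0)
            if x < x_cross:
                inside = not inside
    return inside


def random_point_inside(ring, rng=random, max_tries=10_000):
    """Sample the bounding box until a point lands inside the polygon."""
    xs = [x for x, _ in ring]
    ys = [y for _, y in ring]
    for _ in range(max_tries):
        candidate = (rng.uniform(min(xs), max(xs)), rng.uniform(min(ys), max(ys)))
        if contains(ring, candidate):
            return candidate
    raise ValueError("no interior point found; is the polygon degenerate?")


center = centroid(C_SHAPE)
print(center, contains(C_SHAPE, center))  # the centroid sits in the notch
print(contains(C_SHAPE, random_point_inside(C_SHAPE)))
```

A point that is guaranteed to be inside the polygon is what makes the later spatial join against admin1 meaningful; the centroid gives no such guarantee for concave shapes.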

Create random points inside polygon

image

Set number of points as 1 in the dialog and create a new temporary layer.

Join the random points layer with admin1 layer to add district_id

Follow the steps outlined previously, using the Join Attributes by Location option. In the new joined random points layer, inspect the attribute table.

image

Check if there are any NULL values by clicking the district_id column to sort it. In this case we can see there are two NULLs, meaning that for two admin2s we couldn't find an admin1 match. To inspect why that is, select the row and then click on 'Zoom map to selected rows'.

image

image

Now we can see that the point couldn't get a match because it sits outside the admin1 boundary, due to the minor geometry issue. In this case, it's easier to look up the admin1 geom and then edit the id column manually.

image

The ID is 642. To update, follow the steps below.

Now join this random points layer with the admin2 polygon layer using the Join Attributes by Location tool. In the end, it's important to make sure all the joined layers have the same feature count.

image
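The two checks in this workflow (looking for NULL district_id values, and confirming that all joined layers have the same feature count) can also be expressed in a few lines of Python. This is only a sketch on plain dictionaries: the function names, the pcodes, and the rows are hypothetical, and in practice the attribute table would be read from the exported layer.

```python
def unmatched(records, key="district_id"):
    """Return the records whose join produced no admin1 match (NULL in QGIS)."""
    return [r for r in records if r.get(key) is None]


def check_feature_counts(**layers):
    """Raise if the named layers do not all have the same number of features."""
    counts = {name: len(features) for name, features in layers.items()}
    if len(set(counts.values())) > 1:
        raise ValueError(f"feature counts differ: {counts}")
    return counts


# Made-up attribute rows for the joined random points layer.
points = [
    {"pcode": "XX01001", "district_id": 640},
    {"pcode": "XX01002", "district_id": None},  # no admin1 match found
]
admin2_polygons = [{"pcode": "XX01001"}, {"pcode": "XX01002"}]

missing = unmatched(points)
print([r["pcode"] for r in missing])

# Manual fix after looking up the admin1 geom, as done in QGIS above.
for row in missing:
    row["district_id"] = 642

print(unmatched(points))
print(check_feature_counts(points=points, admin2=admin2_polygons))
```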

Finally, rename district_id to admin1_id and save as shapefile.

image

geohacker commented 2 years ago

I thought I'd look at the Ukraine admin2s, which are getting a lot of movement on the HDX page https://data.humdata.org/dataset/cod-ab-ukr. The data I'm looking at was updated on October 11, 2022: ukr_adm_sspe_20221005.zip (SHP)

Looking at admin1 and admin2

image

All good. Some minor polygon simplification issues but we can stick to our existing admin1 data.

image

admin2 also looks good.

image

The column names are different so we have to make sure to rename.

I followed the same steps as above

  1. Create random points
  2. Join that with admin1 to add district_id to the random points layer, inspect
  3. Join the new joined random points layer to the admin2 polygon layer to add district_id to polygons, inspect
  4. Rename columns to name, code and admin1_id
  5. Export to shapefile
  6. 🎉

Checking an admin2 in the GO Admin

image

geohacker commented 2 years ago

Same workflow as above for Venezuela

image

tovari commented 2 years ago

@geohacker, would you mind listing the mandatory fields, with types, of the admin2 geo files? Should it be a shp, geojson, or something else?

geohacker commented 2 years ago

@tovari sure! Currently we support only shapefiles, with these mandatory fields: name (name of the admin2, or shapeName as in CODs), code (the pcode, or pcode as in CODs), and admin1_id (the admin1 ID from the GO database).
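As a rough illustration of that field contract, the sketch below renames the COD column names to the expected ones and rejects records that lack a mandatory field. It is not the actual import script: the `normalize` function, the constants, and the sample record are hypothetical and work on plain dictionaries rather than on a shapefile.

```python
MANDATORY_FIELDS = ("name", "code", "admin1_id")

# COD column names mapped to the field names expected by the import.
COD_ALIASES = {"shapeName": "name", "pcode": "code"}


def normalize(record):
    """Rename COD columns and make sure every mandatory field has a value."""
    out = {COD_ALIASES.get(key, key): value for key, value in record.items()}
    missing = [f for f in MANDATORY_FIELDS if out.get(f) in (None, "")]
    if missing:
        raise ValueError(f"missing mandatory fields: {missing}")
    return out


# Made-up attribute record as it might appear in a prepared COD shapefile.
print(normalize({"shapeName": "Example", "pcode": "XX01001", "admin1_id": 642}))
```

Running a check like this over every feature before import would surface renaming mistakes early, before anything is written to the database.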

tovari commented 2 years ago

Thanks @geohacker! What optional fields can be added? I'm thinking of e.g. local_name and LN_lang_code, alternate_name and AN_lang_code. I'm not sure if it makes sense to add an option for a local admin ID, and for population data.

geohacker commented 2 years ago

At the moment, we don't have any other fields (https://github.com/IFRCGo/go-api/blob/develop/api/models.py#L265-L272), but we can certainly add some to account for names in other languages. We should be consistent with how we handle languages for admin1 and regions, though, with columns called name, name_es, name_ar, name_fr, name_en.

I think we should not store population data in the admin2 table, because it probably needs to be updated more regularly. Ideally that data should live in a different table with a pcode mapping, so we don't have to touch the geometries when we need to update population data. And we should add it only if there's an immediate use case.
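A minimal sketch of what "a different table with a pcode mapping" could look like, using SQLite from the Python standard library. The table and column names are assumptions for illustration only, not the GO schema, and the population figures are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Hypothetical tables; the real admin2 table also holds the geometry.
    CREATE TABLE admin2 (
        id INTEGER PRIMARY KEY,
        name TEXT NOT NULL,
        code TEXT NOT NULL UNIQUE,   -- pcode
        admin1_id INTEGER NOT NULL
    );
    CREATE TABLE admin2_population (
        code TEXT NOT NULL REFERENCES admin2(code),
        year INTEGER NOT NULL,
        population INTEGER NOT NULL,
        PRIMARY KEY (code, year)
    );
    """
)

conn.execute(
    "INSERT INTO admin2 (name, code, admin1_id) VALUES (?, ?, ?)",
    ("Example", "XX01001", 642),
)

# Population updates only ever touch the population table.
for year, population in [(2022, 1000), (2023, 1100)]:
    conn.execute(
        "INSERT OR REPLACE INTO admin2_population VALUES (?, ?, ?)",
        ("XX01001", year, population),
    )

rows = conn.execute(
    """
    SELECT a.name, p.year, p.population
    FROM admin2 AS a
    JOIN admin2_population AS p ON p.code = a.code
    ORDER BY p.year
    """
).fetchall()
print(rows)
```

Keeping the join key on the pcode means a population refresh never has to rewrite the admin2 rows or their geometries.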

tovari commented 2 years ago

Ok, agree on not including the population data.

I think names in the local language matter at lower admin levels, as mostly there won't be en, es, fr, ar versions of the names. There might be transliterations to Latin from other alphabets, but I think we should still preserve the local names written in the local alphabet. An alternate name and alphabet might be relevant as well in multi-language countries. That way we will have the option to store 2 versions of the name in 2 languages. I assume name is the name transliterated to Latin in this case.

geohacker commented 2 years ago

@tovari ok makes sense! I've just added local_name, local_name_code, alternate_name and alternate_name_code as optional fields. The import script will also look for the presence of these columns in the shapefile and import accordingly.

Just to note that the OCHA COD shapefiles we are importing do not have local name fields, so currently all of them have only the default name field.

geohacker commented 2 years ago

The PR #1557 is now ready for review. So far, we have prepared and imported (locally):

Once the PR is merged, we can import these on staging to test. cc @batpad

geohacker commented 2 years ago

A workflow to import admin2 is now merged to develop. This also includes methods to create, update and publish Mapbox tilesets. At the moment, there's a sample Mapbox map style with some admin2s.

image image

The process is documented in the README

tovari commented 1 year ago

I did the admin1-2 matching a bit differently, to make sure we link admin2s to the correct admin1 even when there are significant deviations between OCHA and ICRC admin1s:

  1. Create random points inside the Admin1 OCHA polygons.
  2. Spatially join those points with admin1 to add district_id to the random points layer
  3. Join (1:many) the new joined random points layer to the admin2 polygon layer based on Pcodes to add district_id to polygons, inspect
  4. Rename columns to name, code and admin1_id

Check:

  1. Create random points inside admin2 polygons

  2. Spatially join those points with admin1 to add a second district_id to the random points

  3. Check whether this second district_id matches the district_id from the transfer process above. If they don't match, the admin2's inside point lies outside the admin1 that should cover it.

  4. List the admin2s with significant discrepancies and inspect the polygon borders

  5. Export admin2 to shapefile

The check method may not find all discrepancies, but it is likely to find them when a large part of the admin2 lies outside the ICRC admin1.

One sample of a detected discrepancy:

image

An admin1 update should follow in such cases.
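The comparison at the heart of this check can be sketched as follows. It assumes both joins have already been exported as simple pcode-to-district_id mappings; the function name and all values are hypothetical.

```python
def find_discrepancies(from_pcode_join, from_spatial_join):
    """List admin2 pcodes whose two district_id values disagree.

    Both arguments map an admin2 pcode to a district_id: the first comes
    from the Pcode-based join, the second from the random point inside the
    admin2 polygon joined spatially with admin1.
    """
    return sorted(
        pcode
        for pcode, district_id in from_pcode_join.items()
        if from_spatial_join.get(pcode) != district_id
    )


# Made-up example: the second admin2's inside point fell into another admin1,
# and the third one found no admin1 at all.
pcode_join = {"XX01001": 640, "XX01002": 640, "XX02001": 642}
spatial_join = {"XX01001": 640, "XX01002": 641, "XX02001": None}

print(find_discrepancies(pcode_join, spatial_join))
```

Because it relies on a single random point per admin2, a clean result here does not prove the boundaries agree; it only flags the cases where the sampled point happens to fall outside the expected admin1.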

cc: @geohacker, David, @jhenshall

nanometrenat commented 6 months ago

May 2024 - in "Ticket time" doc @davidmuchatiza advises this is still very much relevant and in progress