developmentseed / geospatial-ds-cholera-lab

A repo dedicated to developing a geospatial data science prototype (see issue: https://github.com/developmentseed/labs/issues/292)
10 stars 2 forks source link

Identify total number of district (admin2) level outbreaks in dataset #4

Closed kathrynberger closed 1 year ago

kathrynberger commented 1 year ago

According to the literature associated with this dataset there are a total of 692 outbreaks geocoded at the district (admin 2) level.

After successfully deriving district name from the location column, we used fuzzy matching to link the new district name to a georeferenced location we could map. However, there are still values where the fuzzy matching resulted in none values. (This work is demonstrated in the notebook on @chuckwondo's branch)

What we want to understand is what is the total number of district values remaining that we can use to develop our model? (i.e., it will be less than 692, but how much less?)

chuckwondo commented 1 year ago

I looked into the issue a bit more and discovered that there are a lot of unmatched admin2 entries.

At first pass, this appears to be due to many admin2 values actually being admin3 names, so of course fuzzy matching won't match the "bad" admin2 values since we need to be matching them against admin3 names, and then backtracking those matches to their actual admin2 names.

For example, here is the top of the CSV file that I wrote the matches to:

ISO3,admin2,NAME_2
BEN,ABOMEYCALAVI,Abomey-Calavi
BEN,AGUEGUES,Aguégués
BEN,BONOU,Bonou
BEN,DJOUGOU,DjougouRural
BEN,SAVALOU,Savalou
BEN,SOAVA,Sô-Ava
CIV,ABIDJANVILLE,Abidjan
CMR,BAFIA,
CMR,BIBEMI,
CMR,BIYEMASSI,
CMR,BOGO,Boyo
CMR,BONASSAMA,
CMR,BOURHA,
CMR,BUEA,
...

Some are matched, but many are not, so the NAME_2 column doesn't have a value.

I looked up the first row with a missing NAME_2 value (Bafia in Cameroon) and discovered that Bafia is a town/commune, which makes it an admin3 name, not an admin2 name. More specifically the hierarchy is

Therefore, for this particular line in the CSV, the value of NAME_2 should be Mbam-et-Inoubou.

I'm pretty confident that this is the biggest issue with being able to match names, so I'll focus on the logic for finding such matches.

chuckwondo commented 1 year ago

Thinking about this further, I'm wondering how many rows where spatial_scale is "admin2" are mislabeled, and should actually be marked as "admin3". Therefore, it might we worth exploring 2 avenues here:

  1. continue at the "admin2" level, but matching "bad" admin2 names to actual admin3 names and then obtain the correct admin2 names from the breadcrumbs in the geojson feature properties
  2. do the same matching as in the previous item, but place the matched values in the "admin3" column and use them in conjunction with the rows that are already labeled with a spatial_scale of "admin3".

Whichever one results in more rows of usable data would be the one we use.

botanical commented 1 year ago

Taking a look at some of the admin2 that have no matches such as:

CMR,BAFIA,
CMR,BIBEMI,
CMR,BIYEMASSI,
...
CMR,BONASSAMA,
CMR,BOURHA,
CMR,BUEA,

I've found that some are valid names of communes of Cameroon, for example BAFIA is Bafia, BIBEMI is Bibemi, BONASSAMA is Bonassama, etc. so it's puzzling as to why it's not matching.BIYEMASSI on the other hand, is Biyem-Assi which is more understandably hard to match with a special character.

chuckwondo commented 1 year ago

@botanical, that's what I described above, with this:

I looked up the first row with a missing NAME_2 value (Bafia in Cameroon) and discovered that Bafia is a town/commune, which makes it an admin3 name, not an admin2 name. More specifically the hierarchy is

  • Cameroon (adm0)
  • Centre (province, adm1)
  • Mbam-et-Inoubou (department, adm2)
  • Bafia (town/commune, adm3)

Therefore, for this particular line in the CSV, the value of NAME_2 should be Mbam-et-Inoubou.

TLDR; I believe we should simply increase our minimum fuzzy matching score from 70 to perhaps 75 or so (TBD by a bit of experimentation) to eliminate a number of bad matches, and the simply drop all rows where we are unable to find a match, and this should hopefully still provide us with roughly 500 rows of admin2-level outbreaks to work with.

To clarify, we are using fuzzy matching to find a match for "BAFIA" within a list of admin2 names, but since BAFIA (Bafia) is an admin3 name, there's no way for us to find a match for it within a list of "official" admin2 names (unless there just happens to be an admin2 name very similar to this admin3 name).

Therefore, in order to match BAFIA (as well as all of the other names in the admin2 column that are not admin2 names, but rather, admin3 names), we must fuzzy match them against a list of "official" admin3 names. This means that we have to also download admin3 GeoJSON files (in addition to the admin2 GeoJSON files we're already downloading).

Put another way, here is the value of the location column for Bafia in the original csv file:

afr::cmr::centre::bafiahealthdistrict

This erroneously has bafiahealthdistrict in the "admin2" position (note that I stripped the "healthdistrict" suffix from all such names in order to improve fuzzy matching), when it should be in the "admin3" position. Given the quoted bullet list above, the correct value of location for this row should be the following (stripping non-letters and lowercasing, as seems to be the pattern for this dataset):

afr::cmr::centre::mbametinoubou::bafiahealthdistrict

Our current attempt to match admin2 names from the data set to "official" admin2 names pulled from the admin2 GeoJSONs works poorly because there are so many rows in the dataset that are erroneous in the same manner as the particular row above because our current process is as follows:

  1. Select only rows where spatial_scale is "admin2" (under the assumption that the values in the location column are valid, other than casing and non-letters being stripped out -- which we now know to be a poor assumption)
  2. Split location column into separate ISO3 (country), admin1, and admin2 columns
  3. For all unique countries in the selected rows, download the admin2 GeoJSON files
  4. For each unique country-admin2 pair from our dataset, fuzzy match the admin2 name against the "official" admin2 names (NAME_2 properties) within the GeoJSON file for the corresponding country

Unfortunately, to reiterate, since many of the admin2 names from the dataset are actually admin3 names, we will of course be unable to match the admin3 name (masquerading as an admin2 name) against a list of admin2 names. This is why, for example, we don't get a match for Bafia (which is an admin3 name, not an admin2 name).

Further, even beyond this, it looks like we need to increase our current minimum score of 70 during fuzzy matching of admin2 names (against "official" admin2 names), because I see a number of bad matches, such as these, just to pull a few from the top of the "match" list:

ISO3,admin2,NAME_2
CMR,DJOUNGOLO,Moungo
CMR,KUMBA,BoumbaetNgoko
CMR,MAYOOULO,MayoLouti
CMR,MINDIF,Ndé
CMR,NGAOUNDEREURBAN,Ndé
CMR,NKOLNDONGO,Ndé
CMR,NYLON,NyongetKéllé
COD,ANGUMU,Kasangulu
COD,BANGABOLA,Banalia
COD,BARUMBU,Aru

At the moment, we have 358 unique country-admin2 pairs matched, including bad matches. When merged with the dataset, this yields 668 rows (out of a total of 692 rows where spatial_scale is "admin2"). I don't know how many bad matches will be eliminated by increasing the minimum fuzzy matching score, but hypothetically speaking, if as much as 25% of the matches are bad, that still leaves us with perhaps 269 unique country-admin2 pairs matched, which would still leave us with perhaps 500 rows of the original data to work with (assuming the same ratio of unique pairs to merged results).

I would recommend we steer clear of attempting to match the admin2 names that are actually admin3 names because that would require downloading admin3 GeoJSON files, and a number of countries simply don't have admin3 GeoJSONs available. For example, Zambia (ZMB) has no admin3 GeoJSON available at gadm.org (actually, geodata.ucdavis.edu), nor at geoboundaries.org.

To further add to this ordeal, https://geodata.ucdavis.edu/ is down, and has been down for at least a few days now. This is the site that gadm.org takes us to when looking for GeoJSONs there. Therefore, we may want to switch back to using geoboundaries.org for fetching our GeoJSON files.

kathrynberger commented 1 year ago

Fantastic work @chuckwondo and @botanical ! 🚀 You've really done some great detective work on this.

Just a few follow-up questions:

Agree also with your note on using the geoboundaries.org if the https://geodata.ucdavis.edu/ is down. Is there a benefit for using one over the other?

Lastly, I think this exploration deserves a shout out in the how repo . I think this would be a nice part of sharing our learnings. What do you all think?

chuckwondo commented 1 year ago

Just a few follow-up questions:

  • if I understand correctly, the 358 country-admin2 pairs are essentially the total number of unique county/districts?

Since geodata.ucdavis.edu is still offline, I can't confirm, but IIRC, the number of unique country-admin2 pairs is 484, but only 358 were matched to corrected admin2 names, and that figure includes bad matches.

  • so when matched to the outbreak cases csv file this yields 668 rows (692 - 668 = 24), suggesting only 24 outbreak events are lost due to the mismatching of admin2 names. (This is not bad at all, and in fact impressive!)

Technically, 24 were lost to un-matched names, not mis-matched names.

  • with the estimate above, it is known that there are a number of bad matches - the extent of which is unknown just yet - but hypothetically speaking could be as much as 25% leaving closer to 500 rows of outbreaks for our model development. (again, we'll keep our fingers crossed 25% is not the case - but it is what it is - that is still enough to work on our model development)

Agree also with your note on using the geoboundaries.org if the https://geodata.ucdavis.edu/ is down. Is there a benefit for using one over the other?

The benefit of using the ones from gadm is that each feature includes the location "breadcrumb," so you can see the admin hierarchy/path, whereas the ones from geoboundaries don't, so there's no easy way to see the admin hierarchy/path between admin0 (country) and whichever admin level the GeoJSON is for, such as admin2 or admin3.

However, we've concluded that we don't need these "breadcrumbs" for our purposes, so pulling GeoJSONs from geoboundaries shouldn't negatively impact us here.

Lastly, I think this exploration deserves a shout out in the how repo . I think this would be a nice part of sharing our learnings. What do you all think?

Yes! Once we nail down the final bits of this fuzzy matching business, I think it would be worth detailing the scenic route we took to geocode our regions.

chuckwondo commented 1 year ago

@botanical and I worked through some details today, and found that using a score cutoff of 70 during fuzzy matching produced far too many bad matches between the admin2 names in our dataset and "official" admin2 names in the GeoJSON files. As mentioned earlier, this is in large part due to many of the admin2 names being admin3 names.

Therefore, we experimented with increasing the score cutoff until we reached the point where we found no bad matches, which turns out to be 91. Using a score cutoff of 91, the following are the only matches with scores below 100, all of which are valid:

ISO3,admin2,NAME_2,score
MLI,DJENNE,Djenné,91
GIN,DUBREKA,Dubréka,92
ETH,OROMIYA,Oromia,92
COD,KATANA,Katanda,92
SSD,FASHODA,Fashooda,93
NGA,NASSARAWA,Nasarawa,94
GIN,FORECARIAH,Forécariah,95
NGA,ITASGADAU,Itas/Gadau,95
NGA,TAFAWABALEWA,Tafawa-Balewa,96
BEN,ABOMEYCALAVI,Abomey-Calavi,96
COD,MALEMBANKULU,Malemba-Nkulu,96
NGA,BIRNINMAGAJI/KIYAW,Birnin-Magaji/Kiyaw,97
NGA,IBADANSOUTHEAST,IbadanSouth-East,97

Unfortunately, this matches only 229 out of the 484 unique ISO3-admin2 pairs, leaving 255 unmatched pairs. Upon merging the matches with the original 692 rows of data where spatial_scale is "admin2", we're left with only 301 rows/outbreaks to work with.

Given that this is likely an insufficient number of outbreaks to work with, we've decided to attempt to match enough of the 255 unmatched names to result in our being able to use about 500 outbreaks. If the ratio of unique ISO3-admin pairs to outbreaks remains roughly the same, this means we'll need to match perhaps a little over 150 of the 255 unmatched names. Specifically, since 229 matches yielded 301 outbreaks, 381 matches should yield 500 outbreaks, and 381 - 229 = 152.

One option for doing this would be to obtain admin3 GeoJSONs and attempt to match the unmatched admin2 names against "official" admin3 names, since we know that some, if not all, of the unmatched admin2 names are actually admin3 names. However, not all countries have admin3 GeoJSONs available (neither at gadm nor geoboundaries), so this might not be possible, but might provide some help, if we need to resort to this approach.

Since attempting to fully manually match 150+ names would be very tedious and highly time-consuming, we'll attempt to use geopy's geocoding capability to help reduce the manual effort, roughly as follows:

  1. Create a geocoder: geocoder = geopy.geocoders.Nominatim(user_agent="cholera_lab")
  2. For each unmatched admin2:
    1. Use pycountry.countries.get(alpha_3="<ISO3>").alpha_2 to obtain the 2-letter country code via the 3-letter country code
    2. Find an address using geocoder.geocode(["<admin2>, <admin1>"], country_codes="<ISO2>").address, which should be a string of the form "..., <admin2>, <admin1>, <country>"
    3. Parse <admin2> out of the address string

With any luck, that will match a significant portion of the 255 unmatched admin2 names.

kathrynberger commented 1 year ago

@chuckwondo and @botanical great work and great perseverance on this. It really is turning out to be far more complex than anticipated!

From the paper where this data was sourced it provides the following breakdown of outbreaks by admin level (so that we can use it for our reference):

"From January 2010 through January 2020, we captured 999 cholera outbreaks with 484,450 suspected cholera cases from 744 unique sub-national regions across 25 sub-Saharan African countries in our database. This included 62 outbreaks in 50 unique first-level administrative units, 692 outbreaks in 492 unique second-level administrative units, and 245 outbreaks in 202 unique third-level administrative units (Figures 1 and S1, Table S1). Among them, only 128 sub-national regions reported at least one confirmed cholera case, and 876 sub-national regions reported cholera-associated deaths."

I don't imagine this to be a dealbreaker - but how might the boundary files you are using from geodata.ucdavis.edu differ from those available within the ICPAC Geoportal? The authors of the study used the ICPAC admin0 boundaries for their analysis, and admin1 and admin2 are also available. Lastly, admin3 is available for some countries on a national level basis.

Lastly, I've explored a bit of their R code for developing Figure 1 which can be found here. Which from a quick examination looks very similar to what you all have proposed above. I'd need to dig into it a bit further - since my R is a bit rusty, but want to understand if we are getting roughly the same results. And if not, why? Let me know your thoughts.

botanical commented 1 year ago

I don't imagine this to be a dealbreaker - but how might the boundary files you are using from geodata.ucdavis.edu differ from those available within the ICPAC Geoportal? The authors of the study used the ICPAC admin0 boundaries for their analysis, and admin1 and admin2 are also available. Lastly, admin3 is available for some countries on a national level basis.

Thanks for sharing, @kathrynberger! I wasn't aware of ICPAC Geoportal. From my cursory search, some of the results of ICPAC Geoportal seem to be extracted from the GADM database (geodata.ucdavis.edu) such as Ethiopia admin level 2.

Lastly, I've explored a bit of their R code for developing Figure 1 which can be found here. Which from a quick examination looks very similar to what you all have proposed above. I'd need to dig into it a bit further - since my R is a bit rusty, but want to understand if we are getting roughly the same results. And if not, why? Let me know your thoughts.

It's been a really long time since I've read R code 😅 It may take a couple days for me to decipher it.

@chuckwondo, I started working on finding the unmatched names using a geocoder as you suggested. However, I haven’t gotten to obtaining the geojsons yet. While developing the notebook to find the unmatched names, I realized I needed to refine the search by appending the country name to the query in order to avoid getting inaccurate matches for certain names such as Mora which yields multiple results.

I've created a CSV available here of the geocode matches and parsed admin2 values. This is all still a work in progress but wanted to keep everyone updated.

chuckwondo commented 1 year ago

@botanical, nice progress!

To help aid your matching, I believe your code can be refined in the following way, as I mentioned in my previous comment. Perhaps you missed these details (of course, the specific value of user_agent doesn't matter, so no need to change what you have for the first item):

  1. Create a geocoder: geocoder = geopy.geocoders.Nominatim(user_agent="cholera_lab")
  2. For each unmatched admin2:

    1. Use pycountry.countries.get(alpha_3="<ISO3>").alpha_2 to obtain the 2-letter country code via the 3-letter country code
    2. Find an address using geocoder.geocode(["<admin2>, <admin1>"], country_codes="<ISO2>").address, which should be a string of the form "..., <admin2>, <admin1>, <country>"
    3. Parse <admin2> out of the address string

The important differences between the above and what you've written:

For example, given our previous example of BAFIA, we have the following information from our original data, after splitting the location:

Therefore, our call to our geocoder should be this:

geocoder.geocode("BAFIA, centre", country_codes="CM")

@kathrynberger, Jennifer is spot in regarding the GeoJSONs.

Regarding the R code, I only took a very quick scan of the code (literally about 20 seconds), but it seems to me that they are pulling geometries from a shapefile, and thus are not dealing with pulling GeoJSONs, which would answer my question about how they geocoded when we can plainly see that their location information is woefully inadequate to attempt geocoding by matching their location names (admin names) to names in "official" GeoJSONs.

Are we needlessly spinning our wheels attempting to use GeoJSONs instead of just using their shapefile? Is the shapefile publicly available? (I haven't looked.)

kathrynberger commented 1 year ago

thanks @botanical and @chuckwondo for the updates on progress 🚀

quick answer to your last question @chuckwondo is that yes the shapefile that they used is publicly available - it is the ICPAC files that I provided in the links above - which as @botanical pointed out seem to be extracted from the GADM database.

Looking at the outbreaks file - I'm not sure how that changes things - as there's not lat/long coordinates in the outbreak file only country name and location (so would run into the same issue with the poorly constructed location column).

chuckwondo commented 1 year ago

@botanical, however, one thing I notice is that sometimes, adding the admin1 name to the address actually hurts. For example:

>>> g.geocode("mora, farnorth", country_codes="cm")
>>> g.geocode("mora, far north", country_codes="cm")
Location(Mora, Mayo-Sava, Extrême-Nord, Cameroun, (11.0484452, 14.1342835, 0.0))
>>> g.geocode("mora", country_codes="cm")
Location(Mora, Mayo-Sava, Extrême-Nord, Cameroun, (11.0484452, 14.1342835, 0.0))

Notice that the first call produces no result when farnorth is used, but when it is changed to far north, it works. Of course, we can't readily automate the insertion of a space (or spaces) when appropriate for the admin1 values, because all admin1 values in the dataset are smashed together (oddly -- I don't know why anybody would do that on purpose), so we don't know where they belong without manually looking (which we want to avoid).

However, the 3rd call, where I've included only the admin2 name, we get a result.

Perhaps the logic would be to attempt a call that includes the admin1 name (under the assumption that this should more often help, rather than hurt), and then, if that does not return a value, try again with the admin1 name excluded (i.e., only the admin2 name).

chuckwondo commented 1 year ago

@kathrynberger, I'm still puzzled then, regarding geocoding. How did they do their geocoding?

chuckwondo commented 1 year ago

@botanical, Kathryn and I had a quick huddle to sort out the geocoding puzzle by looking at the original repository.

In particular, we looked at the total shapefile and found geometries, but things are still a little questionable, so Kathryn would like us to explore parallel paths:

Hopefully, one of these routes will get us where we want to be.

chuckwondo commented 1 year ago

Here's what I just pushed up to my chuck/exploration branch:

If you want to play around with what I did:

This might obviate the need to continue down the geopy.geocode path, since all 692 admin2 outbreaks are now georeferenced (well, 669 after the groupby operation). If we don't care about mapping admin2 names to "official/standard" names, then we likely no longer need to continue with geopy.geocode, unless we want to somehow compare what we can pull from GeoJSON files to the geometries in the shapefile, but that feels like a tangent we can avoid.

kathrynberger commented 1 year ago

Thanks @chuckwondo for looking into this 🐰 🕳️ of work, I think it really has proven beneficial. 🚀 I've just reviewed your branch and agree - we can likely pause our work using geopy.geocode and assume the geospatial units provided are correct.

FWIW - I don't like the lack of transparency/source with the authors' repo of where those original .shp files were derived by the source - but that is not the focus of this work either. I think it has been a helpful exercise in the investigations both you @chuckwondo and @botanical have worked on to geocode the location names, as this would be the standard approach for doing so (and how we'd want to build the dataset if we were doing so from the start. 💫

For all intents and purposes, I think we can shelve the geocoding process for now and continue on to next steps (including retrieving, pre-processing and building the environmental data frames (https://github.com/developmentseed/geospatial-ds-cholera-lab/issues/5, https://github.com/developmentseed/geospatial-ds-cholera-lab/issues/3). The geocoding work might be something that we go back later and investigate for comparison, but don't think we need to explore further now. 👍

botanical commented 1 year ago

For all intents and purposes, I think we can shelve the geocoding process for now and continue on to next steps (including retrieving, pre-processing and building the environmental data frames (https://github.com/developmentseed/geospatial-ds-cholera-lab/issues/5, https://github.com/developmentseed/geospatial-ds-cholera-lab/issues/3). The geocoding work might be something that we go back later and investigate for comparison, but don't think we need to explore further now. 👍

I don't want to speak for @chuckwondo but in our last pairing session, we did want to get your input on how important it was to map the official admin2 names and it sounds like the path Chuck has taken is sufficient and that the official/standard names aren't too important which aligns with what we were thinking 😄

Like you said, @kathrynberger, I think exploring how to retrieve the geometries ourselves was a worthwhile endeavor but I am excited to continue on to the next steps 🏃 I'll take a look at the issues you've mentioned 👀