Throw explicit errors if base maps are not covering everywhere we want to plot

caldwellst commented 6 months ago

So, in src-static/update_iso3_sf.R, we develop base maps stored in input/adm0/{iso3}.geojson. These are intended to be the base map for any maps produced in the system, such as for mapping of points done in map_points.R or map of the IPC classifications in admin areas.

However, for some of the basemaps, country territories have been removed to ensure the visuals are of a better size. For instance, the Galápagos Islands are not included for Ecuador, and we only have the contiguous United States. This is mostly for island and some coastal countries.

In most cases, this should not present issues. However, we should:

Develop a function that checks that for any spatial data, its bounding box fits within the bounding box of the adm0 base map. We might want to use buffers here, because slight misalignment of polygon boundaries should not flag an error, we can't be expected to perfectly align all boundaries.
Implement this function in map_points.R and any other indicator mapping functions that don't use map_points.R

Then, if we ever generate errors, we will have to develop custom basemaps for these situations that can then be reused in the future. For instance, if there is a tropical storm in Hawaii, we could develop a USA-H basemap that is used based on the spatial coordinates of the data. In this way, we don't have to deal with all eventualities, just deal with them as they arise, but confidently have the checks in place to know that we are not excluding points from our map.

zackarno commented 6 months ago

Hey @caldwellst - I created a new branch where I migrated the function I had previously written to check overlap of 2 geom bbox's. Here it is currently: assert_bbox_overlap

I'm trying to figure out where you want the function, 1. saved , 2. implemented.

On slack you said:

Set it as a validation/assetion with gg_adm0 I think.

But I don't think gg_adm0() is a thing 😄 . The most logical place i see for it's implementation would probably be map_points() ? However, the way the adm0 loading is wrapped in a geom_*() function makes it a bit clunky/inefficient? I think you will know what I mean, but let me know if not clear

caldwellst commented 6 months ago

Implement in geom_adm0() I meant. So if you try to call it to use the base plot, you just pass in the other spatial files that will be plotted as ..., and those all get checked against the sf_adm0. Feel free to pose alternative, but I think this makes most sense because we don't need multiline load and then plot the basemap in each plotting function, so validating at point of use in geom_adm0 makes most sense.

Note that your function doesn't check that one is contained within the other, but checks that they have no overlap. I think we want to check that the data we want to plot fits within the basemap bounding box, and if it isn't, throw an error?

zackarno commented 6 months ago

geom_adm0() implementation - gotcha -- yeah implementing in geom_adm0() seems like it would make sense.
contains vs overlap: ah yes, checking if adm0 bbox contains the other sf object makes more sense than whether or not they overlap ... but now I'm trying to remember why we were thinking of using bbox over a simple buffer on the admin 0
- There are multiple functional use-cases for this type of function and i'm trying to figure out what/what combination we are after.
  - In the Ecuador edge-case example, we'd be identifying perhaps some displacement in the galapagos and then have to customize the map to include the galapagos
  - Another issue could be that if points are outside of admin boundaries they could end up stretching the basemap aspect ratio/zoom
  - Another reason is to just find points that look funny because they are outside of the boundaries?

I suppose that in either of 3 use-cases i listed above it would be better to apply some buffer directly on the admin 0 because as you can see in the example below there is some risk in using the bbox. The blue polygon bbox still contains all the red points

library(sf)
#> Linking to GEOS 3.11.0, GDAL 3.5.3, PROJ 9.1.0; sf_use_s2() is TRUE
library(leaflet)
library(ggplot2)
library(gghdx)
library(dplyr)

gghdx()

pts_1 <- tribble(
  ~X,        ~Y,
  12.391938, 12.238959,
  12.50738, 12.034744,
  12.617324, 12.335638,
  12.441413, 12.464487,
  12.820722,  12.16374
  ) |>
  st_as_sf(coords = c("X","Y"),crs=4326)

poly_1 <- st_polygon(
  list(
    cbind(
      c(12.185373, 12.255774, 12.728886, 12.947403, 12.672345, 12.635662, 12.185373),
      c(12.529575, 11.939943, 11.967415, 12.21874, 12.435015, 12.558468, 12.529575)
    )
  )
) |> 
  st_sfc(crs=4326)

poly_2 <- st_polygon(
  list(
    cbind(
      c(12.240334, 12.26224, 12.913666, 12.240334),
      c(12.510266, 12.021816, 12.527339, 12.510266)
    )
  )
) |> 
  st_sfc(crs=4326)

ggplot()+
  geom_sf(
    data= poly_1,
    fill= "green",
    alpha=0.5
  )+
  geom_sf(
    data= poly_2,
    fill="blue",
    alpha=0.5
  )+
  geom_sf(
    data= pts_1 ,color="red"
  )+
  labs(
    title = "Poly 1 = Green , Poly 2 = Blue"
  )

^{Created on 2024-05-21 with reprex v2.0.2}

I guess the hard part is deciding what the buffer size should be? random brainstormy idea would be to buffer by x% (1-2 %? ) of the polygons longest axis length?

caldwellst commented 6 months ago

In the Ecuador edge-case example, we'd be identifying perhaps some displacement in the galapagos and then have to customize the map to include the galapagos

So this was the primary one I was looking for, because that is about something appearing outside the bounding box of the base map which will stretch the aspect ratio and likely indicate we have dropped some of the basemap we would want included in this case.

Another issue could be that if points are outside of admin boundaries they could end up stretching the basemap aspect ratio/zoom

So, this wouldn't be the case if they still lie within the bounding box I think. I didn't necessarily want to check for inclusion within the admin boundaries as maybe there are cases when things overlap or extend beyond for good reason? But we could do the check on that, and wouldn't be a problem. And if we in the future have maps that would be legitimate with points or shapes lying outside admin bounds, we can adjust that.

Another reason is to just find points that look funny because they are outside of the boundaries?

Yeah, so this could be a reason to go with admin boundary checks.

I guess the hard part is deciding what the buffer size should be? random brainstormy idea would be to buffer by x% (1-2 %? ) of the polygons longest axis length?

Honestly not sure here, interested in what you find out! Your suggestion sounds pretty reasonable. For me, the thing I know at the moment is that the admin boundaries for IPC will not match up with the base map, with some random overlaps, and likely some boundary points might leak over with some acceptable error. But we can implement something and if flags are generated that we disagree with, just adjust our buffer!

zackarno commented 6 months ago

@caldwellst - What do you think about this approach for buffering:

https://github.com/OCHA-DAP/hdx-signals/blob/24ca3ab8c89635d2cf233ad2d829647abaa4bde4/src/utils/assertions.R#L35-L74

The function above it which his just pseudo-code so far will test if any geom falls within the newly constructed buffered geom.

caldwellst commented 6 months ago

Looks good! Few questions:

is this 0.1% or 10%? Looks like 10%.
What's the speed like? Is it really slow for polygons with loads of edges?
Do we have to worry about st_cast() on multipolygons (or some other classes), where it might fail or throw errors?

zackarno commented 6 months ago

Started writing this next comment after seeing your questions, but I'll quickly answer those first (although maybe not relevant based on below)

yes - 0.1 would be 10 % in this construction
Speed is a bit slow! Bangladesh takes 1-3 seconds. Especially when lot of polygons
I thought the st_cast() was important as it splits up the multipolygons so that each one gets it's unique buffer. Imagine you have a massive country in 1 polygon with a small poly island. The buffer set by 1% of the entire area would be massive in size in comparison to the size of the island. Was also thinking in the case of somewhere like Philippines with thousands of small islands... if we don't cast it will be based on the bbox of the entire multipolygon... so I'm not sure its 100% necessary or not for our usecases.

But anyways I think i might have overcomplicated the issue. Let's not think about the st_cast() issue for the time being. It would be nice to just supply a numeric buffer like 10km and see if the points fall within that. This is much simpler, but could be fine for our use-case. The issue (i thought) with doing that is that we are not projected to meters, so the st_buffer() dist argument should be in degrees. Based on my interpretation of various SO posts and the function documentation this should be the case.

However, I was messing around with it and even when in WGS84 the distance argument appears to be working in meters.

In the example below I just took the "BGD" admin file from the blob and converted it directly to a pt (centroid) and then ran dput() to make a reprex ... did this to ensure the WKT/projection would be exactly the same, but simplify the problem with a point rather than poly.

# use exact feature from global monitoring repo 
# so that we know there are no minor changes to WKT/projection
box::use(readr)
box::use(dplyr)
box::use(purrr)
box::use(sf)
box::use(lf = leaflet)
box::use(lfe = leaflet.extras)

# original file
# bgd_adm0 <- readr$read_rds("bgd_adm0.rds")

pt <- structure(
  list(
    geometry = structure(list(structure(c(90.2877569589339, 
                                                     23.8275348365367), class = c("XY", "POINT", "sfg"))), n_empty = 0L, class = c("sfc_POINT", 
                                                                                                                                   "sfc"), precision = 0, bbox = structure(c(xmin = 90.2877569589339, 
                                                                                                                                                                             ymin = 23.8275348365367, xmax = 90.2877569589339, ymax = 23.8275348365367
                                                                                                                                   ), class = "bbox"), crs = structure(list(input = "WGS 84", wkt = "GEOGCRS[\"WGS 84\",\n    DATUM[\"World Geodetic System 1984\",\n        ELLIPSOID[\"WGS 84\",6378137,298.257223563,\n            LENGTHUNIT[\"metre\",1]]],\n    PRIMEM[\"Greenwich\",0,\n        ANGLEUNIT[\"degree\",0.0174532925199433]],\n    CS[ellipsoidal,2],\n        AXIS[\"geodetic latitude (Lat)\",north,\n            ORDER[1],\n            ANGLEUNIT[\"degree\",0.0174532925199433]],\n        AXIS[\"geodetic longitude (Lon)\",east,\n            ORDER[2],\n            ANGLEUNIT[\"degree\",0.0174532925199433]],\n    ID[\"EPSG\",4326]]"), class = "crs"))), row.names = 1L, sf_column = "geometry", agr = structure(integer(0), levels = c("constant", 
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      "aggregate", "identity"), class = "factor", names = character(0)), class = c("sf", 
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                   "data.frame"))

pt_buffered <- sf$st_buffer(
  pt, 
  dist= 5
)

# I manually drew a 5m line from pt using `{mapedit}`
# and recreated it here

line_5m <- 
  sf$st_linestring(
    cbind(
      c(90.28776,90.28771),
      c(23.82754 ,23.82754)
    ),
  ) |> 
  sf$st_sfc(crs=4326)

# map below shows the buffer is 5 meters
lf$leaflet() |>
  lf$addPolygons(data = pt_buffered) |>
  lf$addMarkers(data = pt) |>
  lf$addPolylines(data= line_5m)

The documentation seems to indicate that dist should be in degrees or units in st_crs(x)$units. I have never seen anything other than NULL returned from st_crs(x)$units... which is weird. I stumbled on st_crs(x)$units_gdal which returns "degree".... hmm...


sf$st_crs(pt)$units
#> NULL

sf$st_crs(pt)$units_gdal
#> [1] "degree"

If we look at the raw wkt string it seems like the it could be pulling the unit from LENGTHUNIT["metre",1]]] ... but this is not clear to me from any documentation or even when I opened up the internals of the st_buffer() function.

sf$st_crs(pt)
#> Coordinate Reference System:
#>   User input: WGS 84 
#>   wkt:
#> GEOGCRS["WGS 84",
#>     DATUM["World Geodetic System 1984",
#>         ELLIPSOID["WGS 84",6378137,298.257223563,
#>             LENGTHUNIT["metre",1]]],
#>     PRIMEM["Greenwich",0,
#>         ANGLEUNIT["degree",0.0174532925199433]],
#>     CS[ellipsoidal,2],
#>         AXIS["geodetic latitude (Lat)",north,
#>             ORDER[1],
#>             ANGLEUNIT["degree",0.0174532925199433]],
#>         AXIS["geodetic longitude (Lon)",east,
#>             ORDER[2],
#>             ANGLEUNIT["degree",0.0174532925199433]],
#>     ID["EPSG",4326]]

^{Created on 2024-05-22 with reprex v2.0.2}

Anyways, if we are certain it is in meters and that therefore sf$st_within_distance() is also in meters ... could just basically simplify this function to wrap a simple assertion around sf$st_within_distance() which would be much faster and simpler

caldwellst commented 6 months ago

Yes, I had found the same thing previously! However, to be safe, should we do quick projection to buffer and then back to projected? This is the process I did previously in update_iso3_sf.R, and maybe why there is meter data in there!? I used Azimuthal equidistant projections and the speed different was quite minimal I think, and then we just do your idea for a 10km buffer.

zackarno commented 6 months ago

Yeah, i've noticed it before too, but was never totally sure until i just checked. Therefore, i've generally just projected to the local UTM zone to be sure.

This is the process I did previously in update_iso3_sf.R, and maybe why there is meter data in there!?

I don't think so because it's there if I just draw any point in EPSG:4326 with mapedit/leaflet. A bit cofusing.

do quick projection to buffer and then back to projected?

but to which projection? I did see you used the"Azimuthal equidistant" , but wasn't entirely sure why? does it ensure meters? happy to look into it also!

caldwellst commented 6 months ago

I think from any CRS you can get meters, it seems, but I used Azimuthal equidistant because it preserves distance from the centre of the projection to other points. Thus for buffering a country, if you set the projection centre to be the centroid of the country, then you can reliably set a buffer of 50 meters or some amount, and not have to hassle with finding the UTM zone for that country (or worry about countries that span numerous UTM zones).

However, in looking into this, I think I realised the issues you saw in #93, where countries like New Zealand having really simplified boundaries is because I forgot to set the centre of the projection to the centroid, so it defaults to 0,0! Countries on the opposite side of the sphere start to see significant distortions, which likely means my polygon simplification with a tolerance of 50 meters was probably simplifying way too much because of this!

So, if you could implement azimuthal reprojection with centre based on centroid of the file, then we could fix the polygon issues in update_iso3_sf.R and use it for this project. What do you think?

zackarno commented 6 months ago

yeah i see, interesting! Now I've found out for buffering type calcs you can always reliably get meters if you supply a units object rather than just a number.

So to be safe, the implementation on buffer would be:

sf::st_buffer(
  pt1,
  units::set_units(1, meter)
  )

I will see if this also has any affect on update_iso_sf.R and will look into the centering the azimuthal projection on centroid. What is weird is that i would think a dtolerance simplification of truly 50 meters would be a tiny, essentially unnoticeable simplification for any country-level poly.... but i'll get back to you after i play around

zackarno commented 6 months ago

Implement in geom_adm0() I meant. So if you try to call it to use the base plot, you just pass in the other spatial files that will be plotted as ..., and those all get checked against the sf_adm0.

okay so i've simplified and added it there, I'll open a PR as it will be easier to discuss code a few more of the specific code details there i think

zackarno commented 6 months ago

However, in looking into this, I think I realised the issues you saw in https://github.com/OCHA-DAP/hdx-signals/issues/93, where countries like New Zealand having really simplified boundaries is because I forgot to set the centre of the projection to the centroid, so it defaults to 0,0!

FYI - Re NZ boundary, i don't think this is the issue as if we load the NZ boundary directly from the un_geodata.geojson it looks the same. Luckily, NZ seems like one of the lower priority countries for humanitarian agencies 😄

Nonetheless the centre of projection issue could still be causing issues. I've never heard of setting a centre of projection for a global projection/crs though. Do you have a reference for this or an idea would work w/ st_transform()? Can you point me to whatever lead you to choose azimuthal equidistant?

caldwellst commented 6 months ago

I would completely drop it. You could also use a variety of options, but the reality is we don't need to do anything. The reason sf::st_buffer() works is because PostGIS ST_Buffer is automatically projecting and then converting back to WGS84 on the backend, see here.

It determines a planar spatial reference system that best fits the bounding box of the geography object (trying UTM, Lambert Azimuthal Equal Area (LAEA) North/South pole, and finally Mercator ). The buffer is computed in the planar space, and then transformed back to WGS84. This may not produce the desired behavior if the input object is much larger than a UTM zone or crosses the dateline

All we just need is an alternative data source to UN GeoData which is not so simplified.

zackarno commented 6 months ago

I would completely drop it. You could also use a variety of options, but the reality is we don't need to do anything. The reason sf::st_buffer() works is because PostGIS ST_Buffer is automatically projecting and then converting back to WGS84 on the backend, see here.

It determines a planar spatial reference system that best fits the bounding box of the geography object (trying UTM, Lambert Azimuthal Equal Area (LAEA) North/South pole, and finally Mercator ). The buffer is computed in the planar space, and then transformed back to WGS84. This may not produce the desired behavior if the input object is much larger than a UTM zone or crosses the dateline

All we just need is an alternative data source to UN GeoData which is not so simplified.

yeah i think it makes sense to drop. Re- another data-source what about naturalearth ? I've had pretty good experiences with it. Then there is the GAUL we've all used before. They have a 500m simplified version also. I can play around w/ a new branch for testing GAUL vs Naturalearth ? Any others?

caldwellst commented 6 months ago

Those are the two that come to my mind! I wouldn't consider too much where the data is coming from if they look fine, and just take the one that has better coverage overall?

zackarno commented 6 months ago

yeah

should we just replace UN Geo Data Hub data totally for all non FieldMaps.
Were those specific data sources that are neither FieldMaps or UN Geo Data just added because they were not in either or was there another reason? If we find them in the UN replacement Naturalearth/GAUL should we default to the replacement rather than pulling in an additional?

zackarno commented 6 months ago

I would completely drop it. You could also use a variety of options, but the reality is we don't need to do anything. The reason sf::st_buffer() works is because PostGIS ST_Buffer is automatically projecting and then converting back to WGS84 on the backend, see here.I would completely drop it. You could also use a variety of options, but the reality is we don't need to do anything. The reason sf::st_buffer() works is because PostGIS ST_Buffer is automatically projecting and then converting back to WGS84 on the backend, see here.

Update - Re dropping azimuthal projection - we can't drop it from the code as you used st_simplify() step where it is still having an impact on the dtolerance parameter. In wgs84 even if you use units::set_units(x, meters) the dtolerance param converts it to degrees. It confirms this with a warning and I checked it visually for Afghanistan. But yeah, we will continue to keep it out of the geometry checking assertion (it was only ever used in st_simplify())

caldwellst commented 6 months ago

Were those specific data sources that are neither FieldMaps or UN Geo Data just added because they were not in either or was there another reason? If we find them in the UN replacement Naturalearth/GAUL should we default to the replacement rather than pulling in an additional?

Simply because they were unavailable in either, exactly! Use the default replacement unless it isn't available.

Should we just replace UN Geo Data Hub data totally for all non FieldMaps.

Yep, definitely! Ideally we minimise map sources as much as possible.

zackarno commented 5 months ago

probably can close?

OCHA-DAP / hdx-signals

Throw explicit errors if base maps are not covering everywhere we want to plot #74