DOI-USGS / ncdfgeom

NetCDF-CF Geometry and Timeseries Tools for R: https://code.usgs.gov/water/ncdfgeom
https://doi-usgs.github.io/ncdfgeom/

Question about using `normalize = TRUE` in `calculate_area_intersection_weights` #102

Open lkoenig-usgs opened 2 weeks ago

lkoenig-usgs commented 2 weeks ago

I'm a little unsure about the usage of normalize = TRUE in the calculate_area_intersection_weights function and hoping you can help clarify. My use case is calculating the intersection weights between NHDPlusv2 catchments and NHGFv1.1 HRUs (working with Ellie W on this).

Specifically, I'm wondering about using normalize = TRUE to return the fractional area of a target polygon (e.g., NHGF) that is covered by the source (e.g., NHDv2).

# as a reprex, wrangle a pseudo NHGF target polygon using some NHD catchments
library(dplyr)

comids <- c(4648620, 4648714, 4648568, 4648542, 4648710)
starting_polygons <- nhdplusTools::get_nhdplus(comid = comids, realization = "catchment", t_srs = 5070)
target_areas <- starting_polygons |>
  mutate(id = 1) |>
  group_by(id) |>
  summarize(do_union = TRUE) 

# looking at a case where the source polygons don't completely overlap the target polygons, 
# like, say, along the U.S.-Canada border
source_areas <- filter(starting_polygons, featureid %in% c(4648710, 4648542, 4648620))

[image: map of the source catchments (red) partially overlapping the target polygon (blue)]

I assumed that for the target polygon (blue), the weights would only sum to 1 if it was fully covered by the source polygons (red), so I was surprised that the summed weights equal one in this case:

weights <- ncdfgeom::calculate_area_intersection_weights(
     x = select(source_areas, featureid, geometry), 
     y = select(target_areas, id, geometry), 
     normalize = TRUE
 )
# I was expecting something like 0.68. Am I misinterpreting something?
sum(weights$w)
#> 1
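
For reference, this is roughly the coverage fraction I expected the weights to sum to (a quick sf-only check using the objects above):

frac_covered <- as.numeric(sf::st_area(sf::st_intersection(
    sf::st_union(sf::st_geometry(source_areas)),
    sf::st_geometry(target_areas)
  ))) / as.numeric(sf::st_area(target_areas))
frac_covered
# roughly 0.68 here, i.e., the fraction of the target actually covered by the sources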

The crux of my question is, if normalize = TRUE, is the intended behavior that the intersected areas are divided by the total intersecting area (as seems to be the case here), or by the target area? I'd expect those two to be the same if the target areas were completely overlapping with the source areas, but not in cases like the example here.

If the latter, maybe the target area could be added as a field to y, and then the totalArea_y = unique(target_area)?

Thanks!

dblodgett-usgs commented 2 weeks ago

I'm rereading my documentation and realizing it could be much clearer. It's been a while since I looked at this, and even then it was super confusing... so bear with me.

From the examples:

# say we have data from `a` that we want sampled to `b`.
# this gives the percent of each `a` that intersects each `b`

(a_b <- calculate_area_intersection_weights(a, b, normalize = FALSE))

# note that `w` sums to 1 where `b` completely covers `a`. [that is where the source completely covers the target]

In your example, the weights for your source polygons sum to 1 per target polygon. To apply the weights, you need the area of each of your source polygons so you can take a weighted sum per target polygon.

So in this case, area is the area of your source polygons, and you would compute the following grouped by target polygon id.

sum( (val * w * area), na.rm = TRUE ) / sum(w * area)

The fact that there is no data over a portion of the target polygon is essentially ignored here and our area-weighted mean is only taking into account the source polygons that actually intersect the target.
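
In code, with your reprex objects, that pattern looks roughly like this (a sketch: I'm recomputing the weights with normalize = FALSE since the source areas enter explicitly, and `val` is a hypothetical stand-in for whatever data you're aggregating):

library(dplyr)

w_raw <- ncdfgeom::calculate_area_intersection_weights(
  x = select(source_areas, featureid, geometry),
  y = select(target_areas, id, geometry),
  normalize = FALSE
)

source_data <- source_areas |>
  mutate(val = c(10, 20, 30),                         # hypothetical data values
         area = as.numeric(sf::st_area(geometry))) |> # source polygon areas
  sf::st_drop_geometry() |>
  select(featureid, val, area)

w_raw |>
  left_join(source_data, by = "featureid") |>
  group_by(id) |> # grouped by target polygon id
  summarize(new_val = sum(val * w * area, na.rm = TRUE) / sum(w * area))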

Is this helping? I think it's that partial overlap that is confusing. I'm not going to claim to have great ways to explain this! Will leave this issue open to clean up the documentation. A PR would be MUCH appreciated!

lkoenig-usgs commented 2 weeks ago

Thanks, Dave! I definitely agree that the partial overlap was the point of confusion for me. I asked about the intended behavior because I saw some references to the target polygon area (like here and here) rather than the intersected area. I can think about ways this may have been clearer to me as a user.

On one hand, it makes sense to me that the summed weights would return 1, given the name of the argument (i.e., normalize = TRUE). On the other hand, I usually use these summed weights as a check (sum to 1? ok, good!), and this edge case was kind of silently passing through without consideration. I'm open to reconsidering that, of course, especially in light of your documentation, which, as you point out, states this assumption about complete overlap. But this also seems inconsistent with the behavior in gdptools, which I think uses the target area to compute the weights (again, I wouldn't expect any difference for the "usual" case of complete overlap, but this distinction would create differences in edge cases like this one).

dblodgett-usgs commented 1 week ago

Yeah... My sense of it is that I've never gotten (and consistently used) a good set of words for the arguments of the functions involved in these workflows.

"Source data polygons" and "target polygons" should probably be used throughout?

This normalized nuance is really just whether the weights have been normalized so you don't need to know the area of each source data polygon when calculating your area weights. The weights are a cached intermediate value and it's really just convention to save the weights in the normalized form or not. gdptools does it with normalized=TRUE and historically, I had been doing it with normalized=FALSE.
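
Roughly, the relationship between the two forms looks like this (a sketch with made-up numbers, not package code):

library(dplyr)

# toy normalize = FALSE weights for one target polygon and two source polygons
raw_weights <- data.frame(
  source_id   = c(1, 2),
  target_id   = c(1, 1),
  w           = c(0.5, 0.5),  # fraction of each source that intersects the target
  source_area = c(0.5, 1.0)   # area of each source polygon
)

# folding the source areas in and renormalizing per target gives the
# normalize = TRUE form of the weights
raw_weights |>
  group_by(target_id) |>
  mutate(w_normalized = (w * source_area) / sum(w * source_area)) |>
  ungroup()
# w_normalized comes out to 1/3 and 2/3 here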

My next step with this work is to rewrite geoknife's internals to either call the gdptools-based web service or use ncdfgeom to iterate over a dataset with weights as we build here; in that work, I'll be sure to clear all this up. With geoknife, it will be important that the cached weights are coherent and that we know what is what when using the cached values. I may just kill off the normalize = FALSE mode, to be honest; I'm not sure it's really adding any value at this point.

lkoenig-usgs commented 1 week ago

' # with normalize = TRUE, w will sum to 1 when the target

' # completely covers the source rather than when the source completely

' # covers the target.

Thanks for providing such extensive documentation, including examples 🙌 ! My confusion around the normalize = TRUE argument is that I can't imagine a case where the summed weights would not equal 1 given that the weight is computed using the summed intersected area.

For instance, in the first example from the docs, the output doesn't appear to me to follow the expectation described in the quote above. In other words, the weights sum to 1 even though the target (b) does not completely cover the source (a). Is this expected?

a_b <- calculate_area_intersection_weights(a, b, normalize = TRUE)
a_b
#> # A tibble: 4 × 3
#>     ida   idb     w
#>   <dbl> <dbl> <dbl>
#> 1     1     1 1    
#> 2     2     2 1    
#> 3     2     3 0.375
#> 4     3     3 0.625

dplyr::summarize(dplyr::group_by(a_b, idb), w = sum(w))
#> # A tibble: 3 × 2
#>     idb     w
#>   <dbl> <dbl>
#> 1     1     1
#> 2     2     1
#> 3     3     1

[image: plot of the first documentation example polygons, `a` and `b`]

If so, would the following edits to the documentation help make this behavior clear?

' normalize = TRUE, weights are divided by the ~target polygon area~ total intersected area per target polygon from all source areas, such that weights

' sum to 1 per TARGET polygon ~if the target polygon is fully covered by source polygons~ even if the target polygon is not fully covered by source polygons.

' # with normalize = TRUE, w will sum to 1 ~when the target

' # completely covers the source rather than when the source completely

' # covers the target~ even if the target does not completely cover the source.

I haven't looked under the hood at gdptools but based on a quick comparison, I'd think a small edit would be needed to match the normalize = TRUE behavior for cases where there is partial overlap:

    # for normalized, we return the percent of each target covered by each source
    y <- mutate(y, target_area = as.numeric(sf::st_area(geometry)))
    int <- areal::aw_intersect(y,
                                source = x,
                                areaVar = "area_intersection")

    # for normalized, we divide the intersection area by the total target area
    int <- ungroup(mutate(group_by(int, vary), totalArea_y = unique(target_area)))

I hope I'm not missing something obvious. I agree with you that when using these intermediate weights to compute an area-weighted mean, for example, this behavior shouldn't really be consequential, since w is in the numerator and the denominator. I mention it here because Ellie mentioned she and Mike are publishing the intermediate weights for some polygon-polygon intersections and are sometimes seeing different values from R and Python (I figured this behavior I previously noticed might be related to that).

dblodgett-usgs commented 1 week ago

Putting some time into this one. I found a typo above where I mixed up source and target; my brain is not good at these details. How is this looking? I'm going to keep going, but is this a good start?

library(ncdfgeom)
g <- list(rbind(c(-1,-1), c(1,-1), c(1,1), c(-1,1), c(-1,-1)))

a1 = sf::st_polygon(g) * 0.8
a2 = a1 + c(1, 2)
a3 = a1 + c(-1, 2)

b1 = sf::st_polygon(g)
b2 = b1 + 2
b3 = b1 + c(-0.2, 2)
b4 = b1 + c(2.2, 0)

a = sf::st_sfc(a1,a2,a3)

b = sf::st_sfc(b1, b2, b3, b4)

plot(c(a,b), border = NA)
plot(a, border = 'darkgreen', add = TRUE)
plot(b, border = 'red', add = TRUE)

a <- sf::st_sf(a, data.frame(ida = c(1, 2, 3)))
b <- sf::st_sf(b, data.frame(idb = c(7, 8, 9, 10)))

text(sapply(sf::st_geometry(a), \(x) mean(x[[1]][,1]) + 0.4),
     sapply(sf::st_geometry(a), \(x) mean(x[[1]][,2]) + 0.3),
     a$ida, col = "darkgreen")

text(sapply(sf::st_geometry(b), \(x) mean(x[[1]][,1]) + 0.4),
     sapply(sf::st_geometry(b), \(x) mean(x[[1]][,2])),
     b$idb, col = "red")


sf::st_agr(a) <- sf::st_agr(b) <- "constant"
sf::st_crs(b) <- sf::st_crs(a) <- sf::st_crs(5070)

calculate_area_intersection_weights(a, b, normalize = FALSE)
#> Loading required namespace: areal
#> # A tibble: 4 × 3
#>     ida   idb     w
#>   <dbl> <dbl> <dbl>
#> 1     1     7 1    
#> 2     2     8 0.5  
#> 3     2     9 0.375
#> 4     3     9 0.625

# NOTE: normalize = FALSE so weights sum to 1 per source polygon
# when source is fully within target.

calculate_area_intersection_weights(a, b, normalize = TRUE)
#> # A tibble: 4 × 3
#>     ida   idb     w
#>   <dbl> <dbl> <dbl>
#> 1     1     7 1    
#> 2     2     8 1    
#> 3     2     9 0.375
#> 4     3     9 0.625

# NOTE: normalize = TRUE so weights sum to 1 per target polygon. Non-overlap
# is ignored as if it does not exist.

calculate_area_intersection_weights(b, a, normalize = FALSE)
#> # A tibble: 5 × 3
#>     idb   ida     w
#>   <dbl> <dbl> <dbl>
#> 1     7     1  0.64
#> 2     8     2  0.32
#> 3     9     2  0.24
#> 4     9     3  0.4 
#> 5    10    NA NA

# NOTE: normalize = FALSE so weights never sum to 1 since no source is fully
# within target.

calculate_area_intersection_weights(b, a, normalize = TRUE)
#> # A tibble: 5 × 3
#>     idb   ida      w
#>   <dbl> <dbl>  <dbl>
#> 1     7     1  1    
#> 2     8     2  0.571
#> 3     9     2  0.429
#> 4     9     3  1    
#> 5    10    NA NA

# NOTE: normalize = TRUE so weights sum to 1 per target polygon. Non-overlap
# is ignored as if it does not exist.

Created on 2024-11-15 with reprex v2.1.1

dblodgett-usgs commented 1 week ago

The second example:

This is now also rendered in https://github.com/DOI-USGS/ncdfgeom/pull/103

@lkoenig-usgs, @elliewhite-usgs, @rmcd-mscb, your review and thoughts on how to help clarify this and make it easier to understand would be greatly appreciated!

# a more typical arrangement of polygons

library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
library(sf)
#> Linking to GEOS 3.12.1, GDAL 3.8.4, PROJ 9.3.1; sf_use_s2() is TRUE
library(ncdfgeom)

g <- list(rbind(c(-1,-1), c(1,-1), c(1,1), 
                c(-1,1), c(-1,-1)))

a1 = st_polygon(g) * 0.75 + c(-.25, -.25)
a2 = a1 + 1.5
a3 = a1 + c(0, 1.5)
a4 = a1 + c(1.5, 0)

b1 = st_polygon(g)
b2 = b1 + 2
b3 = b1 + c(0, 2)
b4 = b1 + c(2, 0)

a = st_sfc(a1,a2, a3, a4)
b = st_sfc(b1, b2, b3, b4)

b <- st_sf(b, data.frame(idb = c(1, 2, 3, 4)))
a <- st_sf(a, data.frame(ida = c(6, 7, 8, 9)))

plot(st_geometry(b), border = 'red', lwd = 3)
plot(st_geometry(a), border = 'darkgreen', lwd = 3, add = TRUE)

text(sapply(st_geometry(a), \(x) mean(x[[1]][,1]) + 0.4),
     sapply(st_geometry(a), \(x) mean(x[[1]][,2]) + 0.3),
     a$ida, col = "darkgreen")

text(sapply(st_geometry(b), \(x) mean(x[[1]][,1]) - 0.4),
     sapply(st_geometry(b), \(x) mean(x[[1]][,2]) - 0.5),
     b$idb, col = "red")


st_agr(a) <- st_agr(b) <- "constant"
st_crs(b) <- st_crs(a) <- st_crs(5070)

a$val <- c(1, 2, 3, 4)
a$a_areasqkm <- 1.5 ^ 2

plot(a["val"], reset = FALSE)
plot(st_geometry(b), border = 'red', lwd = 3, add = TRUE, reset = FALSE)
plot(st_geometry(a), border = 'darkgreen', lwd = 3, add = TRUE)

text(sapply(st_geometry(a), \(x) mean(x[[1]][,1]) + 0.4),
     sapply(st_geometry(a), \(x) mean(x[[1]][,2]) + 0.3),
     a$ida, col = "darkgreen")

text(sapply(st_geometry(b), \(x) mean(x[[1]][,1]) - 0.4),
     sapply(st_geometry(b), \(x) mean(x[[1]][,2]) - 0.5),
     b$idb, col = "red")


# say we have data from `a` that we want sampled to `b`.
# this gives the percent of each `a` that intersects each `b`

(a_b <- calculate_area_intersection_weights(
  select(a, ida), select(b, idb), normalize = FALSE))
#> Loading required namespace: areal
#> # A tibble: 9 × 3
#>     ida   idb     w
#>   <dbl> <dbl> <dbl>
#> 1     6     1 1    
#> 2     7     1 0.111
#> 3     8     1 0.333
#> 4     9     1 0.333
#> 5     7     2 0.444
#> 6     7     3 0.222
#> 7     8     3 0.667
#> 8     7     4 0.222
#> 9     9     4 0.667

# NOTE: `w` sums to 1 per `a` in all cases

summarize(group_by(a_b, ida), w = sum(w))
#> # A tibble: 4 × 2
#>     ida     w
#>   <dbl> <dbl>
#> 1     6     1
#> 2     7     1
#> 3     8     1
#> 4     9     1

# Since normalize is false, we apply weights like:
st_drop_geometry(a) |>
  left_join(a_b, by = "ida") |>
  mutate(a_areasqkm = 1.5 ^ 2) |> # add area of each polygon in `a`
  group_by(idb) |> # group so we get one row per `b`
  # now we calculate the value for each b with fraction of the area of each 
  # polygon in `a` per polygon in `b` with an equation like this:
  summarize(
    new_val = sum( (val * w * a_areasqkm), na.rm = TRUE ) / sum(w * a_areasqkm))
#> # A tibble: 4 × 2
#>     idb new_val
#>   <dbl>   <dbl>
#> 1     1    2   
#> 2     2    2   
#> 3     3    2.75
#> 4     4    3.5

# NOTE: `w` is the fraction of each polygon in `a` that intersects a polygon in `b`.
# We multiply `w` by the area of the polygon in `a` it is associated with to get
# each intersection's contribution to the area-weighted mean.

# we can go in reverse if we had data from b that we want sampled to a

(b_a <- calculate_area_intersection_weights(
  select(b, idb), select(a, ida), normalize = FALSE))
#> # A tibble: 9 × 3
#>     idb   ida      w
#>   <dbl> <dbl>  <dbl>
#> 1     1     6 0.562 
#> 2     1     7 0.0625
#> 3     2     7 0.25  
#> 4     3     7 0.125 
#> 5     4     7 0.125 
#> 6     1     8 0.188 
#> 7     3     8 0.375 
#> 8     1     9 0.188 
#> 9     4     9 0.375

# NOTE: `w` sums to 1 per `b` (source) only where `b` is fully covered by `a` (target).

summarize(group_by(b_a, idb), w = sum(w))
#> # A tibble: 4 × 2
#>     idb     w
#>   <dbl> <dbl>
#> 1     1  1   
#> 2     2  0.25
#> 3     3  0.5 
#> 4     4  0.5

# Now let's look at what happens if we set normalize = TRUE. Here we 
# get `a` as source and `b` as target but normalize the weights so
# the area of a is built into `w`.

(a_b <- calculate_area_intersection_weights(
  select(a, ida), select(b, idb), normalize = TRUE))
#> # A tibble: 9 × 3
#>     ida   idb      w
#>   <dbl> <dbl>  <dbl>
#> 1     6     1 0.562 
#> 2     7     1 0.0625
#> 3     8     1 0.188 
#> 4     9     1 0.188 
#> 5     7     2 1     
#> 6     7     3 0.25  
#> 7     8     3 0.75  
#> 8     7     4 0.25  
#> 9     9     4 0.75

# NOTE: if we summarize by `b` (target), `w` sums to 1, whereas above, with
#       normalize = FALSE, `w` summed to 1 per `a` (source).

summarize(group_by(a_b, idb), w = sum(w))
#> # A tibble: 4 × 2
#>     idb     w
#>   <dbl> <dbl>
#> 1     1     1
#> 2     2     1
#> 3     3     1
#> 4     4     1

# Since normalize is TRUE, we apply weights like:
st_drop_geometry(a) |>
  left_join(a_b, by = "ida") |>
  group_by(idb) |> # group so we get one row per `b`
  # now we weight by the fraction of each polygon in `b` contributed by each polygon in `a`
  summarize(new_val = sum( (val * w), na.rm = TRUE ))
#> # A tibble: 4 × 2
#>     idb new_val
#>   <dbl>   <dbl>
#> 1     1    2   
#> 2     2    2   
#> 3     3    2.75
#> 4     4    3.5

# NOTE: `w` is the fraction of the polygon from `a` overlapping the polygon from `b`. 
# The area of `a` is built into the weight, so we just sum the weight times the value per polygon.

Created on 2024-11-15 with reprex v2.1.1

dblodgett-usgs commented 1 week ago

Addressing your comment @lkoenig-usgs -- I think you are correct: with normalize = TRUE, the weights will always sum to 1 per target polygon by design. I think I fixed that typo in what I have in the comments above. This will be the new example documentation once we've had a chance to review.

rmcd-mscb commented 5 days ago

Fun - geometry brain teaser! @dblodgett-usgs and @lkoenig-usgs Here are some comments starting from the top.

In the a_b, normalize = FALSE case: the initial output of a_b makes sense, but when you summarize the group_by, I would think it should be summarize(group_by(a_b, idb), w = sum(w)) (note I'm using idb instead of ida), and only the weights for idb = 1 would sum to 1, reflecting the fact that a (green) does not completely cover b (red). This is important and should force the user to think about how they want to work with partially covered polygons, because in many cases your source does not completely overlay your target. Your values are correct, but I think you should summarize your weights differently to reflect the fact that a incompletely covers b ...IMHO...

As a general comment on the documentation, I would call a - green and b - red, and a_b - green_red, etc.

For b_a, again I think you are summarizing the weights incorrectly. You should be grouping by ida: you want all the weights of b that intersect a, so you group_by ida, and in this case, since b completely covers a, your weights should sum to 1 in all cases. Again, this helps the user understand that b completely covers a.

The last one, a_b with normalize = TRUE, is correct.

Summarizing: I think it's all correct except for how you are summarizing the weights. In all cases you should summarize the weights using the same pattern, grouping by the id of the target, or, in other words, by which values of the source are used to calculate the values of the target.

I think I'm correct here, and maybe it's just semantics, but I think looking at the weights consistently will help the user. If I'm off base (this is a bit of a brain teaser), I apologize :)

dblodgett-usgs commented 5 days ago

This detail of whether to summarize by the ids of the source data polygons or the ids of the target data polygons is important. Teasing out what it means that we are summarizing to one or the other is the key to understanding what's going on. (I'm not going to claim to fully understand!)

I'll stare at this for a bit and see if I can come up with a clear explanation and update the PR.

Trying some plain language:

Ultimately, we want values on the basis of the target polygons so grouping by their ids is what we need to do to get the answer. The weights are what allow us to add up the relative contributions of the value from each source-data polygon per target polygon. This distinction is independent of normalize = TRUE or FALSE.
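
A toy sketch of that point (made-up numbers; `w_raw` and `w_norm` are illustrative names for the normalize = FALSE and normalize = TRUE weights on the same intersections):

library(dplyr)

joined <- data.frame(
  target_id = c(1, 1),
  val       = c(1, 2),      # source data values
  area      = c(0.5, 1.0),  # source polygon areas
  w_raw     = c(0.5, 0.5),  # intersection area / source area
  w_norm    = c(1/3, 2/3)   # intersection area / total intersected area in the target
)

# either way, we group by the target id to roll values up to the target polygons
summarize(group_by(joined, target_id),
          v_raw  = sum(val * w_raw * area) / sum(w_raw * area),
          v_norm = sum(val * w_norm) / sum(w_norm))
# both come out to 1.6667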

I like your suggestion to name the example inputs by the color used to visualize them. I'll pick colors less prone to color deficiencies though.

Here's an updated WIP. I'll keep going. Honestly, I'm thinking I'll just drop the normalize = TRUE/FALSE detail here; it's just confusing things and not adding anything since the answers are the same.

library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
library(sf)
#> Linking to GEOS 3.12.1, GDAL 3.8.4, PROJ 9.3.1; sf_use_s2() is TRUE
library(ncdfgeom)

g <- list(rbind(c(-1,-1), c(1,-1), c(1,1), c(-1,1), c(-1,-1)))

blue1 = sf::st_polygon(g) * 0.8
blue2 = blue1 + c(1, 2)
blue3 = blue1 * 1.2 + c(-1, 2)

pink1 = sf::st_polygon(g)
pink2 = pink1 + 2
pink3 = pink1 + c(-0.2, 2)
pink4 = pink1 + c(2.2, 0)

blue = sf::st_sfc(blue1,blue2,blue3)

pink = sf::st_sfc(pink1, pink2, pink3, pink4)

plot(c(blue,pink), border = NA)
plot(blue, border = "#648fff", add = TRUE)
plot(pink, border = "#dc267f", add = TRUE)

blue <- sf::st_sf(blue, data.frame(idblue = c(1, 2, 3)))
pink <- sf::st_sf(pink, data.frame(idpink = c(7, 8, 9, 10)))

text(sapply(sf::st_geometry(blue), \(x) mean(x[[1]][,1]) + 0.4),
     sapply(sf::st_geometry(blue), \(x) mean(x[[1]][,2]) + 0.3),
     blue$idblue, col = "#648fff")

text(sapply(sf::st_geometry(pink), \(x) mean(x[[1]][,1]) + 0.4),
     sapply(sf::st_geometry(pink), \(x) mean(x[[1]][,2])),
     pink$idpink, col = "#dc267f")


sf::st_agr(blue) <- sf::st_agr(pink) <- "constant"
sf::st_crs(pink) <- sf::st_crs(blue) <- sf::st_crs(5070)

(blue_pink_norm_false <- 
calculate_area_intersection_weights(blue, pink, normalize = FALSE))
#> Loading required namespace: areal
#> # A tibble: 4 × 3
#>   idblue idpink     w
#>    <dbl>  <dbl> <dbl>
#> 1      1      7 1    
#> 2      2      8 0.5  
#> 3      2      9 0.375
#> 4      3      9 0.604

# NOTE: normalize = FALSE so weights sum to 1 per source polygon
# only when a source polygon is fully covered by the target. The
# non-intersecting portion is not included.

# the following breaks down how to use these weights for one target polygon.

blue$val = c(30, 10, 20)
blue$area <- as.numeric(sf::st_area(blue))

(result <- st_drop_geometry(blue) |>
  left_join(blue_pink_norm_false, by = "idblue"))
#>   idblue val   area idpink         w
#> 1      1  30 2.5600      7 1.0000000
#> 2      2  10 2.5600      8 0.5000000
#> 3      2  10 2.5600      9 0.3750000
#> 4      3  20 3.6864      9 0.6041667

# To calculate the value for idpink 9, we would do:

((10 * 0.375 * 2.56) + (20 * 0.604167 * 3.6864)) / ((0.375 * 2.56) + (0.604167 * 3.6864))
#> [1] 16.98795

# or in a simple form for a whole table:

(result <- result |>
  group_by(idpink) |> # group so we get one row per target
  # now we calculate the value for each `pink` with fraction of the area of each
  # polygon in `blue` per polygon in `pink` with an equation like this:
  summarize(
    new_val = sum( (val * w * area), na.rm = TRUE ) / sum(w * area)))
#> # A tibble: 3 × 2
#>   idpink new_val
#>    <dbl>   <dbl>
#> 1      7    30  
#> 2      8    10  
#> 3      9    17.0

blue <- select(blue, idblue)

(blue_pink_norm_true <-
calculate_area_intersection_weights(blue, pink, normalize = TRUE))
#> # A tibble: 4 × 3
#>   idblue idpink     w
#>    <dbl>  <dbl> <dbl>
#> 1      1      7 1    
#> 2      2      8 1    
#> 3      2      9 0.301
#> 4      3      9 0.699

# NOTE: normalize = TRUE so weights sum to 1 per target polygon. 
# Non-overlap is ignored as if it does not exist.

# the following breaks down how to use these weights for one target polygon.

blue$val = c(30, 10, 20)
blue$area <- as.numeric(sf::st_area(blue))

(result <- st_drop_geometry(blue) |>
    left_join(blue_pink_norm_true, by = "idblue"))
#>   idblue val   area idpink         w
#> 1      1  30 2.5600      7 1.0000000
#> 2      2  10 2.5600      8 1.0000000
#> 3      2  10 2.5600      9 0.3012048
#> 4      3  20 3.6864      9 0.6987952

((10 * 0.3012) + (20 * 0.6988)) / ((0.3012) + (0.6988))
#> [1] 16.988

(result <- result |>
    group_by(idpink) |> # group so we get one row per target
    # now we calculate the value for each `pink` with fraction of the area of each
    # polygon in `blue` per polygon in `pink` with an equation like this:
    summarize(
      new_val = sum( (val * w), na.rm = TRUE )))
#> # A tibble: 3 × 2
#>   idpink new_val
#>    <dbl>   <dbl>
#> 1      7    30  
#> 2      8    10  
#> 3      9    17.0

Created on 2024-11-18 with reprex v2.1.1

rmcd-mscb commented 4 days ago

I think @lkoenig-usgs has it cracked: https://github.com/DOI-USGS/ncdfgeom/issues/102#issuecomment-2472142104.

Here are the equations for Plain Areal Weights Interpolation:

V_T = Σ(V_Si * A_iT) / Σ(A_iT)

where:

V_T: interpolated value for the target polygon.
V_Si: attribute value for source polygon Si.
A_iT: area of intersection between source polygon Si and target polygon T.
Σ(A_iT): total intersected area between the source polygons and the target polygon.

And Normalized Areal Weights:

V_T = Σ(V_Si * (A_iT / A_T))

where:

V_T: interpolated value for the target polygon.
V_Si: attribute value for source polygon Si.
A_iT: area of intersection between source polygon Si and target polygon T.
A_T: total area of the target polygon.

In either case we are normalizing by the target area. If we can get this change made to ncdfgeom and re-compare with gdptools, we may still see some edge cases that are different. :)

The weights are either A_iT in the normalized = False case or A_iT / A_T in the normalized = True case.

mjcashman commented 4 days ago

I would like to add at least one use case where weights normalized to the target polygon area might not be ideal: when there is a large size disparity between source and target polygons.

I'm mainly working off NHD catchments and have predominantly been weighting NHD variables to target polygons of various sizes, using both areal and length (flowline-length-based) weighting. Yes, HUC12s, but especially large regions, like Level 3 Ecoregions or the Van Metre hydrologic regions used by the National IWAAs project. In these cases, NHD catchments are a tiny fraction of the area (or total length) contained in the larger polygon. Target-based weighting needs to retain a ton of precision to accurately represent these values, especially when rolling up to new interpolated target values. In contrast, source-based weighting is well represented by more basic decimals (usually 1; mean 0.993).

Using target-based weighting, most COMID weights have their first significant digit at the 10^-6 or 10^-7 level, with subsequent digits responsible for accuracy. Maybe we don't have an issue with crosswalks necessitating that high a degree of precision in the values, nor are we worried about file sizes, but I do worry about the potential for precision to be lost or long weight values to be truncated, which could have a very large effect on a final rescaling product.

dblodgett-usgs commented 4 days ago

Interesting nuance @mjcashman -- Precision of the cached values should be float64 IMHO. @rmcd-mscb -- let's work that into our arrangement to ensure that we capture the case where you have thousands of source polygons per target.

dblodgett-usgs commented 4 days ago

I just screen shared with @rmcd-mscb for a bit 🤗 and I think we have this figured out.

Handling target polygons with partial source polygon coverage is tricky for a few reasons. In the current ncdfgeom implementation of normalized weights, the weights always sum to 1 because I normalized to the intersecting portion of the target polygon area. In that form, you don't know if polygons have partial overlap.

In the current gdptools implementation, the weights are normalized to the total target polygon area so the weights don't always sum to 1. This means that you know whether a given target polygon has partial overlap or not.
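
That check looks something like this (a sketch with hypothetical column names; with target-area-normalized weights, a per-target sum below 1 flags partial coverage):

library(dplyr)

# toy gdptools-style weights table
weights <- data.frame(
  target_id = c(1, 1, 2),
  source_id = c(1, 2, 3),
  w         = c(0.25, 0.50, 1.00)
)

summarize(group_by(weights, target_id), coverage = sum(w))
# target 1 sums to 0.75 (partial coverage), target 2 sums to 1 (full coverage)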

This distinction comes out in the wash because when you apply the weights with data you do:

sum(val * w) / sum(w)

When the weights don't sum to 1, you divide by that less-than-one sum, which scales sum(val * w) back up to account for the missing NA * w part of the sum!

The method in ncdfgeom returns a weight that is based on ONLY the intersecting portion of the target polygon so the sum(w) term is just always 1.

We get the same answer in both cases, but the weights look different! oyoyoy. I have an updated vignette and documentation in the works. Rich will pick up what I'm doing and add some specifics to gdptools to try and avoid future confusion.
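
A toy illustration of why the two conventions agree (made-up numbers: two source polygons intersect 25% and 50% of a target, and 25% of the target is uncovered):

val <- c(1, 2)

# ncdfgeom-style: weights normalized to the intersected area only, so they sum to 1
w_ncdf <- c(0.25, 0.50) / 0.75
sum(val * w_ncdf) / sum(w_ncdf)   # 1.6667

# gdptools-style: weights normalized to the full target area, so they sum to 0.75
w_gdp <- c(0.25, 0.50) / 1
sum(val * w_gdp) / sum(w_gdp)     # also 1.6667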

lkoenig-usgs commented 4 days ago

@dblodgett-usgs, I agree that it's not likely to be consequential for the end result. I'm curious, then, what you think the biggest advantage of returning 1 in these cases is? I'm not sure if my experience is representative, but intuitively, I was expecting the gdptools behavior, so that I can infer cases where I am using partial coverage in the interpolation.

dblodgett-usgs commented 4 days ago

I am going to switch the behavior here to match gdptools. Given that it doesn't matter and there is additional information in the fact that weights don't sum to 1, I think it makes sense to go with the way it was done in gdptools. Stay tuned for some way cleaner docs! :)

elliewhite-usgs commented 4 days ago

On face value, I would think a_b with normalize = TRUE is the same as b_a with normalize = FALSE, but I remember testing this out with our polygons and it not working out. It might be better to rename normalize to normalize_on_target, or something else that describes what normalize means, because even with normalize = FALSE we are still normalizing, just on source areas. Right?

dblodgett-usgs commented 4 days ago

(I changed away from the a/b variable names and will talk about this as source / target here)

This function is always returning weights to translate from source to target.

When normalize = FALSE, the relative area of the source polygons is not factored into the weight, so you need the area of the source polygons to determine the relative contribution of each source polygon when there are multiple source polygons per target polygon. I don't think there's any normalization happening in this case; it's just the raw proportion of each source polygon that intersects each target polygon.
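
A tiny illustration of that raw proportion (a sketch with toy squares, not the package internals):

library(sf)

src <- st_as_sfc("POLYGON ((0 0, 2 0, 2 2, 0 2, 0 0))") # source polygon, area 4
tgt <- st_as_sfc("POLYGON ((1 0, 3 0, 3 2, 1 2, 1 0))") # target overlapping half of it

# the normalize = FALSE weight for this source/target pair
as.numeric(st_area(st_intersection(src, tgt))) / as.numeric(st_area(src))
# 0.5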

elliewhite-usgs commented 3 days ago

OK, working through the example helped. TY! I shrunk down source 3 and got essentially source 2's value for target 9; since it's the only box overlapping the target, that makes sense. Cool. Now, if we used gdptools' way of calculating the weight, dividing by the target geometry's area, the value for target 9 would be much smaller; it would essentially be treating the non-overlapping areas as zero-value areas. I don't think that's the type of behavior we want, unless there's a different way of calculating the values in gdptools that I am not aware of....

dblodgett-usgs commented 2 days ago

It doesn't treat that area as 0 because we normalize the weights that sum to less than 1 when calculating the weighted statistic.

e.g. the complete statistic is like:

(1 * .25 + 2 * .5 + NA * .25) / (.25 + .5 + .25)

Note that the NA in there for the non-overlap is undefined, so we drop that part:

(1 * .25 + 2 * .5) / (.25 + .5)

Since we aren't dividing by 1 but by .75, this ignores the NA and still gives you the average of the values that do intersect.