DOI-USGS / ncdfgeom

NetCDF-CF Geometry and Timeseries Tools for R: https://code.usgs.gov/water/ncdfgeom
https://doi-usgs.github.io/ncdfgeom/

Question about using `normalize = TRUE` in `calculate_area_intersection_weights` #102

Open lkoenig-usgs opened 5 days ago

lkoenig-usgs commented 5 days ago

I'm a little unsure about the usage of `normalize = TRUE` in the `calculate_area_intersection_weights` function and am hoping you can help clarify. My use case is calculating the intersection weights between NHDPlusv2 catchments and NHGFv1.1 HRUs (working with Ellie W on this).

Specifically, I'm wondering about using normalize = TRUE to return the fractional area of a target polygon (e.g., NHGF) that is covered by the source (e.g., NHDv2).

library(dplyr)  # provides mutate(), group_by(), summarize(), filter(), select()

# as a reprex, wrangle a pseudo NHGF target polygon using some NHD catchments
comids <- c(4648620, 4648714, 4648568, 4648542, 4648710)
starting_polygons <- nhdplusTools::get_nhdplus(comid = comids, realization = "catchment", t_srs = 5070)
target_areas <- starting_polygons |>
  mutate(id = 1) |>
  group_by(id) |>
  summarize(do_union = TRUE) 

# looking at a case where the source polygons don't completely overlap the target polygons, 
# like, say, along the U.S.-Canada border
source_areas <- filter(starting_polygons, featureid %in% c(4648710, 4648542, 4648620))

[Image: the target polygon (blue), only partly covered by the three source polygons (red)]

I assumed that for the target polygon (blue), the weights would only sum to 1 if it was fully covered by the source polygons (red), so I was surprised that the summed weights equal one in this case:

weights <- ncdfgeom::calculate_area_intersection_weights(
  x = select(source_areas, featureid, geometry),
  y = select(target_areas, id, geometry),
  normalize = TRUE
)
# I was expecting something like 0.68. Am I misinterpreting something?
sum(weights$w)
#> 1

The crux of my question is, if normalize = TRUE, is the intended behavior that the intersected areas are divided by the total intersecting area (as seems to be the case here), or by the target area? I'd expect those two to be the same if the target areas were completely overlapping with the source areas, but not in cases like the example here.
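
To make the two candidate denominators concrete, here is a minimal base R sketch with made-up numbers. The areas are hypothetical (chosen so the covered fraction is 0.68) and are not taken from the reprex, and nothing here calls ncdfgeom itself.

```r
# Hypothetical areas (arbitrary units) for one target polygon and for the
# pieces of three source polygons that fall inside it.
target_area <- 100
intersect_area <- c(a = 30, b = 20, c = 18)

# Option 1: divide each intersected area by the total intersecting area.
w_by_intersection <- intersect_area / sum(intersect_area)

# Option 2: divide each intersected area by the full target area.
w_by_target <- intersect_area / target_area

sum(w_by_intersection)  # sums to 1 regardless of how much of the target is covered
sum(w_by_target)        # equals the covered fraction of the target, here 0.68
```

The two options only agree when the intersected areas add up to the full target area.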

If the latter, maybe the target area could be added as a field to y, and then the totalArea_y = unique(target_area)?

Thanks!

dblodgett-usgs commented 4 days ago

I'm rereading my documentation and realizing it could be much clearer. It's been a bit since I looked at this and even then it was super confusing... so bear with me.

From the examples:

# say we have data from `a` that we want sampled to `b`.
# this gives the percent of each `a` that intersects each `b`

(a_b <- calculate_area_intersection_weights(a, b, normalize = FALSE))

# note that `w` sums to 1 where `b` completely covers `a`. [that is, where the target completely covers the source]

In your example, the weights from your source polygons sum to 1 per target polygon. To apply weights, you need the area of each of your source polygons for the weighted sum per target polygon.

So in this case, area is the area of your target polygons and you would do this grouped by source polygon id.

sum( (val * w * area), na.rm = TRUE ) / sum(w * area)

The fact that there is no data over a portion of the target polygon is essentially ignored here and our area-weighted mean is only taking into account the source polygons that actually intersect the target.
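
As a rough illustration of the formula above, the following base R sketch applies it to one partially covered target polygon. All numbers are hypothetical, and it assumes `w` is the fraction of each source polygon that falls inside the target and `area` is the full area of that source polygon.

```r
# Hypothetical inputs for one target polygon (not taken from the reprex).
val  <- c(10, 20, 30)     # value carried by each source polygon
area <- c(50, 40, 60)     # full area of each source polygon (assumption)
w    <- c(0.6, 0.5, 0.3)  # fraction of each source polygon inside the target

# w * area recovers the intersected area contributed by each source polygon.
w * area

# Area-weighted mean over the covered part of the target only; any part of
# the target with no source polygon simply does not enter the calculation.
area_weighted_mean <- sum(val * w * area, na.rm = TRUE) / sum(w * area)
area_weighted_mean
```

Because the denominator is the sum of the intersected areas rather than the target area, the result is the mean over the covered portion, which matches the "essentially ignored" behavior described above.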

Is this helping? I think it's that partial overlap that is confusing. I'm not going to claim to have great ways to explain this! Will leave this issue open to clean up the documentation. A PR would be MUCH appreciated!

lkoenig-usgs commented 4 days ago

Thanks, Dave! I definitely agree that the partial overlap was the point of confusion for me. I asked about the intended behavior because I saw some references to the target polygon area (like here and here) rather than the intersected area. I can think about ways this may have been clearer to me as a user.

On one hand, it makes sense to me that the summed weights would return 1 given the name of the argument (i.e., normalize = TRUE). On the other hand, I usually use these summed weights as a check (sum to 1? ok, good!), and this edge case was kind of silently passing through without consideration. I'm open to reconsidering that, of course, especially in light of your documentation, which as you point out, states this assumption about complete overlap. But this also seems inconsistent with the behavior in gdptools, which I think uses the target area to compute the weights (again, I wouldn't expect any difference for the "usual" case of complete overlap, but that this distinction would create differences in edge cases like this one).
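
One way to keep a sanity check that does not pass silently is to compare intersected area to target area directly instead of summing the weights. The sketch below uses base R with hypothetical ids and areas, and it assumes the source polygons do not overlap one another; in a real workflow the areas would come from the geometries.

```r
# Hypothetical intersection table: one row per source/target overlap.
overlaps <- data.frame(
  target_id      = c(1, 1, 1, 2, 2),
  intersect_area = c(30, 20, 18, 25, 25)
)
# Hypothetical full area of each target polygon, named by target id.
target_area <- c("1" = 100, "2" = 50)

# Fraction of each target polygon covered by any source polygon.
covered  <- tapply(overlaps$intersect_area, overlaps$target_id, sum)
coverage <- covered / target_area[names(covered)]

# Flag targets that are not fully covered, within a small numerical tolerance.
partial <- names(coverage)[coverage < 1 - 1e-6]
partial
```

Here target `1` is flagged because only part of it is covered, while target `2` passes.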

dblodgett-usgs commented 1 day ago

Yeah... My sense of it is that I've never gotten (and consistently used) a good set of words for the arguments of the functions involved in these workflows.

"Source data polygons" and "target polygons" should probably be used throughout?

This normalized nuance is really just whether the weights have been normalized so you don't need to know the area of each source data polygon when calculating your area weights. The weights are a cached intermediate value and it's really just convention to save the weights in the normalized form or not. gdptools does it with normalized=TRUE and historically, I had been doing it with normalized=FALSE.
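
The relationship between the two cached forms can be sketched in base R as follows. The numbers are hypothetical, and this illustrates the idea in the paragraph above rather than how ncdfgeom or gdptools actually compute or store their weights.

```r
# Hypothetical inputs for one target polygon.
val         <- c(10, 20, 30)     # value carried by each source polygon
source_area <- c(50, 40, 60)     # full area of each source polygon
w_raw       <- c(0.6, 0.5, 0.3)  # un-normalized: fraction of each source in the target

# Normalize once, folding the source areas into the weights.
w_norm <- (w_raw * source_area) / sum(w_raw * source_area)

# Un-normalized weights need the source areas at application time...
mean_raw <- sum(val * w_raw * source_area) / sum(w_raw * source_area)

# ...while normalized weights do not.
mean_norm <- sum(val * w_norm) / sum(w_norm)

all.equal(mean_raw, mean_norm)
```

Dividing by `sum(w_norm)` also means the mean is unchanged if the normalized weights were scaled by the target area instead and summed to less than 1.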

My next step with this work is to rewrite geoknife's internals to either call the gdptools-based web service or use ncdfgeom to iterate over a dataset with weights as we build here -- in that work, I'll be sure to clear all this up. With geoknife, it will be important that the cached weights are coherent and that we know what is what when using the cached values. I may just kill off the `normalize = FALSE` mode to be honest; I'm not sure it's really adding any value at this point.