d3 / d3-contour

Compute contour polygons using marching squares.
https://d3js.org/d3-contour
ISC License

Support a mean aggregation strategy #78

Closed keller-mark closed 6 months ago

keller-mark commented 6 months ago

Currently, the implementation always takes the sum. However, we might want to use a different aggregation strategy, in particular the mean. This is possible with the deck.gl ContourLayer, for example: https://deck.gl/docs/api-reference/aggregation-layers/contour-layer#aggregation

This PR adds a .aggregation('MEAN') option to specify this as an aggregation strategy.
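For illustration, usage of the option proposed here might look like the following; the accessors are the existing contourDensity API, and only `.aggregation()` is new in this PR.

```js
// Sketch of the proposed option; the final API may differ.
const density = d3.contourDensity()
    .x(d => d.x)
    .y(d => d.y)
    .weight(d => d.weight)
    .aggregation("MEAN"); // average overlapping weights instead of summing them
```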

keller-mark commented 6 months ago

> I don't understand the use case this addresses. It would be helpful to have a concrete example.

I am using d3-contour to obtain the contour thresholds only. I am not using the geometries.

Therefore, I need the contour threshold values to be meaningful relative to the per-data-point weights. With sum-based aggregation, the contour threshold values cannot be compared to the per-data-point weights, because the thresholds depend on the number of data points in a local region.

[screenshot of the chart]

> Shouldn't there be a value independent from the weights?

I am not sure I follow. Are you saying there needs to be support for a separate accessor to be passed, like .value()?

contourDensity()
      .x(d => d.x)
      .y(d => d.y)
      .weight(d => d.weight)
      .value(d => d.value)
Fil commented 6 months ago

(Conceptually, to me, weights should be "multipliers", not the same thing as "values". However, it seems that the formulas end up being the same for both "sum" and "mean", so maybe I should just accept that it isn't an issue if the term is "wrong", as long as the math works out.)

As I understand it, the algorithm in deck.gl is only binning — that is, points are grouped into square bins and then reduced (by count, sum, mean, min or max). You can do this more or less efficiently in D3 with something like https://observablehq.com/@fil/n-dimensions-binning
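For example, a rough mean-per-bin reduction along those lines (not deck.gl's actual code) could be written with d3.rollup; `points`, `cellSize` and the field names are placeholders:

```js
import {rollup, mean} from "d3-array";

// Group points into square bins of side `cellSize`, then reduce each bin
// to the mean of its values.
function binnedMeans(points, cellSize) {
  return rollup(
    points,
    bin => mean(bin, d => d.value),    // reduce: mean value per bin
    d => Math.floor(d.x / cellSize),   // bin column
    d => Math.floor(d.y / cellSize)    // bin row
  );
}
```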

What D3’s contourDensity method does is a bit similar to binning, but with a more efficient code path that doesn't actually compute bins to reduce them later; it just increments a counter on a grid. Additionally, the counter is fractional, to apply bilinear interpolation.
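A minimal sketch of that idea (not d3-contour's actual source): each point adds its weight to the four grid cells around it, in proportion to its fractional position between them; bounds checks are omitted for brevity.

```js
// Points are assumed to already be in grid coordinates; weight defaults to 1.
function accumulate(points, width, height) {
  const grid = new Float64Array(width * height);
  for (const {x, y, weight = 1} of points) {
    const x0 = Math.floor(x), y0 = Math.floor(y);
    const dx = x - x0, dy = y - y0;
    grid[y0 * width + x0]           += weight * (1 - dx) * (1 - dy);
    grid[y0 * width + x0 + 1]       += weight * dx * (1 - dy);
    grid[(y0 + 1) * width + x0]     += weight * (1 - dx) * dy;
    grid[(y0 + 1) * width + x0 + 1] += weight * dx * dy;
  }
  return grid;
}
```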

A second step diffuses the resulting grid with a "bandwidth", using blur2 under the hood (itself an optimized version of the density function's older code). The extent of the resulting values on the grid is then used to determine the thresholds.
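Continuing the sketch above with d3-array's blur2 (the exact threshold rule used by d3-contour differs a little; this only illustrates the shape of those two steps):

```js
import {blur2, max, ticks} from "d3-array";

// `grid` and `width` come from the accumulation sketch above.
const bandwidth = 2;                         // blur radius in grid cells (illustrative)
blur2({data: grid, width}, bandwidth);       // diffuses the grid in place
const thresholds = ticks(0, max(grid), 20);  // ~20 evenly spaced threshold values
```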

Finally, for each threshold a contour is computed with the marching squares algorithm. (But you're saying that you don't need that part, just the thresholds?)
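For reference, if only the thresholds are needed, they can be read straight off the computed contours: each GeoJSON MultiPolygon returned by contourDensity carries its threshold in its value property.

```js
const density = d3.contourDensity()
    .x(d => d.x)
    .y(d => d.y)
    .size([width, height]); // width and height of the layout, in pixels

const contours = density(data);                // array of GeoJSON MultiPolygons
const thresholds = contours.map(c => c.value); // the threshold of each contour
```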


Your question sent me on a tangent. I've tried to replicate the chart you show above (sort of) by making a map of the 191,842 buildings/addresses in the French département of Loir-et-Cher, coloring them by their street number (on a log10 scale). The map shows dense areas (in cities and villages, and on both banks of the rivers) where these numbers are in the 10s, and a large area in the southeast where numbers are in the 100s and sometimes 1000s: these are mostly forest areas, with long uninterrupted roads, where the numbering system looks quite different. (This was a new insight to me! Fun!)

[map0: buildings colored by street number, log10 scale]

Now, if I want to compute the extent of the values on that map, I don't need to place points and do local averages… I can just work on the values (C) and ignore the locations. Averaging locally might tone down a few outliers and shrink this extent a little, but I can mimic that by taking a percentile. In my case, the 95th percentile means 2.54 instead of 3.95. If I want about 20 thresholds, this gives me `d3.ticks(0, d3.quantile(C, 0.95), 20) // [0, 0.1, … 2.4, 2.5]`.
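Spelled out a little more (the array name, field name and log10 transform are placeholders mirroring the map above):

```js
// Values only, locations ignored; clamp the range at the 95th percentile.
const C = addresses.map(d => Math.log10(d.number));
const thresholds = d3.ticks(0, d3.quantile(C, 0.95), 20); // [0, 0.1, …, 2.4, 2.5]
```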

If I want to identify the region where the local average is higher than, say, the 90th percentile, I can feed that number directly to the random-walk spatial interpolator. With a bit of blurring to make smooth contours, it results in this map:

[map1: region where the local average exceeds the 90th percentile]
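A hedged sketch of how that could be expressed with Plot's contour mark and its built-in "random-walk" interpolator; the field names, blur radius and p90 value are placeholders:

```js
Plot.contour(addresses, {
  x: "longitude",
  y: "latitude",
  fill: "number",
  interpolate: "random-walk", // spatial interpolation of the scattered samples
  blur: 4,                    // smooth the interpolated grid so contours come out smooth
  thresholds: [p90]           // a single threshold at the 90th percentile
}).plot()
```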

Finally, this should be “clipped” so that it doesn't extend too far into the “void” (which is not actually void in this case, but neighboring départements). To do this, I'm using a modified version of interpolatorRandomWalk with a maximum distance threshold (https://github.com/observablehq/plot/issues/2032):

[map2: the same map, clipped with a maximum-distance threshold]

I hope some of this makes sense for what you are doing. I started by prototyping a "mean" spatial interpolator, but the results I got on this dataset were not great, because the "void" had too much of an impact on the "local mean". (I'll share the code soon.)

Note: I'm using Plot because it's faster to iterate — but you can use its spatial interpolators as independent functions: they're not tied to Plot charts.

keller-mark commented 6 months ago

> I can just work on the values (C) and ignore the locations. Averaging locally might tone down a few outliers and shrink this extent a little, but I can mimic that by taking a percentile.

Thank you so much for this suggestion! I am playing with this idea now. I think it is a better fit for my use case, as it will make a legend for the contours more interpretable (it can be expressed in terms of percentiles), and also because the gene expression distributions can be very skewed, so the mean is probably not the best option anyway.

Further, I had previously considered pre-computing the contour thresholds and/or the grids server-side, but with this approach I may be able to pre-compute the percentiles, which would be much simpler.