Open JohnHalleyGotway opened 1 year ago
A couple of details: The Voronoi tessellation is unique. If you want to consider all parts of a geometric space closer to a given point than to any other point (from a set of points), this is what you will get. This also holds if you use something other than a Euclidean distance measure.
While checking out some history on Voronoi tessellation, I found that it was used (not under that name) by meteorologist Thiessen in 1911 -- https://en.wikipedia.org/wiki/Alfred_H._Thiessen So I'm not very original in suggesting it.
Masked Regions: I would handle the area consideration for masked regions a little differently than perhaps you're thinking (and also for other reasons I'll mention).
The Voronoi tessellation divides the plane or the surface of the sphere into unique regions, and one can find the area of those regions. But since we're considering gridded model output, my mental image is to first assign grid cells to the observation point they are closest to, and then sum those grid cell areas as we sum cell_area * (model - obs) (et al.) for each of those grid points. So bias = sum[cell_area * (model - obs)] / sum[cell_area].
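A minimal sketch of that mental image, assuming a toy grid and two stations (all names and values here are illustrative, not MET code): each grid cell is assigned to its nearest station (a discrete Voronoi tessellation), and the bias is the area-weighted sum of errors over all cells.

```python
import numpy as np

# Grid cell centers (x, y), per-cell areas, and model values (illustrative)
cells = np.array([[0.5, 0.5], [1.5, 0.5], [0.5, 1.5], [1.5, 1.5]])
cell_area = np.array([1.0, 1.0, 0.8, 0.8])  # cells need not be equal area
model = np.array([10.2, 11.0, 9.5, 10.8])

# Station locations and observed values
stations = np.array([[0.4, 0.6], [1.6, 1.4]])
obs = np.array([10.0, 10.0])

# Nearest-station index for every grid cell (Euclidean distance here)
d = np.linalg.norm(cells[:, None, :] - stations[None, :, :], axis=2)
nearest = d.argmin(axis=1)

# bias = sum[cell_area * (model - obs)] / sum[cell_area]
err = model - obs[nearest]
bias = np.sum(cell_area * err) / np.sum(cell_area)
print(round(bias, 4))
```

With a different distance measure (e.g. great-circle distance on the sphere), only the computation of `d` changes; the assignment and the weighted sum are the same.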
Then there are elaborations which can be handled more or less neatly (again, using grid cell area, not Voronoi cell area):
For the latter two cases, consider air temperature observations over land across the US. Inside Oklahoma, it's the original straightforward case. But consider Honolulu, San Francisco, and Anchorage. On a Voronoi tessellation of the globe, they will be partitioning the eastern Pacific Ocean, an enormous area. While it may be reasonable to use a land station for model points 20-100 km away, it is probably not reasonable to use it for points 1000 km away (hence a user argument, or a default of, say, 300 km). In some cases, it isn't a good idea to use land observations over the sea, or vice versa, so that's another user option (defaulting to use observations everywhere, say).
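The distance cutoff can be folded into the same nearest-station assignment. A sketch, assuming distances are already in km (the function name and the -1 sentinel are inventions of this example): cells farther than the cutoff from every station are simply excluded from the sums.

```python
import numpy as np

def assign_with_cutoff(cells_km, stations_km, max_dist_km=300.0):
    """Nearest-station index per cell, or -1 if beyond the cutoff."""
    d = np.linalg.norm(cells_km[:, None, :] - stations_km[None, :, :], axis=2)
    nearest = d.argmin(axis=1)
    too_far = d.min(axis=1) > max_dist_km
    nearest[too_far] = -1  # flag: treat this cell as unassigned / missing
    return nearest

# One station; the third cell is ~890 km away and gets dropped
cells = np.array([[0.0, 0.0], [50.0, 0.0], [900.0, 0.0]])
stations = np.array([[10.0, 0.0]])
print(assign_with_cutoff(cells, stations))  # → [ 0  0 -1]
```

The land/sea option would be a second exclusion of the same form, masking cells whose surface type differs from the station's.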
Area Weighting of Contingency tables: Yes, one can certainly use areas in the tables (mathematically). I've been doing it for years, and wasn't being creative in doing so. Consider a lat-lon grid, 5 arcmin resolution. There are 4320 points in longitude. At 60 North, they cover 20,000 km length. At the last row, also 4320 points, they cover 29 km. For this reason, I avoid lat-lon grids. But I can't always do so.
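The numbers above check out with a quick back-of-the-envelope computation (circumference value approximate; the last-row latitude assumes cell centers sit half a cell from the pole):

```python
import math

# 5-arcminute global grid: 360 * 60 / 5 = 4320 points per row
earth_circumference_km = 40075.0
npoints = int(360 * 60 / 5)
print(npoints)  # 4320

# Row length scales with cos(latitude): ~20,000 km at 60N
row_60N = earth_circumference_km * math.cos(math.radians(60.0))
print(round(row_60N))  # ~20038 km

# Last row: center at 90 - (5/60)/2 degrees, only ~29 km long
last_lat = 90.0 - 5.0 / 60.0 / 2.0
row_last = earth_circumference_km * math.cos(math.radians(last_lat))
print(round(row_last))  # ~29 km
```

A ~700:1 ratio in cell width between rows is why equal weighting of lat-lon grid points is so misleading.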
In the future, modeling may well be moving to unstructured grids where, again, the grid cell sizes can be wildly different so area weighting will be important.
For most categorical measures, you can use either the integer counts or the proportions, so it is ok to have floating-point values in there. However, I'm not sure what weighting by number of stations will do; I'd have to think about it more. Offhand, it seems like it might bias the results based on station density rather than actual occurrences for the cells.
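To make the proportions point concrete, here is a sketch of a 2x2 contingency table where each (forecast, observation) pair contributes its weight instead of a count of 1 (the variable names are illustrative). With unit weights this reduces to the usual integer table, and ratio-based scores carry over unchanged to floating-point totals.

```python
import numpy as np

# Event yes/no for forecast and obs at 5 matched pairs, plus weights
# (e.g. Voronoi cell areas)
fcst_event = np.array([True, True, False, False, True])
obs_event  = np.array([True, False, False, True, True])
weight     = np.array([2.0, 0.5, 1.0, 1.0, 0.5])

# Weighted 2x2 table: sums of weights instead of integer counts
hits         = weight[ fcst_event &  obs_event].sum()
false_alarms = weight[ fcst_event & ~obs_event].sum()
misses       = weight[~fcst_event &  obs_event].sum()
corr_neg     = weight[~fcst_event & ~obs_event].sum()

# Ratio-based scores are unaffected by the switch to floats
pod = hits / (hits + misses)                 # probability of detection
far = false_alarms / (hits + false_alarms)   # false alarm ratio
print(round(pod, 4), round(far, 4))
```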
I think the bigger issue concerns the masking. It does seem to make sense to mask out certain areas, like the ocean, from certain other areas. Can the dense regions be masked out from the less dense regions? That would be my preference.
Another, fairly simple, solution would be to find a coverage design and thin the network so that the densities are more similar across the spatial field. Of course, if you have some really sparse areas, then you might still have a bit of an issue (though less so). While such an approach uses fewer data, time permitting you could calculate the statistics over multiple coverage designs in order to get a sense of the uncertainty pertaining to the choice of point locations. I believe we are adding coverage design capabilities already?
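A crude stand-in for the thinning idea (not a true coverage design, which optimizes a coverage criterion): greedily keep a station only if it is at least some minimum separation from every station already kept. Repeating with shuffled orderings yields multiple thinned networks, which is one way to probe the uncertainty mentioned above. All names here are inventions of this sketch.

```python
import numpy as np

def thin(points, min_sep, rng=None):
    """Greedy thinning: keep points at least min_sep from all kept points."""
    order = np.arange(len(points))
    if rng is not None:
        rng.shuffle(order)  # a different order gives a different "design"
    kept = []
    for i in order:
        if all(np.linalg.norm(points[i] - points[j]) >= min_sep for j in kept):
            kept.append(i)
    return sorted(kept)

# Two dense pairs of stations; thinning keeps one from each pair
pts = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.2, 5.1]])
print(thin(pts, min_sep=1.0))  # → [0, 2]
```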
As far as weighting denser areas more than less dense areas, it certainly is reasonable to use Voronoi tessellations.
Eric, a Voronoi tessellation is illustrated in this picture -- dividing North America into domains nearest to cities with population over 500,000. For illustration, let's say that we have 1 observation for each of them. In the usual way, which John and I are referring to as weighting by number of stations, a rather modest area in the Northeast has 10 stations. An equal or larger area around Las Vegas (35N 115W, approx.) has only one station. In computing model bias (say), the 10 stations in the Northeast are 10 times as important as the 1 station in Las Vegas.
With area weighting, we say that the area nearest Las Vegas (which is what the Voronoi tessellation determines, for all points) has the same weight as the area nearest those 10 stations covering the same area.
For specificity, consider the northeast stations all to have a 1 degree temperature bias, and Las Vegas to have 10 degrees. Station weighting gives a bias of (10*1 + 1*10)/11, or 1.8181... degrees, and we might think the model is doing fairly well. Area weighting says it is (1*1 + 1*10)/2, for 5.5 degrees, and we'd conclude the model is not doing very well.
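The worked example above, as a few lines of arithmetic (the weights of 0.1 per northeast station encode the assumption that the ten northeast cells together cover the same area as the Las Vegas cell):

```python
# 10 northeast stations with a 1-degree bias, Las Vegas with 10 degrees
biases = [1.0] * 10 + [10.0]

# Station weighting: every observation counts equally -> (10*1 + 1*10)/11
station_weighted = sum(biases) / len(biases)
print(round(station_weighted, 4))  # 1.8182

# Area weighting: ten cells share one unit of area, Las Vegas gets one unit
area_weights = [0.1] * 10 + [1.0]
area_weighted = sum(w * b for w, b in zip(area_weights, biases)) / sum(area_weights)
print(area_weighted)  # 5.5
```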
Yes, I understood what you were suggesting. It is a little like what I suggested about using a coverage design to thin the network. I don't have much experience with Voronoi diagrams, so maybe they are faster to compute than the coverage design, which can be slow depending on various factors. A coverage design approach would allow for gleaning information about the uncertainty in the choice of the network, whereas I'm not sure how you would obtain anything similar with the Voronoi tessellations. For example, suppose that the bias in the 10 stations is variable (so maybe some are 10 and some are 1), then what would happen with the Voronoi approach? Do you use an average bias for that cell?
The speed of the tessellation is 'very fast'. Put it this way: I've been using a data set with the locations and populations of a bit over 40,000 cities around the globe. The difference in run time between plotting the 28 North American metros over 2 million and all 42k cities with population over 0 is smaller than the run-to-run variability -- 15.1 s wall time regardless.
It's mostly the time to read and parse the ascii data for the cities. Algorithms for Voronoi Tessellation are very advanced and well-optimized.
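For scale, one readily available implementation is `scipy.spatial.Voronoi`, which wraps Qhull and handles tens of thousands of sites without difficulty (this is just an illustration of the library call, not MET code):

```python
import numpy as np
from scipy.spatial import Voronoi

# ~40k random "city" locations in a plane
rng = np.random.default_rng(0)
sites = rng.uniform(0.0, 100.0, size=(40000, 2))

# The tessellation itself is a single fast call; vor.point_region maps
# each input site to its Voronoi region
vor = Voronoi(sites)
print(len(vor.point_region))  # 40000
```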
Yes, I think it is worth adding the Voronoi tessellations not only for this purpose but perhaps for other purposes down the road.
Note that this issue is also related to the `point_weight_flag` requested by #2279.
Describe the New Feature
This enhancement was proposed by @rgrumbine via METplus discussion dtcenter/METplus#2315. As of MET version 11.1.0, when verifying against point observations, all points are treated equally. When point observations are not evenly distributed across a domain, as they almost never are, the resulting statistics over-sample the more dense locations and under-sample the less dense locations. This issue is to develop and implement an algorithm for addressing this representativeness problem.
@rgrumbine recommends applying Voronoi tessellations to this problem, using the area of the Voronoi cell to weight the observation it contains. There are several details to consider:
- Weighting should be applied to the computation of categorical, continuous, and ensemble statistics. While the existing `grid_weight_flag` option applies to continuous statistics, it is NOT applied to categorical ones. Need to address this discrepancy.
- Would need to change MET library code to store floating point weights rather than just integer counts. Can contingency tables with integer counts simply be replaced by sums of floating point weights?
- Need to add a corresponding `point_weight_flag` to Point-Stat and Ensemble-Stat, presumably with options for `NONE` and `VORONOI`. Should other algorithms be considered as well?
- How do Voronoi tessellations interact with masking regions? Points outside the mask are basically treated as missing data values. Do masking regions and embedded missing data impact the Voronoi weight computations?
- How to build acceptance for this proposed new algorithm?
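A hypothetical sketch of what the new option might look like in a Point-Stat or Ensemble-Stat config file, following the style of the existing `grid_weight_flag` entry (the option name and the `VORONOI` value are proposals from this issue, not released MET syntax):

```
//
// Proposed point observation weighting method (sketch only):
//   NONE    = all points weighted equally (current behavior)
//   VORONOI = weight each observation by the area of its Voronoi cell
//
point_weight_flag = NONE;
```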
Acceptance Testing
List input data types and sources. Describe tests required for new functionality.
Time Estimate
Estimate the amount of work required here. Issues should represent approximately 1 to 3 days of work.
Sub-Issues
Consider breaking the new feature down into sub-issues.
Relevant Deadlines
List relevant project deadlines here or state NONE.
Funding Source
Define the source of funding and account keys here or state NONE.
Define the Metadata
Assignee
Labels
Projects and Milestone
Define Related Issue(s)
Consider the impact to the other METplus components.
New Feature Checklist
See the METplus Workflow for details.
feature_<Issue Number>_<Description>
feature <Issue Number> <Description>