GenericMappingTools / gmt

The Generic Mapping Tools
https://www.generic-mapping-tools.org
Other
837 stars 348 forks source link

A simpler way to count values inside a circle #3942

Closed PaulWessel closed 3 years ago

PaulWessel commented 4 years ago

Description of the desired feature

Requests such as this come up from time to time. It is a reasonable request: Given a table of random x,y[,z] values, make a grid that counts how many point falls inside a circle of some radius centered on each grid node, or what the sum is. I do not think we have a simple answer. The one I gave is basically this recipe:

  1. Decide on your grid specifications and make a dummy empty grid with grdmath
  2. Dump the grid with grd2xyz to ascii
  3. Write a script that loops over the records in this list of nodes
  4. For each record, dump it to a point.txt file
  5. Run gmtselect -Cpoint.txt +dradius on the whole dataset, then count the number of records found
  6. Add the node location and the count to a growing output table
  7. Run xyz2grd on that table to make the final grid

Clearly, this is only easy for gurus. Am I missing a simpler approach? If not, then I think such a list of steps require a C solution. Various options:

I imagine the usual modifications: You may want the count, the sum, and there may be a 4th column with weights, or perhaps you want the mean, median or mode, or some other statistic. Maybe you want to normalize the result by the area of the circle. The "get points inside geographic circle" step gets used in many modules, e.g., grdfilter, but it reads a grid. If we wanted to handle all the potential desired results then I think a new option to an existing tool will be very cumbersome. So, maybe gmtcount is the best suggestion?

gmt count table -Rregion -Iinc [-r] -Ca|m|p|l|L|u|U|n|s [-W] -Ggridfile -N -Sradius[u]

PaulWessel commented 4 years ago

As you might have guessed, I am about to finish the work on this module, @seisman. I have not yet settled on a name. The process can compute a whole long list of statistics from the data inside the circle, from sum, n, mean, median, etc to std, MAD, inter-quartile range, rms, etc. I think this will be a very useful module. It cannot be part of gmtspatial since it is completely full of options, but it is a spatial tool that does statistics. gmtcount is a bit too specific and count is only one of 15 types of outputs. Maybe

Any you like? Any better?

joa-quim commented 4 years ago

So this is a module that produces grids but doesn't (yet) ingest them. On the other hand it now only looks inside circles, but it costs nothing to guess that it soon will accept all that gmtselect can select. So, what about

And talking on shapefiles, you realize that you are procrastinating heavily. There are two branches of GMT that need serious work:

  1. Finish the ogrread (passing the imported data to the GMT data structs)
  2. Make all map projections accept proj4/epsg
PaulWessel commented 4 years ago

Well, gmtselect does not work along grid nodes so it is quite different I think, and I think "shape" is a bit misleading. Most will think shapefile, but there is nothing about shape here. THe circle is not one of many shapes, it simply sets the neighborhood for each node, just like blockmean etc sets the neighborhood for each node in terms of a square. Don't know about you, but I have had to resort to the above 1-7 recipe too many times to let this go on. As for ogrread etc I have no defense but being distracted by all the department stuff. Please merge master into your branches to make sure they dont fall too far behind.

joa-quim commented 4 years ago

Yes, we start with circles but next come rectangles, hexagons (BTW, one time we'll need to address the hexagonal binning and plotting), then polygons and there we are: shapes. I know it brings the shapefiles but we can't be made hostage of ESRI naming.

There is no OGR branch to merge. All is in master.

PaulWessel commented 4 years ago

Hexagonal is nice but does not poduce a GMT compatible grid due to the oscillating coordinates of the centers. So that would require a whole machinery to store hexagon tables and plotting them. I guess we already know how to plot hexagons in plot so I can see how this goes. But seems like not of that needs to be implemented on day 1 (well 2) before gmt*** is useful, no?

joa-quim commented 4 years ago

Yes, the hexagonal thing is nice on its own and can be different from the issue in question. But it's exactly because we don't need them all in day one that I'm raising the shapes guess

joa-quim commented 4 years ago

Still on hexbin, I once start to think on this and even found a paper with the apparent quickest algorithm but didn't keep it. I thought on the possibility of having a 2 layers grid that each is a regular grid. But ofc, the plot would have to know about this particular grid structure.

PaulWessel commented 4 years ago

Yes, we start with circles but next come rectangles, hexagons (BTW, one time we'll need to address the hexagonal binning and plotting), then polygons and there we are: shapes. I know it brings the shapefiles but we can't be made hostage of ESRI naming.

I prefer to leave the shape issue off the name regardless of how many ways we may eventually add beyond distance. Maybe gmtgridstat ?

joa-quim commented 4 years ago

gmtgridstat is not bad.

PaulWessel commented 4 years ago

For GMT, we would need geo-hexagons, just like we have geo-ellipses and rectangles. So great circle radial spokes to get to the 6 vertices, then connect those vertices with great circles I think (since small circles would not be the same for adjacent hexagons.

PaulWessel commented 4 years ago

What does @seisman think about gmtgridstat ?

seisman commented 4 years ago

gmtgridstat isn't a bad name, but may be a little confusing with grdstat (although we don't have the module yet).

weiji14 commented 4 years ago

gmtpolystats? Is the intention to include circles, hexagons, (and other polygon shapes)?

PaulWessel commented 4 years ago

Problem with both gridstat and polystat is that they both imply something about statistics for grids and polygons, so not capturing the nature of the module. While they are way down on the list, words like summarize and analyze are closer to what the app does, but those are also very general terms. It partitions data then report statistics on each partition. So both splitting data and analyzing the bits. gmtareastat is better than gmtpolystat for that reason I think, but not happy with that either...

weiji14 commented 4 years ago

I want to say gmtgeostats but that might be too bold, since Geostatistics is a very broad field. The hexagonal binning makes me think of gmtbinstats, or something like gmtvectorstats.

PaulWessel commented 4 years ago

binstats is pretty good. geo is problematic since GMT is also used on Cartesian data. But we clearly are binning data. gmtbinning, gmtbindata, ...

PaulWessel commented 4 years ago

After a long night's sleep I am closer to making a decision on this shortlist:

  1. gmtbinarea
  2. gmtbinstats
  3. gmtbin
  4. gmtareastats

The reasoning is that we are in fact binning data (the bins may be circles for now, but I will add hexagonal tiling soon), the bins do represent an area, and we are computing statistics of data inside each bin.

Comments welcome, @GenericMappingTools/core and @GenericMappingTools/gmt-contributors and @GenericMappingTools/python. You can see an early adopter using this tool here.

seisman commented 4 years ago

I like gmtbinstats.

KristofKoch commented 4 years ago

I associate binary with the bin moniker. Especially in a shell/terminal/bash context.

PaulWessel commented 4 years ago

OK, but there are no GMT table modules that only read binary. They can read ascii, netcdf, shapefile, native binary, depending on settings. So "bin" in a GMT module name cannot mean binary, no?

seisman commented 4 years ago

What about blockstats?

joa-quim commented 4 years ago

gmtbinning, drop the stats. Count, min,max, sum are not exactly stats, and it closer in spirit to for example hexagonal binning.

PaulWessel commented 4 years ago

All the modules that use a verb use it as imperative. So we have contour, convert, regress, interpolate, etc. None are in the case ending in "-ing". So gmtbinning would be breaking the pattern. I think n, min, max etc are statistics. Mn/max are the 0 and 100% quantiles, for instance, n is sample size. So all regular statistics.

KristofKoch commented 4 years ago

OK, but there are no GMT table modules that only read binary. […] So "bin" in a GMT module name cannot mean binary, no?

I totally agree when standing in the GMT universe but looking from a unixoid background bin is already "taken". I have the feeling this might be a source of confusion for future users. My point may also be rooted in the fact that I don't use the word in a mathematical/topological context and it therefore has a much narrower meaning for me. Following the discussion here it looks like bin might be the best choice none the less.

PaulWessel commented 4 years ago

For geographic data the circle is great, but depending on radius the user may have data falling between circles. We probably should warn if that happens (circle to small). To cover all area with circle means circles will overlap, hence it is a filtering process (statistics at node A is somewhat correlated with statistics at node B next door). If you want complete independence then for geographic data you are going to have rectangular cells (think blockm) where the area of the cells vary. We have no representation of equal-area bins on a spherical Earth. What I have done in the past is to use an equal-area projection like -Jy and do blockmean on those projected data. For Cartesian data with same units in x and y the hexagonal tiling is optimal for independent statistics, while for Cartesian data with unequal units I think we must offer a rectangular bin, which is already what block does, except it does not deal with all the statistics. So I think the tool we discuss here can be limited to

  1. Spherical, geographic binning by (overlapping) circles
  2. Cartesian equal-axis binning by hexagonal tiling
  3. Cartesian equal-axis binning by (overlapping) circles
  4. Cartesian binning by rectangular tiling (equal or non-equal axes)

Thus, implementing a rectangle option to complement radius and hexagon seems like a good extension.

seisman commented 3 years ago

Close the issue? It seems the feature has been implemented in gmtbinstats.

PaulWessel commented 3 years ago

Yes, done.