Update "Spatial Indexes" with more info re: spatial joins

rmloveland commented 4 years ago

Richard Loveland (rmloveland) commented:

As ably suggested by @rytaft, in addition to the information being added to the new 'Spatial Indexes' page in #8617, we should update the docs with more information about how spatial joins work, including examples.

8563
7695

Jira Issue: DOC-799

rytaft commented 4 years ago

Hi @rmloveland, There are a lot of examples of spatial filtering and joining in the PostGIS tutorial, where the queries will use an index if it's available. For example:

CREATE TABLE nyc_census_blocks (
  gid serial PRIMARY KEY,
  blkid varchar(15),
  popn_total float8,
  popn_white float8,
  popn_black float8,
  popn_nativ float8,
  popn_asian float8,
  popn_other float8,
  boroname varchar(32),
  geom GEOMETRY(MULTIPOLYGON,4326),
  INVERTED INDEX nyc_census_blocks_geo_idx (geom)
);

CREATE TABLE nyc_neighborhoods (
  gid serial PRIMARY KEY,
  boroname varchar(43),
  name varchar(64),
  geom GEOMETRY(MULTIPOLYGON,4326),
  INVERTED INDEX nyc_neighborhoods_geo_idx (geom)
);

ALTER TABLE nyc_census_blocks INJECT STATISTICS '[
  {
    "columns": ["gid"],
    "created_at": "2018-01-01 1:00:00.00000+00:00",
    "row_count": 38794,
    "distinct_count": 38794
  },
  {
    "columns": ["boroname"],
    "created_at": "2018-01-01 1:00:00.00000+00:00",
    "row_count": 38794,
    "distinct_count": 5
  }
]';

ALTER TABLE nyc_neighborhoods INJECT STATISTICS '[
  {
    "columns": ["gid"],
    "created_at": "2018-01-01 1:00:00.00000+00:00",
    "row_count": 129,
    "distinct_count": 129
  },
  {
    "columns": ["boroname"],
    "created_at": "2018-01-01 1:00:00.00000+00:00",
    "row_count": 129,
    "distinct_count": 5
  },
  {
    "columns": ["name"],
    "created_at": "2018-01-01 1:00:00.00000+00:00",
    "row_count": 129,
    "distinct_count": 129
  }
]';

-- This query calculates the population density of two different neighborhoods
-- in New York City.
EXPLAIN SELECT
  n.name,
  Sum(c.popn_total) / (ST_Area(n.geom) / 1000000.0) AS popn_per_sqkm
FROM nyc_census_blocks AS c
JOIN nyc_neighborhoods AS n
ON ST_Intersects(n.geom, c.geom) AND c.boroname = n.boroname
WHERE n.name = 'Upper West Side'
OR n.name = 'Upper East Side'
GROUP BY n.name, n.geom;

should output:

              tree              |         field         |                       description
--------------------------------+-----------------------+-----------------------------------------------------------
                                | distribution          | full
                                | vectorized            | false
  render                        |                       |
   └── group                    |                       |
        │                       | group by              | name, geom
        └── lookup join         |                       |
             │                  | table                 | nyc_census_blocks@primary
             │                  | equality              | (gid) = (gid)
             │                  | equality cols are key |
             │                  | pred                  | st_intersects(geom, geom) AND (boroname = boroname)
             └── inverted join  |                       |
                  │             | table                 | nyc_census_blocks@nyc_census_blocks_geo_idx
                  └── filter    |                       |
                       │        | filter                | (name = 'Upper West Side') OR (name = 'Upper East Side')
                       └── scan |                       |
                                | estimated row count   | 129
                                | table                 | nyc_neighborhoods@primary
                                | spans                 | FULL SCAN
(18 rows)

Notice the inverted join which uses the nyc_census_blocks_geo_idx inverted index.

(Note that I've created tables and injected statistics above. If you actually have the data from the tutorial loaded, you should be able to just run the query without creating tables or injecting statistics, although you may need to add the inverted indexes.)

For some other examples of queries in the tutorial that can take advantage of indexes, see https://github.com/cockroachdb/cockroach/blob/master/pkg/sql/opt/xform/testdata/external/postgis-tutorial-idx (don't worry about the format of that file, you should be able to just copy the queries and run those on the data from the tutorial, although similar to above you may need to create indexes).

Let me know if you have questions or want to discuss specifics, or if you'd rather not use examples from the PostGIS tutorial. We can probably adapt some of those queries to a different schema, or just use them as inspiration....

rmloveland commented 4 years ago

Thanks @rytaft, this is very helpful! I will play with these queries and try to come up with some interesting questions / things to discuss. We definitely need to get more of this info about filtering/joining in the spatial indexes docs.

cockroachdb / docs

Update "Spatial Indexes" with more info re: spatial joins #8638

8563

7695