geoalchemy / geoalchemy2

Geospatial extension to SQLAlchemy
http://geoalchemy-2.readthedocs.org
MIT License
629 stars 112 forks source link

[Bug report] Spatialite queries don't appear to use Spatial Indexes? #520

Open choosehappy opened 1 month ago

choosehappy commented 1 month ago

Describe the bug

I set up a simple sqlachelmy database with a single geometry:


from sqlalchemy_utils import database_exists, create_database
db_path = 'sqlite:///spatialite_ray.db'
engine = create_engine(db_path, echo=False)  # Set echo=True to see SQL commands being executed

# Initialize Spatialite extension
@event.listens_for(engine, "connect")
def connect(dbapi_connection, connection_record):
    dbapi_connection.enable_load_extension(True)
    dbapi_connection.execute('SELECT load_extension("mod_spatialite")')
    dbapi_connection.execute('SELECT InitSpatialMetaData(1)')

# Create a base class for our declarative mapping
Base = declarative_base()

# Define your SQLAlchemy model
class GeometryModel(Base):
    __tablename__ = 'geometries'
    id = Column(Integer, primary_key=True)
    name = Column(String)
    geom = Column(Geometry('POLYGON'))

Base.metadata.create_all(engine)

I then query this for an intersection based on an arbitraty bounding box:

%%time
query = session.query(GeometryModel).filter(GeometryModel.geom.ST_Intersects(func.BuildMbr(
            centroid_x - half_bbox_size,
            centroid_y - half_bbox_size,
            centroid_x + half_bbox_size,
            centroid_y + half_bbox_size
        )))

results=query.all()
len(results)

This yields the correct results, but takes a very long time (~7 seconds), versus PostGIS (~43ms)

for the engine i set echo = True and see the query:


2024-08-14 00:25:45,980 INFO sqlalchemy.engine.Engine BEGIN (implicit)
2024-08-14 00:25:45,984 INFO sqlalchemy.engine.Engine SELECT geometries.id AS geometries_id, geometries.name AS geometries_name, AsEWKB(geometries.geom) AS geometries_geom 
FROM geometries 
WHERE ST_Intersects(geometries.geom, BuildMbr(?, ?, ?, ?))
2024-08-14 00:25:45,986 INFO sqlalchemy.engine.Engine [no key 0.00186s] (29139.1654723127, 29864.74657980455, 31139.1654723127, 31864.74657980455)

I copy and paste this into sqlite3 and see that indeed a full scan has taken place:

sqlite> explain query plan  SELECT geometries.id AS geometries_id, geometries.name AS geometries_name, AsEWKB(geometries.geom) AS geometries_geom
FROM geometries
WHERE ST_Intersects(geometries.geom, BuildMbr(29139.1654723127, 29864.74657980455, 31139.1654723127, 31864.74657980455));
;
QUERY PLAN
`--SCAN geometries

even though it does indeed think an index exists:

sqlite> .tables
ElementaryGeometries                spatial_ref_sys_all
KNN                                 spatial_ref_sys_aux
SpatialIndex                        spatialite_history
data_licenses                       sql_statements_log
geom_cols_ref_sys                   vector_layers
geometries                          vector_layers_auth
geometry_columns                    vector_layers_field_infos
geometry_columns_auth               vector_layers_statistics
geometry_columns_field_infos        views_geometry_columns
geometry_columns_statistics         views_geometry_columns_auth
geometry_columns_time               views_geometry_columns_field_infos
idx_geometries_geom                 views_geometry_columns_statistics
idx_geometries_geom_node            virts_geometry_columns
idx_geometries_geom_parent          virts_geometry_columns_auth
idx_geometries_geom_rowid           virts_geometry_columns_field_infos
spatial_ref_sys                     virts_geometry_columns_statistics
sqlite> SELECT * FROM geometry_columns;
geometries|geom|3|2|-1|1
sqlite>

Is this the expected behavior? How can one enforce SpatialIndex to be used?

Optional link from https://geoalchemy-2.readthedocs.io which documents the behavior that is expected

No response

To Reproduce

see code above

Error

Spatial Index is not being used for intersection query

Additional context

No response

GeoAlchemy 2 Version in Use

0.15.2

Python Version

3.10.12

Operating system

Linux

adrien-berchet commented 1 month ago

Hi @choosehappy As far as I can see, GeoAlchemy2 translates the query as it is intended. In my opinion this is more related to SpatiaLite. Maybe you can try to revert the WHERE clause? WHERE ST_Intersects(BuildMbr(29139.1654723127, 29864.74657980455, 31139.1654723127, 31864.74657980455), geometries.geom);

choosehappy commented 1 month ago

Thanks for your quick feedback. My understanding was that GeoAlchemy2's goal is to translate queries into the specific dialects, keeping in mind their unique caveats.

I've done a lot of reading on this, and also my understanding (please someone correct me if i'm wrong!!), is that Spatilate does not "organically" use the R-tree index as it should, and instead this must be explicitly modeled using a "IN" statement

https://www.gaia-gis.it/fossil/libspatialite/wiki?name=SpatialIndex https://www.gaia-gis.it/gaia-sins/spatialite-cookbook/html/rtree.html https://blog.jrg.com.br/2016/05/23/Querying-spatial-data-using-spatialite-with-indexes/index.html

where we see essentially a similar query written like this:

SELECT * 
FROM com2011 AS c, prov2011 AS p
WHERE ST_CoveredBy(c.geometry, p.geometry) = 1 
  AND nome_pro = 'AREZZO' AND c.ROWID IN (
    SELECT ROWID 
    FROM SpatialIndex
    WHERE f_table_name = 'com2011' 
        AND search_frame = p.geometry);

noting the SpatialIndex is used as an index to select the rows

So while the GeoAlchemy2 query does yield the correct answer, as designed it would not follow this convention, and thus will always perform a full-table scan instead of leveraging the index.

I wanted to first confirm if (a) that is other folks understanding, and (b) if that is the intended functionality?

Does that make sense?

adrien-berchet commented 1 month ago

My understanding was that GeoAlchemy2's goal is to translate queries into the specific dialects, keeping in mind their unique caveats.

Well, yes, to a certain extent. Basically, it's goal is to provide a uniform syntax which is then translated to any SQL dialect (Postgresql, MySQL, SQLite, etc.). But this translation is explicit, GeoAlchemy2 is not supposed to fix the queries depending on the dialect.

So while the GeoAlchemy2 query does yield the correct answer, as designed it would not follow this convention, and thus will always perform a full-table scan instead of leveraging the index.

Indeed, that's unfortunate but it's because Spatialite does not follow the widely used conventions, so extra work is required by users.

So your conclusion makes sense and even using GeoAlchemy2 you will have to create custom queries for this dialect.

choosehappy commented 1 month ago

Okay! Perfect, thanks for your confirmation and i understand. I thought i was going crazy there for a second : )

Out of curiosity, do you have a sense of how onerous it would be to writer a spatialite parser to internally automatically convert into the more efficient version?

I haven't checked yet, but my suspicion is that the difference in computer time will go from 7s to < 10ms, so i might be able to justify the investment if its fairly straightforward and tight in score

Thoughts?

adrien-berchet commented 1 month ago

As far as I understand, what you are talking about is basically a query planner for the query planners, so it's quite a huge work :⁠-⁠)

But it's surprising that the real query planner is not able to use the spatial index in your case, which seems quite simple. Maybe you use an old version of Spatialite?

choosehappy commented 1 month ago

not that old - i have a fresh ubuntu instal and did an apt install

sqlite> SELECT load_extension("mod_spatialite");

sqlite> select sqlite_version(), spatialite_version(), proj4_version(), geos_version()
   ...> ;
3.37.2|5.0.1|Rel. 8.2.1, January 1st, 2022|3.10.2-CAPI-1.16.0
sqlite>

looks like 5.0.1 is installed, and the absolute latest is 5.1

doesn't seem to be any related changes in at least the last 4 years:

https://www.gaia-gis.it/fossil/libspatialite/timeline

from my understand of the documents its not "in my case", i'm pretty sure its in any case it doesn't use the spatial index - one needs to write queries specifically to take advantage of it?

adrien-berchet commented 1 month ago

Indeed, as far as I understand, Spatialite is just never able to use the spatial index on its own, which is quite a pity.

But after some more thinking, I guess we could create a helper to update the queries for Spatialite. Basically we would have to check if the spatial index exists and, if it does, update the query to explicitly use the spatial index. I'm not sure how to do this with SQLAlchemy though, some investigation would be required but I won't have time for this for a while. If anyone has time to investigate this it could be very interesting.

choosehappy commented 1 month ago

Wanted to quickly report back --- this is performance on a spatialite database with ~1 million rows

TL;DR - spatial index improves performance by ~16x : )

Query consists of: 1) loads a random polygon from the database 2) gets the centroid 3) creates a bounding box around that centroid of a specific size 4) does a spatial query against the 1 million rows to find intersections

With the parameters below ~250 rows are returned (naturally longer query times for larger result sets, and "AsGeoJSON" ain't exactly free....)

Naive approach (full table scan):

1.62 s ± 34.8 ms per loop


%%timeit
random_index = random.randint(1, total_rows)

half_bbox_size= 100
# Step 3: Query for the specific row based on the random index
with engine.connect() as conn:
    random_row = conn.execute(
        text(f'''
        SELECT id, AsGeoJSON(ST_Centroid(geom)) as centroid 
        FROM geometries 
        WHERE id = {random_index}
        ''')
    ).fetchone()

    centroid_x,centroid_y=orjson.loads(random_row[1])['coordinates']

    bounding_box_polygons = conn.execute(
        text(f'''
        SELECT id, AsGeoJSON(geom) as geom 
        FROM geometries 
        WHERE MbrIntersects(
            geom,
            BuildMBR(
                {centroid_x - half_bbox_size}, {centroid_y - half_bbox_size},
                {centroid_x + half_bbox_size}, {centroid_y + half_bbox_size}
            )
        )
        ''')
     ).fetchall()

print(len(bounding_box_polygons), end= " " )

Rewritten to use the Spatial index:

97.6 ms ± 15.1 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

%%timeit

random_index = random.randint(1, total_rows)

half_bbox_size= 100
# Step 3: Query for the specific row based on the random index
with engine.connect() as conn:
    random_row = conn.execute(
        text(f'''
        SELECT id, AsGeoJSON(ST_Centroid(geom)) as centroid 
        FROM geometries 
        WHERE id = {random_index}
        ''')
    ).fetchone()

    centroid_x,centroid_y=orjson.loads(random_row[1])['coordinates']

    bounding_box_polygons = conn.execute(
        text(f'''
        SELECT id, AsGeoJSON(geom) as geom 
        FROM geometries 
        WHERE geometries.ROWID IN (
        SELECT ROWID
        FROM SpatialIndex
        WHERE f_table_name = 'geometries'
           AND search_frame =             BuildMBR(
                {centroid_x - half_bbox_size}, {centroid_y - half_bbox_size},
                {centroid_x + half_bbox_size}, {centroid_y + half_bbox_size}
            ))
        ''')
     ).fetchall()

print(len(bounding_box_polygons), end= " " )

Thought technically according to the documentation:

https://www.gaia-gis.it/fossil/libspatialite/wiki?name=SpatialIndex

*1. VirtualSpatialIndex always take care to slightly extend by a little bit the filtering rectangle; so no Geometry will never badly lost.

  1. Eventually, more Geometries than strictly expected could be returned; but this never is an issue, because a Spatial Index query simply is a quick approximative spatial filter necessarily implying some more accurate refinement e.g. by using ST_Intersects().*

so one needs to refine the query, like so, but this had an inconsequential impact on search time:

93.5 ms ± 20.1 ms per loop

%%timeit

random_index = random.randint(1, total_rows)

half_bbox_size= 100
# Step 3: Query for the specific row based on the random index
with engine.connect() as conn:
    random_row = conn.execute(
        text(f'''
        SELECT id, AsGeoJSON(ST_Centroid(geom)) as centroid 
        FROM geometries 
        WHERE id = {random_index}
        ''')
    ).fetchone()

    centroid_x,centroid_y=orjson.loads(random_row[1])['coordinates']

    bounding_box_polygons = conn.execute(
        text(f'''
        SELECT id, AsGeoJSON(geom) as geom 
        FROM geometries 
        WHERE MbrIntersects(
            geom,
            BuildMBR(
                {centroid_x - half_bbox_size}, {centroid_y - half_bbox_size},
                {centroid_x + half_bbox_size}, {centroid_y + half_bbox_size}
            )
        ) AND geometries.ROWID IN (
        SELECT ROWID
        FROM SpatialIndex
        WHERE f_table_name = 'geometries'
           AND search_frame =             BuildMBR(
                {centroid_x - half_bbox_size}, {centroid_y - half_bbox_size},
                {centroid_x + half_bbox_size}, {centroid_y + half_bbox_size}
            ))
        ''')
     ).fetchall()

print(len(bounding_box_polygons), end= " " )
choosehappy commented 1 month ago

, I guess we could create a helper to update the queries for Spatialite.

I took a look at the code base here and its very much not obvious to me how one would even start. Is there someplace i could look to get a general sense of how this would work?

adrien-berchet commented 1 month ago

Wanted to quickly report back --- this is performance on a spatialite database with ~1 million rows

TL;DR - spatial index improves performance by ~16x : )

Yeah, spatial indexes are powerful ;-)

I took a look at the code base here and its very much not obvious to me how one would even start. Is there someplace i could look to get a general sense of how this would work?

I think that the first step is to write the 'right' query manually using GeoAlchemy2. Then if we want to create a function to update the query automatically, we will have to introspect the 'naive' query (especially the query.whereclause part) to detect if there is a spatial sub-clause in the WHERE clause, then remove this sub-clause and add the optimized sub-query that uses the spatial index. To be honest I don't really know how to do that. I think it's possible to manipulate the query.whereclause.clauses, which is a kind of list of clauses, to remove what we want. Then we can just do something like new_query = query.where(spatial_index_sub_query) to add what we want.

Here is quick POC which adds the spatial index subquery. I think it works and I'm actually not sure we want to remove the initial spatial WHERE clauses, since the spatial index is just just a quick approximative filter, so we still need to perform the actual spatial check on the filtered geometries. So maybe this POC is already complete, I'm not sure :smiley: Note that I didn't run any benchmark on this, so maybe it does not work correctly yet.

from functools import partial

from sqlalchemy import Column
from sqlalchemy import Table
from sqlalchemy import create_engine
from sqlalchemy import Integer
from sqlalchemy import MetaData
from sqlalchemy import func
from sqlalchemy import select
from sqlalchemy import text
from sqlalchemy.event import listen
from sqlalchemy.orm import declarative_base

from geoalchemy2 import Geometry
from geoalchemy2 import load_spatialite

db_connection_url = "sqlite:///test_database.sqlite"
engine = create_engine(db_connection_url, echo=True)
load_spatialite_wgs84 = partial(load_spatialite, init_mode="WGS84")
listen(engine, "connect", load_spatialite_wgs84)
conn = engine.connect()

metadata = MetaData()
Base = declarative_base(metadata=metadata)

class Point(Base):
    __tablename__ = "point"
    id = Column(Integer, primary_key=True)
    geom = Column(Geometry(srid=4326, geometry_type="POINT"))

metadata.drop_all(conn, checkfirst=True)
metadata.create_all(conn)
point_spatial_index = Table("SpatialIndex", metadata, autoload_with=conn)

print("=================================================")
print(point_spatial_index)
print("=================================================")

# Insert point
conn.execute(
    Point.__table__.insert(),
    [
        {"geom": "SRID=4326;POINT (0 0)"},
        {"geom": "SRID=4326;POINT (1 1)"},
    ],
)

# Query point
compared_element = "SRID=4326;POLYGON((-1 -1, -1 1, 1 1, 1 -1, -1 -1))"
query = Point.__table__.select().where(Point.__table__.c.geom.ST_Intersects(compared_element), Point.__table__.c.geom.ST_Overlaps(compared_element))

print("#################################################")
print(query)
print("#################################################")

index_query = select(text("ROWID")).where(point_spatial_index.c.f_table_name == Point.__tablename__).where(point_spatial_index.c.search_frame == func.BuildMBR(-10, -10, 10, 10))

print("+++++++++++++++++++++++++++++++++++++++++++++++++")
print(index_query)
print("+++++++++++++++++++++++++++++++++++++++++++++++++")

new_query = query.where(Point.__table__.c.id.in_(index_query))

print("$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$")
print(new_query)
print("$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$")

res = conn.execute(new_query).fetchall()

print("*************************************************")
print(res)
print("*************************************************")