Closed choosehappy closed 1 week ago
Hi @choosehappy
As far as I can see, GeoAlchemy2 translates the query as it is intended. In my opinion this is more related to SpatiaLite.
Maybe you can try to revert the WHERE clause? WHERE ST_Intersects(BuildMbr(29139.1654723127, 29864.74657980455, 31139.1654723127, 31864.74657980455), geometries.geom);
Thanks for your quick feedback. My understanding was that GeoAlchemy2's goal is to translate queries into the specific dialects, keeping in mind their unique caveats.
I've done a lot of reading on this, and also my understanding (please someone correct me if i'm wrong!!), is that Spatilate does not "organically" use the R-tree index as it should, and instead this must be explicitly modeled using a "IN" statement
https://www.gaia-gis.it/fossil/libspatialite/wiki?name=SpatialIndex https://www.gaia-gis.it/gaia-sins/spatialite-cookbook/html/rtree.html https://blog.jrg.com.br/2016/05/23/Querying-spatial-data-using-spatialite-with-indexes/index.html
where we see essentially a similar query written like this:
SELECT *
FROM com2011 AS c, prov2011 AS p
WHERE ST_CoveredBy(c.geometry, p.geometry) = 1
AND nome_pro = 'AREZZO' AND c.ROWID IN (
SELECT ROWID
FROM SpatialIndex
WHERE f_table_name = 'com2011'
AND search_frame = p.geometry);
noting the SpatialIndex is used as an index to select the rows
So while the GeoAlchemy2 query does yield the correct answer, as designed it would not follow this convention, and thus will always perform a full-table scan instead of leveraging the index.
I wanted to first confirm if (a) that is other folks understanding, and (b) if that is the intended functionality?
Does that make sense?
My understanding was that GeoAlchemy2's goal is to translate queries into the specific dialects, keeping in mind their unique caveats.
Well, yes, to a certain extent. Basically, it's goal is to provide a uniform syntax which is then translated to any SQL dialect (Postgresql, MySQL, SQLite, etc.). But this translation is explicit, GeoAlchemy2 is not supposed to fix the queries depending on the dialect.
So while the GeoAlchemy2 query does yield the correct answer, as designed it would not follow this convention, and thus will always perform a full-table scan instead of leveraging the index.
Indeed, that's unfortunate but it's because Spatialite does not follow the widely used conventions, so extra work is required by users.
So your conclusion makes sense and even using GeoAlchemy2 you will have to create custom queries for this dialect.
Okay! Perfect, thanks for your confirmation and i understand. I thought i was going crazy there for a second : )
Out of curiosity, do you have a sense of how onerous it would be to writer a spatialite parser to internally automatically convert into the more efficient version?
I haven't checked yet, but my suspicion is that the difference in computer time will go from 7s to < 10ms, so i might be able to justify the investment if its fairly straightforward and tight in score
Thoughts?
As far as I understand, what you are talking about is basically a query planner for the query planners, so it's quite a huge work :-)
But it's surprising that the real query planner is not able to use the spatial index in your case, which seems quite simple. Maybe you use an old version of Spatialite?
not that old - i have a fresh ubuntu instal and did an apt install
sqlite> SELECT load_extension("mod_spatialite");
sqlite> select sqlite_version(), spatialite_version(), proj4_version(), geos_version()
...> ;
3.37.2|5.0.1|Rel. 8.2.1, January 1st, 2022|3.10.2-CAPI-1.16.0
sqlite>
looks like 5.0.1 is installed, and the absolute latest is 5.1
doesn't seem to be any related changes in at least the last 4 years:
https://www.gaia-gis.it/fossil/libspatialite/timeline
from my understand of the documents its not "in my case", i'm pretty sure its in any case it doesn't use the spatial index - one needs to write queries specifically to take advantage of it?
Indeed, as far as I understand, Spatialite is just never able to use the spatial index on its own, which is quite a pity.
But after some more thinking, I guess we could create a helper to update the queries for Spatialite. Basically we would have to check if the spatial index exists and, if it does, update the query to explicitly use the spatial index. I'm not sure how to do this with SQLAlchemy though, some investigation would be required but I won't have time for this for a while. If anyone has time to investigate this it could be very interesting.
Wanted to quickly report back --- this is performance on a spatialite database with ~1 million rows
TL;DR - spatial index improves performance by ~16x : )
Query consists of: 1) loads a random polygon from the database 2) gets the centroid 3) creates a bounding box around that centroid of a specific size 4) does a spatial query against the 1 million rows to find intersections
With the parameters below ~250 rows are returned (naturally longer query times for larger result sets, and "AsGeoJSON" ain't exactly free....)
Naive approach (full table scan):
1.62 s ± 34.8 ms per loop
%%timeit
random_index = random.randint(1, total_rows)
half_bbox_size= 100
# Step 3: Query for the specific row based on the random index
with engine.connect() as conn:
random_row = conn.execute(
text(f'''
SELECT id, AsGeoJSON(ST_Centroid(geom)) as centroid
FROM geometries
WHERE id = {random_index}
''')
).fetchone()
centroid_x,centroid_y=orjson.loads(random_row[1])['coordinates']
bounding_box_polygons = conn.execute(
text(f'''
SELECT id, AsGeoJSON(geom) as geom
FROM geometries
WHERE MbrIntersects(
geom,
BuildMBR(
{centroid_x - half_bbox_size}, {centroid_y - half_bbox_size},
{centroid_x + half_bbox_size}, {centroid_y + half_bbox_size}
)
)
''')
).fetchall()
print(len(bounding_box_polygons), end= " " )
Rewritten to use the Spatial index:
97.6 ms ± 15.1 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%%timeit
random_index = random.randint(1, total_rows)
half_bbox_size= 100
# Step 3: Query for the specific row based on the random index
with engine.connect() as conn:
random_row = conn.execute(
text(f'''
SELECT id, AsGeoJSON(ST_Centroid(geom)) as centroid
FROM geometries
WHERE id = {random_index}
''')
).fetchone()
centroid_x,centroid_y=orjson.loads(random_row[1])['coordinates']
bounding_box_polygons = conn.execute(
text(f'''
SELECT id, AsGeoJSON(geom) as geom
FROM geometries
WHERE geometries.ROWID IN (
SELECT ROWID
FROM SpatialIndex
WHERE f_table_name = 'geometries'
AND search_frame = BuildMBR(
{centroid_x - half_bbox_size}, {centroid_y - half_bbox_size},
{centroid_x + half_bbox_size}, {centroid_y + half_bbox_size}
))
''')
).fetchall()
print(len(bounding_box_polygons), end= " " )
Thought technically according to the documentation:
https://www.gaia-gis.it/fossil/libspatialite/wiki?name=SpatialIndex
*1. VirtualSpatialIndex always take care to slightly extend by a little bit the filtering rectangle; so no Geometry will never badly lost.
so one needs to refine the query, like so, but this had an inconsequential impact on search time:
93.5 ms ± 20.1 ms per loop
%%timeit
random_index = random.randint(1, total_rows)
half_bbox_size= 100
# Step 3: Query for the specific row based on the random index
with engine.connect() as conn:
random_row = conn.execute(
text(f'''
SELECT id, AsGeoJSON(ST_Centroid(geom)) as centroid
FROM geometries
WHERE id = {random_index}
''')
).fetchone()
centroid_x,centroid_y=orjson.loads(random_row[1])['coordinates']
bounding_box_polygons = conn.execute(
text(f'''
SELECT id, AsGeoJSON(geom) as geom
FROM geometries
WHERE MbrIntersects(
geom,
BuildMBR(
{centroid_x - half_bbox_size}, {centroid_y - half_bbox_size},
{centroid_x + half_bbox_size}, {centroid_y + half_bbox_size}
)
) AND geometries.ROWID IN (
SELECT ROWID
FROM SpatialIndex
WHERE f_table_name = 'geometries'
AND search_frame = BuildMBR(
{centroid_x - half_bbox_size}, {centroid_y - half_bbox_size},
{centroid_x + half_bbox_size}, {centroid_y + half_bbox_size}
))
''')
).fetchall()
print(len(bounding_box_polygons), end= " " )
, I guess we could create a helper to update the queries for Spatialite.
I took a look at the code base here and its very much not obvious to me how one would even start. Is there someplace i could look to get a general sense of how this would work?
Wanted to quickly report back --- this is performance on a spatialite database with ~1 million rows
TL;DR - spatial index improves performance by ~16x : )
Yeah, spatial indexes are powerful ;-)
I took a look at the code base here and its very much not obvious to me how one would even start. Is there someplace i could look to get a general sense of how this would work?
I think that the first step is to write the 'right' query manually using GeoAlchemy2.
Then if we want to create a function to update the query automatically, we will have to introspect the 'naive' query (especially the query.whereclause
part) to detect if there is a spatial sub-clause in the WHERE clause, then remove this sub-clause and add the optimized sub-query that uses the spatial index. To be honest I don't really know how to do that. I think it's possible to manipulate the query.whereclause.clauses
, which is a kind of list of clauses, to remove what we want. Then we can just do something like new_query = query.where(spatial_index_sub_query)
to add what we want.
Here is quick POC which adds the spatial index subquery. I think it works and I'm actually not sure we want to remove the initial spatial WHERE clauses, since the spatial index is just just a quick approximative filter, so we still need to perform the actual spatial check on the filtered geometries. So maybe this POC is already complete, I'm not sure :smiley: Note that I didn't run any benchmark on this, so maybe it does not work correctly yet.
from functools import partial
from sqlalchemy import Column
from sqlalchemy import Table
from sqlalchemy import create_engine
from sqlalchemy import Integer
from sqlalchemy import MetaData
from sqlalchemy import func
from sqlalchemy import select
from sqlalchemy import text
from sqlalchemy.event import listen
from sqlalchemy.orm import declarative_base
from geoalchemy2 import Geometry
from geoalchemy2 import load_spatialite
db_connection_url = "sqlite:///test_database.sqlite"
engine = create_engine(db_connection_url, echo=True)
load_spatialite_wgs84 = partial(load_spatialite, init_mode="WGS84")
listen(engine, "connect", load_spatialite_wgs84)
conn = engine.connect()
metadata = MetaData()
Base = declarative_base(metadata=metadata)
class Point(Base):
__tablename__ = "point"
id = Column(Integer, primary_key=True)
geom = Column(Geometry(srid=4326, geometry_type="POINT"))
metadata.drop_all(conn, checkfirst=True)
metadata.create_all(conn)
point_spatial_index = Table("SpatialIndex", metadata, autoload_with=conn)
print("=================================================")
print(point_spatial_index)
print("=================================================")
# Insert point
conn.execute(
Point.__table__.insert(),
[
{"geom": "SRID=4326;POINT (0 0)"},
{"geom": "SRID=4326;POINT (1 1)"},
],
)
# Query point
compared_element = "SRID=4326;POLYGON((-1 -1, -1 1, 1 1, 1 -1, -1 -1))"
query = Point.__table__.select().where(Point.__table__.c.geom.ST_Intersects(compared_element), Point.__table__.c.geom.ST_Overlaps(compared_element))
print("#################################################")
print(query)
print("#################################################")
index_query = select(text("ROWID")).where(point_spatial_index.c.f_table_name == Point.__tablename__).where(point_spatial_index.c.search_frame == func.BuildMBR(-10, -10, 10, 10))
print("+++++++++++++++++++++++++++++++++++++++++++++++++")
print(index_query)
print("+++++++++++++++++++++++++++++++++++++++++++++++++")
new_query = query.where(Point.__table__.c.id.in_(index_query))
print("$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$")
print(new_query)
print("$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$")
res = conn.execute(new_query).fetchall()
print("*************************************************")
print(res)
print("*************************************************")
Closing this issue because it has been inactive for a long time.
Describe the bug
I set up a simple sqlachelmy database with a single geometry:
I then query this for an intersection based on an arbitraty bounding box:
This yields the correct results, but takes a very long time (~7 seconds), versus PostGIS (~43ms)
for the engine i set echo = True and see the query:
I copy and paste this into sqlite3 and see that indeed a full scan has taken place:
even though it does indeed think an index exists:
Is this the expected behavior? How can one enforce SpatialIndex to be used?
Optional link from https://geoalchemy-2.readthedocs.io which documents the behavior that is expected
No response
To Reproduce
Error
Additional context
No response
GeoAlchemy 2 Version in Use
0.15.2
Python Version
3.10.12
Operating system
Linux