
Insertion speed of spatial data is 16x slower than spatialite and 10x slower than postgis #127969

Closed — choosehappy closed this issue 2 weeks ago

choosehappy commented 1 month ago

I have a basic use case: a file containing ~88,000 geojson objects that I try to insert into a freshly spun-up cockroachdb docker instance using sqlalchemy.

Notably, I'm using sqlalchemy core to perform the insertion (not the ORM), so there is minimal ORM/connection overhead from sqlalchemy.

This is a beefy server: 251 GB of RAM and 64 cores.

The docker instance is started in single node mode like this:

docker volume create roach1

docker run -d --name=roach1 --hostname=roach1 --net=roachnet -p 26257:26257 -p 8080:8080 -v "roach1:/cockroach/cockroach-data" cockroachdb/cockroach:v23.2.5 start-single-node   --disable-max-offset-check --max-offset=5000ms --advertise-addr=roach1:26357   --http-addr=roach1:8080   --listen-addr=localhost:26257   --sql-addr=roach1:26257   --insecure

And then I run this code in the base operating system:

import json

from sqlalchemy import create_engine, Column, String, Integer, func, event, text
from sqlalchemy.ext.declarative import declarative_base
from sqlalchemy.dialects.postgresql import ARRAY
from sqlalchemy.orm import sessionmaker
from geoalchemy2 import Geometry 
from tqdm import tqdm
from shapely.wkt import dumps
import shapely
from shapely.geometry import shape

# -

from sqlalchemy import create_engine
from sqlalchemy_utils import database_exists, create_database

engine = create_engine('cockroachdb://root@localhost:26257/test1')#,echo=True)
create_database(engine.url)

with open('13_266069_040_003 L02 PAS.json', 'r') as file:
    data = json.load(file)

# Create a base class for our declarative mapping
Base = declarative_base()

# Define your SQLAlchemy model
class GeometryModel(Base):
    __tablename__ = 'geometries'
    id = Column(Integer, primary_key=True)
    name = Column(String)
    geom = Column(Geometry('POLYGON'))

# -

# Create the table
Base.metadata.create_all(engine)

# Start a session
from sqlalchemy.orm import sessionmaker
Session = sessionmaker(bind=engine)
session = Session()

batch_size=10_000
polygons=[]
with  engine.connect() as conn:
    tran = conn.begin()

    for geojson in tqdm(data):
        name = geojson["properties"]["classification"]["name"]
        geometry = json.dumps(geojson["geometry"])

        polygons.append({'name':name,'geom':geometry})

        if len(polygons) == batch_size:
            res=conn.execute(
                text("INSERT INTO geometries (name,geom) VALUES (:name,ST_GeomFromGeoJSON(:geom));"),polygons)

            polygons = []
            tran.commit()
            tran = conn.begin()

    if polygons:
        res=conn.execute(
            text("INSERT INTO geometries (name,geom) VALUES (:name,ST_GeomFromGeoJSON(:geom));"),polygons)

        polygons = []
        tran.commit()

The completion time is 9 minutes 31 seconds

I can confirm the same compute time when using a locally created 3 node cluster, from the tutorial:

docker network create -d bridge roachnet

docker volume create roach1
docker volume create roach2
docker volume create roach3

docker run -d --name=roach1 --hostname=roach1 --net=roachnet -p 26257:26257 -p 8080:8080 -v "roach1:/cockroach/cockroach-data" cockroachdb/cockroach:v23.2.5 start   --advertise-addr=roach1:26357   --http-addr=roach1:8080   --listen-addr=roach1:26357   --sql-addr=roach1:26257   --insecure   --join=roach1:26357,roach2:26357,roach3:26357

docker run -d   --name=roach2   --hostname=roach2   --net=roachnet   -p 26258:26258   -p 8081:8081   -v "roach2:/cockroach/cockroach-data"   cockroachdb/cockroach:v23.2.5 start     --advertise-addr=roach2:26357     --http-addr=roach2:8081     --listen-addr=roach2:26357     --sql-addr=roach2:26258     --insecure     --join=roach1:26357,roach2:26357,roach3:26357

docker run -d   --name=roach3   --hostname=roach3   --net=roachnet   -p 26259:26259   -p 8082:8082   -v "roach3:/cockroach/cockroach-data"   cockroachdb/cockroach:v23.2.5 start     --advertise-addr=roach3:26357     --http-addr=roach3:8082     --listen-addr=roach3:26357     --sql-addr=roach3:26259     --insecure     --join=roach1:26357,roach2:26357,roach3:26357

docker exec -it roach1 ./cockroach sql --host=roach2:26258 --insecure

Associated runtime:

[screenshot]

When performing the same operation but pointing sqlalchemy at a spatialite database [code below], the insertion time is ~35 seconds; when using postgis the insertion time is ~1 minute.

Thanks to sqlalchemy, the only meaningful difference for spatialite is this event to load the extension:

# Initialize Spatialite extension
@event.listens_for(engine, "connect")
def connect(dbapi_connection, connection_record):
    dbapi_connection.enable_load_extension(True)
    dbapi_connection.execute('SELECT load_extension("mod_spatialite")')

and for postgis:

# Initialize PostGIS extension
@event.listens_for(engine, "connect")
def connect(dbapi_connection, connection_record):
    with dbapi_connection.cursor() as cursor:
        cursor.execute('CREATE EXTENSION IF NOT EXISTS postgis;')

Thoughts?

Jira issue: CRDB-40709

blathers-crl[bot] commented 1 month ago

Hello, I am Blathers. I am here to help you get the issue triaged.

It looks like you have not filled out the issue in the format of any of our templates. To best assist you, we advise you to use one of these templates.

I have CC'd a few people who may be able to assist you:

If we have not gotten back to your issue within a few business days, you can try the following:

:owl: Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.

blathers-crl[bot] commented 1 month ago

cc @cockroachdb/cdc

DrewKimball commented 1 month ago

Hi @choosehappy, thanks for the report. Here are some suggestions:

I'd recommend checking the impact of increasing the batch size on the single-node cluster, first.

choosehappy commented 1 month ago

Amazing, thanks for your feedback!

As per the documentation, I went ahead and added these flags to greatly increase RAM + cache:

 --cache=.35 --max-sql-memory=.35 

I can confirm they were taken into account because there is a new warning message in the log:

* WARNING: recommended default value of --max-go-memory (192 GiB) was truncated to 128 GiB, consider reducing --max-sql-memory and / or --cache 
* 

I ran the same test, and while the insertion speed did improve, it didn't do so "meaningfully" - it now takes 7 minutes instead of 9.5 minutes, so it is still an order of magnitude away from expectation.

degree of concurrency

Currently there is no concurrency in any of the approaches (CRDB, PostGIS, spatialite). FYI, when adding concurrency to PostGIS the insertion time drops from 1 minute down to 11 seconds.

I think the issue here is that even a batch of 10,000 takes ~70 seconds. So while I could, e.g., take my 90,000 objects and break them into 9 groups of 10k, even if they all inserted in perfect parallel, the overall insertion time would still be 70 seconds (7x slower than postgis for the same operation).

I reduced the batch size to 1,000 objects, and the insertion takes 6 seconds, which itself seems to point at the root cause of the performance issue? One would need, e.g., 88 parallel insertion statements to get the same throughput as PostGIS.

It's like the insertion itself just appears to be much slower than expected?

Or maybe I'm wrong - what is a reasonable expectation in this use case for insertion speed?

choosehappy commented 1 month ago

Reporting back regarding SERIAL: this does not appear to be the issue either.

I modified the code as follows:

        polygons.append({'id':counter, 'name':name,'geom':geometry})
        counter+=1
        if len(polygons) == batch_size:
            res=conn.execute(
                text("INSERT INTO geometries (id,name,geom) VALUES (:id,:name,ST_GeomFromGeoJSON(:geom));"),polygons)

so that the ID number is provided directly by the client; the performance remained the same (6 minutes 50 seconds).

Any other ideas?

DrewKimball commented 1 month ago

Thanks for trying that out. Those were the obvious things to try, so if we want to dig in further, it would be helpful to gather some additional info. You can collect a tsdump by following the instructions here. This will show various cluster metrics for the collected timeframe. It would also be helpful to collect a statement bundle, which you can do by running the statement using EXPLAIN ANALYZE (DEBUG) (docs here). You can use this link to send that information to me privately.
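
For example, a minimal sketch of collecting a bundle from the same Python script (the inserted values below are just placeholders):

from sqlalchemy import text

with engine.connect() as conn:
    # EXPLAIN ANALYZE (DEBUG) executes the statement and builds a diagnostics bundle;
    # the returned rows include instructions for downloading the bundle zip.
    res = conn.execute(text(
        "EXPLAIN ANALYZE (DEBUG) "
        "INSERT INTO geometries (name, geom) VALUES "
        "('example', ST_GeomFromGeoJSON('{\"type\": \"Polygon\", "
        "\"coordinates\": [[[0, 0], [0, 1], [1, 1], [0, 0]]]}'))"
    ))
    for row in res:
        print(row[0])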

WRT SERIAL, it would only (potentially) be a problem once you introduced concurrency - changing it shouldn't have any impact without concurrency. For the record though, using a counter runs into the same sort of issue, where concurrent queries would write to the same range, causing a workload hotspot. A better option would be to have CRDB automatically generate a randomized ID for you. Again, this is only an issue in the presence of concurrency, and in particular, when the cluster has multiple nodes.
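
For example, a sketch of the model with a database-generated key (assuming a UUID primary key is acceptable for this table):

from sqlalchemy import Column, String, text
from sqlalchemy.dialects.postgresql import UUID
from sqlalchemy.ext.declarative import declarative_base
from geoalchemy2 import Geometry

Base = declarative_base()

class GeometryModel(Base):
    __tablename__ = 'geometries'
    # gen_random_uuid() yields randomized keys, so concurrent inserts spread across
    # ranges instead of piling onto a single hot range.
    id = Column(UUID(as_uuid=True), primary_key=True,
                server_default=text("gen_random_uuid()"))
    name = Column(String)
    geom = Column(Geometry('POLYGON'))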

Or maybe I'm wrong - what is a reasonable expectation in this use case for insertion speed?

As far as expectations go, it's difficult to say until we gather more information. While I don't expect a performance gap of this size, this test setup is somewhat atypical for a workload on CRDB - generally, you would expect many concurrent requests load-balanced across the nodes in the cluster. So my intuition may not apply to this case.

I should also mention before taking this discussion too far that we have IMPORT INTO, which is specifically intended for bulk-loading data into a table (docs here, performance tips here).
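
For illustration, a rough sketch of what that could look like, assuming the rows were first staged as a CSV file (hypothetically named geometries.csv) in the node's extern directory:

from sqlalchemy import create_engine, text

engine = create_engine('cockroachdb://root@localhost:26257/test1')

# IMPORT INTO cannot run inside an explicit transaction, so use autocommit.
with engine.connect() as conn:
    conn = conn.execution_options(isolation_level="AUTOCOMMIT")
    conn.execute(text(
        "IMPORT INTO geometries (id, name, geom) "
        "CSV DATA ('nodelocal://1/geometries.csv')"
    ))
    # Check the docs for the geometry encoding that CSV IMPORT expects (e.g. WKT vs. EWKB).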

choosehappy commented 1 month ago

Thanks for your feedback. Can I quickly confirm that tsdump is expected to work in a single-node setup?

I've done this to start up a single node in a docker container, as above:

docker run -d --name=roach1 --hostname=roach1 --net=roachnet -p 26257:26257 -p 8080:8080 -v `pwd`:/out  cockroachdb/cockroach:v23.2.5 start-single-node   --disable-max-offset-check --max-offset=5000ms --advertise-addr=roach1:26357   --http-addr=roach1:8080   --listen-addr=localhost:26257   --sql-addr=roach1:26257   --insecure --cache=.35 --max-sql-memory=.35 

which seems to work as expected

I then enter the docker container:

docker exec -it roach1 bash

and try to execute the tsdump command but receive an error:

[root@roach1 cockroach]# cockroach debug tsdump --format=raw --insecure > tsdump.gob
ERROR: cannot dial server.
Is the server running?
If the server is running, check --host client-side and --advertise server-side.

failed to connect to `host=localhost user=root database=system`: failed to receive message (read tcp 127.0.0.1:53726->127.0.0.1:26257: read: connection reset by peer)
Failed running "debug tsdump"

I also tried explicitly adding a host and port, with no success:


[root@roach1 cockroach]# cockroach debug tsdump --format=raw --insecure --host=roach1:26257 > tsdump.gob
E240806 15:35:58.966462 105 2@rpc/peer.go:601  [rnode=?,raddr=roach1:26257,class=system,rpc] 1  failed connection attempt (last connected 0s ago): grpc: connection error: desc = "error reading server preface: read tcp 172.18.0.2:36114->172.18.0.2:26257: read: connection reset by peer" [code 14/Unavailable]
ERROR: server closed the connection.
Is this a CockroachDB node?
initial connection heartbeat failed: grpc: connection error: desc = "error reading server preface: read tcp 172.18.0.2:36114->172.18.0.2:26257: read: connection reset by peer" [code 14/Unavailable]
Failed running "debug tsdump"
[root@roach1 cockroach]#

I would think something was wrong, but everything seems to be working correctly:

[root@roach1 cockroach]# cockroach sql --host=roach1:26257 --insecure
#
# Welcome to the CockroachDB SQL shell.
# All statements must be terminated by a semicolon.
# To exit, type: \q.
#
# Server version: CockroachDB CCL v23.2.5 (x86_64-pc-linux-gnu, built 2024/05/04 00:05:56, go1.21.9 X:nocoverageredesign) (same version as client)
# Cluster ID: 76d04be1-eae0-48a2-9474-3e505088a83b
#
# Enter \? for a brief introduction.
#
root@roach1:26257/defaultdb> create database test1;
CREATE DATABASE

Time: 36ms total (execution 35ms / network 0ms)

root@roach1:26257/defaultdb>

choosehappy commented 1 month ago

Just sent you the explain analyze (debug) output - that worked no problem

I should also mention before taking this discussion too far that we have IMPORT INTO, which is specifically intended for bulk-loading data into a table (docs here, performance tips here).

Thanks, appreciate it, but it doesn't quite fit our use case. In the end, we're building a web-based UI where users will be submitting 1000s of polygons via REST that we will want to insert into CRDB.

As I understand it, IMPORT INTO requires an absolute path to a file.

choosehappy commented 1 month ago

I also bundled together and sent over the minimal reproducible code snippet, along with an example of the data being used

DrewKimball commented 3 weeks ago

and try to execute the tsdump command but receive an error:

I see a recently fixed bug that may be causing those tsdump errors: https://github.com/cockroachdb/cockroach/issues/125315. The fix for that is included in v23.2.9, while it looks like you're on v23.2.5. As a workaround, it looks like you might be able to use the default ports.

Just sent you the explain analyze (debug) output - that worked no problem

I took a look at this - the issue is that the bundle is gathered on the very first run of the query, so it includes start-up costs that aren't representative of the average performance. It might be easier for you to gather the bundle through the DB console during a workload run: https://www.cockroachlabs.com/docs/stable/ui-statements-page#diagnostics

By the way, I was able to run your code snippet + example data locally on a single node on my m2 machine. With --cache set to 8GiB, the test completed in ~64 seconds. Was the example data you provided the same as that from your testing where CRDB performed poorly?

choosehappy commented 3 weeks ago

Interesting!

Yes, that is the same data + code that I'm using. Very curious!

In line with your comment, I changed my docker command to use the "latest" tag and set --cache equal to 8GiB; this is my new command:

docker run -d --name=roach1 --hostname=roach1 --net=roachnet -p 26257:26257 -p 8080:8080 -v `pwd`:/out  cockroachdb/cockroach:latest start-single-node   --disable-max-offset-check --max-offset=5000ms --advertise-addr=roach1:26357   --http-addr=roach1:8080   --listen-addr=localhost:26257   --sql-addr=roach1:26257   --insecure --cache=8GiB 

The cache command seems to have taken hold and is shown in iotop:

[screenshot]

Running the same code has the same timings as before:

[screenshot]

This is what the system looks like while doing the insert:

[screenshot]

In case this triggers any ideas?

Good news is that the tsdump works in this version - I've just uploaded it.

Something curious seems to be happening here - if I look at the web admin console for the statement, it seems to think these statements are occurring in ms? But the actual wall time tells a different story?

[screenshot]

I guess, though, if I look here:

[screenshot]

it would suggest that it is running 80k queries (a separate insert for each), with the time being 3.6ms? But the p99 and max latency:

[screenshot]

when doing some multiplication by 80, they can quickly add up to the times I'm seeing?

Another piece of evidence: the mean transaction time is 44.5s for 10k rows, so 44s * ~8 sets of 10k rows = ~6 minutes.

[screenshot]

I just sent you the statement bundle as well, produced by the provided code.

Any ideas?

DrewKimball commented 3 weeks ago

I just sent you the statement bundle as well, produced by the provided code.

To be clear, was this statement bundle gathered against the same node that had just run the workload, with no restart?

choosehappy commented 3 weeks ago

yup, same node, same code, same data

DrewKimball commented 3 weeks ago

Would you be willing to manually collect a statement bundle during an ongoing workload run through the DB console? You can just select Trace and collect diagnostics on the next statement execution to grab one for a random query. Both the bundles you've gathered so far are showing a lot of time spent in work that should only happen once, not per-query. If that work is actually happening often, that's very interesting.

This is what the system looks like while doing the insert:

Am I reading correctly that there are many active CRDB instances on your machine? Or is that view showing historical information as well? I'd expect the single node to be created once and reused for the entire test.

choosehappy commented 3 weeks ago

Would you be willing to manually collect a statement bundle during an ongoing workload run through the DB console?

This is indeed how I generated your statement bundle. I let it insert 80k rows; the insert statement showed up in the DB console, and I then clicked Trace and collect diagnostics on the next statement execution. I then reran my jupyter notebook, the diagnostics collection triggered successfully, and I downloaded that statement bundle.

Am I reading correctly that there are many active CRDB instances on your machine? Or is that view showing historical information as well? I'd expect the single node to be created once and reused for the entire test.

There is only a single CRDB instance, created using the docker command I showed above. I've done literally nothing else after spinning up that docker instance besides going into my jupyter notebook and running the code.

I'm surprised that this surprised you - it seems to suggest that this is where the problem is. I don't know the internals of how CRDB works, but it sounds like a new worker process is getting launched for every batch insertion, resulting in a number of worker processes as well as recomputation of the work that should only happen once?

Is something like that a possibility?

DrewKimball commented 3 weeks ago

There is only a single CRDB instance, created using the docker command I showed above. I've done literally nothing else after spinning up that docker instance besides going into my jupyter notebook and running the code.

I'm surprised that this surprised you - it seems to suggest that this is where the problem is. I don't know the internals of how CRDB works, but it sounds like a new worker process is getting launched for every batch insertion, resulting in a number of worker processes as well as recomputation of the work that should only happen once?

It looks like this is just a difference between htop on linux vs mac. We indeed expect there to be an OS thread per vCPU - but linux displays them separately, while htop on macOS shows just one entry for the CRDB server.

This is indeed how I generated your statement bundle. I let it insert 80k rows; the insert statement showed up in the DB console, and I then clicked Trace and collect diagnostics on the next statement execution. I then reran my jupyter notebook, the diagnostics collection triggered successfully, and I downloaded that statement bundle.

The issue with doing this is that you're going to collect the bundle on the first query of that re-run, which may be special, and not representative. I'm hoping to see a few bundles for "mundane"/average executions of the query, which happen sometime after your script begins issuing queries.

I did have a chance to take a look at the tsdump as well. We don't seem to be bound on any resource, which is expected, since the workload isn't generating any concurrency. I did notice that the average WAL fsync latency is nearly as high as the average query latency:

[screenshot: WAL fsync latency]

For context, the average query latency alone for my local tests was ~500μs. Given the artificial latency-bound nature of this test, we may not be able to drive much better throughput on this hardware without introducing concurrency. I'd be curious to know what Postgres is doing to apparently avoid paying that fsync latency cost. Are you running Postgres with default settings?

tbg commented 3 weeks ago

Hello @choosehappy, I've looked at this a bit with @DrewKimball and what sticks out to me is the --max-offset=5000ms flag. This explains the weird lease behavior we're seeing in the statement bundle: leases aren't usable for the last max-offset of their lifetime, so with 5s of lifetime deducted there simply isn't any lifetime left and the system enters a weird state. I'm actually surprised it worked at all.

Please simply remove this entire option (keep --disable-max-offset-check if you like) and try again. I'll file an issue to consider a better guardrail against this misconfiguration.

choosehappy commented 3 weeks ago

Thanks for your feedback @tbg

Those parameters were left over from my multi-node trials, which were all taking place on the same machine, so things got a bit... slow : )

Here is the same output with the max-offset and disable flags removed; you can also see the commands on the right:

[screenshot]

choosehappy commented 3 weeks ago

For postgis, really nothing special at all, and the same code (thanks to sqlalchemy!):

docker run -p 5333:5432 --name postgis -e POSTGRES_HOST_AUTH_METHOD=trust  -d postgis/postgis

and then run the same code:

# ---
# jupyter:
#   jupytext:
#     formats: ipynb,py:light
#     text_representation:
#       extension: .py
#       format_name: light
#       format_version: '1.5'
#       jupytext_version: 1.16.3
#   kernelspec:
#     display_name: Python 3 (ipykernel)
#     language: python
#     name: python3
# ---

# +
from sqlalchemy import create_engine, Column, String, Integer, func, event, text
from sqlalchemy.ext.declarative import declarative_base
from sqlalchemy.dialects.postgresql import ARRAY
from sqlalchemy.orm import sessionmaker
from geoalchemy2 import Geometry 
from tqdm import tqdm
from shapely.wkt import dumps

import orjson

# +
# %%time
with open('13_266069_040_003 L02 PAS.json', 'r') as file:
#with open('/mnt/c/research/kidney/15_26609_024_045 L03 PAS.json', 'r') as file:
    # Load the JSON data into a Python dictionary
    data = orjson.loads(file.read())

import shapely
from shapely.geometry import shape
# -

data[0]

# +
# Create a base class for our declarative mapping
Base = declarative_base()

# Define your SQLAlchemy model
class GeometryModel(Base):
    __tablename__ = 'geometries'
    id = Column(Integer, primary_key=True)
    name = Column(String)
    geom = Column(Geometry('POLYGON'))

# +
from sqlalchemy_utils import database_exists, create_database
engine = create_engine('postgresql://postgres@localhost:5333/test1')#,echo=True)

print(engine.url)
try:
    create_database(engine.url)
    print("created")
except:
    print("errored")
    pass

# -

# Initialize PostGIS extension
@event.listens_for(engine, "connect")
def connect(dbapi_connection, connection_record):
    with dbapi_connection.cursor() as cursor:
        cursor.execute('CREATE EXTENSION IF NOT EXISTS postgis;')

# Create the table
Base.metadata.create_all(engine)

from tqdm import tqdm

# %%time

batch_size=5_000
polygons=[]
with  engine.connect() as conn:
    tran = conn.begin()

    for geojson in tqdm(data):
        name = geojson["properties"]["classification"]["name"]
        geometry = orjson.dumps(geojson["geometry"]).decode('ascii')

        polygons.append({'name':name,'geom':geometry})

        if len(polygons) == batch_size:
            res=conn.execute(
                text("INSERT INTO geometries (name,geom) VALUES (:name,ST_GeomFromGeoJSON(:geom));"),polygons)

            polygons = []
            tran.commit()
            tran = conn.begin()

    if polygons:
        res=conn.execute(
            text("INSERT INTO geometries (name,geom) VALUES (:name,ST_GeomFromGeoJSON(:geom));"),polygons)

        polygons = []
        tran.commit()

# %%time
with  engine.connect() as conn:
    res=conn.execute(text("select ST_AsGeoJSON(ST_Centroid(geom)) as centroid from geometries"))
    centroids=res.fetchall()

[screenshot]

choosehappy commented 3 weeks ago

I uploaded a few more bundles - I tried to address your points. What I did was run a batch of inserts; the insert then shows up in the UI; I click "store on next run"; I insert the next batch, click "store on next run" again, run the next batch, and so on, 4 times (there are 4 bundles uploaded). Not sure there is anything else that can be done?

DrewKimball commented 3 weeks ago

Those bundles look a lot better now that max-offset is fixed. CRDB isn't doing anything unexpected here, and the insert itself seems about as well-optimized as it can be. I expect that the overhead is due to a few factors:

In general, CRDB prefers to deal with many concurrent requests, rather than optimizing the latency of an individual request. If you want to saturate the hardware and see the maximum possible performance for this workload, you'll have to split up the work among multiple simultaneous connections. Does that make sense?
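
Roughly, the kind of thing I mean - just a sketch; the batch size, pool size, and worker count are illustrative and would need tuning:

from concurrent.futures import ThreadPoolExecutor

from sqlalchemy import create_engine, text

engine = create_engine('cockroachdb://root@localhost:26257/test1',
                       pool_size=16, max_overflow=0)

INSERT_SQL = text(
    "INSERT INTO geometries (name, geom) VALUES (:name, ST_GeomFromGeoJSON(:geom))"
)

def insert_batch(batch):
    # Each worker checks out its own pooled connection and commits its own batch.
    with engine.begin() as conn:
        conn.execute(INSERT_SQL, batch)

def insert_concurrently(rows, batch_size=1_000, workers=16):
    batches = [rows[i:i + batch_size] for i in range(0, len(rows), batch_size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        list(pool.map(insert_batch, batches))

# e.g. insert_concurrently(polygons), where polygons is the full list of
# {'name': ..., 'geom': ...} dicts from your script.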

choosehappy commented 3 weeks ago

Yup, that makes sense

I have a version of that as well, which uses the Ray framework to submit all batches at the same time, thus greatly improving concurrency.

I swear this is true --- when I do that with PostGIS, I can insert 1 million records in 28.7s : )

The challenge that I see here is that with CRDB, a single batch (10k rows, as shown above) takes 48s - so while using more concurrency will likely increase the insert row throughput, my assumption is that it can't ever be faster than 48s, i.e. the fastest single insert time. This also means that for a 48s run time, 100 batches of 10k rows would have to happen concurrently, instantaneously, on a server with only e.g. 64 processes - which seems unlikely. What's more likely, I would guess, is a run time of at least 48 * 2 = 96 seconds (one set of 64 batches for 48 seconds and then one set of 36 for another 48 seconds).

All this assumes no actual overhead - so it is a very generous minimal estimate.

Does that math check out for you?

I can run it through tomorrow if you're curious; Ray makes it so that it's genuinely only a few lines of additional code (amazing framework btw - I use it all the time and wrote a series of blog posts about it if you're interested: https://andrewjanowczyk.com/ray-an-open-source-api-for-easy-scalable-distributed-computing-in-python-part-1-local-scaling/ )

DrewKimball commented 3 weeks ago

The challenge that I see here is that with CRDB, a single batch (10k rows, as shown above) takes 48s - so while using more concurrency will likely increase the insert row throughput, my assumption is that it can't ever be faster than 48s, i.e. the fastest single insert time.

That's true, though only if the batch size is 10k rows. I'd recommend splitting into something more like 1k rows, and then trying to find whatever concurrency level brings the best throughput. 64 is a good place to start, but I wouldn't be surprised if you get the best improvement in throughput with something a bit larger, even if it hurts individual request latency. That way, the CPU always has something to work on even while we're writing to disk and such. I doubt scaling will be linear up to 64 CPUs, but there should still be some good improvement.

Another thing that occurs to me - you can probably alleviate a significant amount of the overhead by combining these batches into a single statement. I'm not sure how feasible that is with your framework, but it would change a lot of executed code from O(rows) -> O(batches).

choosehappy commented 3 weeks ago

Thanks for your feedback, I'll give it a try tomorrow and report back.

Regarding your last suggestion - I'm not sure I totally understand. My insert statement is only being executed once per batch - all of the polygons are being fed as a list of dictionaries, so in the case below, len(polygons) == batch_size.

Is there a way to combine them even further?

for geojson in tqdm(data):
        name = geojson["properties"]["classification"]["name"]
        geometry = orjson.dumps(geojson["geometry"]).decode('ascii')

        polygons.append({'name':name,'geom':geometry})

        if len(polygons) == batch_size:
            res=conn.execute(
                text("INSERT INTO geometries (name,geom) VALUES (:name,ST_GeomFromGeoJSON(:geom));"),polygons)

            polygons = []
            tran.commit()
            tran = conn.begin()

DrewKimball commented 3 weeks ago

This line:

res=conn.execute(text("INSERT INTO geometries (name,geom) VALUES (:name,ST_GeomFromGeoJSON(:geom));"),polygons)

appears to generate individual SQL statements for each of the inserted rows. Here's an example from one of the statement bundles:

INSERT
INTO
  geometries (id, name, geom)
VALUES
  (
    30001,
    'tubule',
    st_geomfromgeojson('{"type": "Polygon", "coordinates": ... ]]}':::STRING)
  )

What I mean by batching is that you could insert (say) 1000 rows in a single INSERT:

INSERT INTO geometries (id, name, geom) (VALUES
  (...),
  (...),
  (...),
  (...),
  (...),
  (...),
  -- And so on up to 1k rows.
);

It looks like SQLAlchemy can be made to do this: https://docs.sqlalchemy.org/en/20/orm/queryguide/dml.html#orm-enabled-insert-update-and-delete-statements

choosehappy commented 3 weeks ago

Interesting --- Is it CRDB doing individual statements? That may be the cause of this : )

SQLAlchemy is supposed to do this on our behalf; if you look here:

https://docs.sqlalchemy.org/en/14/tutorial/dbapi_transactions.html#tutorial-multiple-parameters

[screenshot]

As an example, if I use a sqlite3 database, it does indeed perform only a single insert statement with multiple values, as you've shown:

# ---

# +
from sqlalchemy import create_engine, Column, String, Integer, func, event, text
from sqlalchemy.ext.declarative import declarative_base
from sqlalchemy.dialects.postgresql import ARRAY
from sqlalchemy.orm import sessionmaker
from geoalchemy2 import Geometry 
from tqdm import tqdm
from shapely.wkt import dumps

# +
# Create a base class for our declarative mapping
Base = declarative_base()

# Define your SQLAlchemy model
class GeometryModel(Base):
    __tablename__ = 'geometries'
    id = Column(Integer, primary_key=True)
    name = Column(String)
    name2 = Column(String)

# +
from sqlalchemy_utils import database_exists, create_database
engine = create_engine('sqlite:///test1.db',echo=True)

print(engine.url)
try:
    create_database(engine.url)
    print("created")
except:
    print("errored")
    pass

# -

Base.metadata.create_all(engine)

import string
import random
def random_string(length=10):
    """Generate a random string of fixed length."""
    letters = string.ascii_letters
    return ''.join(random.choice(letters) for _ in range(length))

# +
# Using the engine to connect to the database
polygons = []
batch_size=10
with engine.connect() as conn:
    tran = conn.begin()

    for _ in range(100):
        name = random_string()  # Generate a random string for 'name'
        name2 = random_string() # Generate a random string for 'name2'

        polygons.append({'name': name, 'name2': name2})

        if len(polygons) == batch_size:
            res = conn.execute(
                text("INSERT INTO geometries (name, name2) VALUES (:name, :name2);"),
                polygons
            )
            polygons = []
            tran.commit()
            tran = conn.begin()

Yields this output:

[screenshot]

which is in fact batched as expected into a single insert statement

choosehappy commented 3 weeks ago

By the way, when changing the engine to CRDB

engine = create_engine('cockroachdb://root@localhost:26257/test1',echo=True)

the output is similar (except in the CRDB dialect) -- I see a single insert statement being executed for multiple values:


2024-08-15 10:12:23,205 INFO sqlalchemy.engine.Engine BEGIN (implicit)
2024-08-15 10:12:23,207 INFO sqlalchemy.engine.Engine INSERT INTO geometries (name, name2) VALUES (%(name)s, %(name2)s);
2024-08-15 10:12:23,208 INFO sqlalchemy.engine.Engine [generated in 0.00085s] [{'name': 'ZrSsaaJxMm', 'name2': 'vljuDAWtES'}, {'name': 'OxLiKuNKLG', 'name2': 'uwbWedHKMx'}, {'name': 'fLFYHukSus', 'name2': 'qWsDrOolKl'}, {'name': 'SLAnqjGvgE', 'name2': 'FTzlkcKLEp'}, {'name': 'lvbSecgmbh', 'name2': 'BUzUciWjSm'}, {'name': 'UdYRAJjmRE', 'name2': 'dNHqFwApiY'}, {'name': 'tqPaoMIrlw', 'name2': 'DMChCCbtIH'}, {'name': 'jvbFoRXXId', 'name2': 'mJIsXHpOdX'}, {'name': 'ENlyiVHSFy', 'name2': 'WLCuFEyxUL'}, {'name': 'uPzktNQzpH', 'name2': 'qdosnEsRpm'}]
2024-08-15 10:12:23,221 INFO sqlalchemy.engine.Engine COMMIT
2024-08-15 10:12:23,236 INFO sqlalchemy.engine.Engine BEGIN (implicit)
2024-08-15 10:12:23,238 INFO sqlalchemy.engine.Engine INSERT INTO geometries (name, name2) VALUES (%(name)s, %(name2)s);
2024-08-15 10:12:23,239 INFO sqlalchemy.engine.Engine [cached since 0.03218s ago] [{'name': 'BXgWqwWUGd', 'name2': 'rRIPkIQBRI'}, {'name': 'dOtkRgtSQY', 'name2': 'qnlFsdaQnW'}, {'name': 'CLJHfGkUjj', 'name2': 'fWnuPBPxCt'}, {'name': 'WZcxmrNwWP', 'name2': 'MuruHpiZGy'}, {'name': 'MuBqyUiBZe', 'name2': 'sGfbRacLAi'}, {'name': 'TSjgtGMEor', 'name2': 'gtPqIzjFet'}, {'name': 'WEaSBUEfsd', 'name2': 'pJxhLBUikF'}, {'name': 'GzvTuBaRuq', 'name2': 'GgAoYxMQdp'}, {'name': 'CfRUSwLHog', 'name2': 'HxmqoUeZrX'}, {'name': 'LypgNsMJqH', 'name2': 'XVXoaKwloK'}]
2024-08-15 10:12:23,267 INFO sqlalchemy.engine.Engine COMMIT
2024-08-15 10:12:23,282 INFO sqlalchemy.engine.Engine BEGIN (implicit)
2024-08-15 10:12:23,283 INFO sqlalchemy.engine.Engine INSERT INTO geometries (name, name2) VALUES (%(name)s, %(name2)s);
2024-08-15 10:12:23,284 INFO sqlalchemy.engine.Engine [cached since 0.07656s ago] [{'name': 'MOMsYdbemJ', 'name2': 'FfLujigiMZ'}, {'name': 'GGlFEbCGrl', 'name2': 'xnrXWXxBUw'}, {'name': 'suaMiTFESX', 'name2': 'ZyHiCvEMwV'}, {'name': 'YbEjyBLWXN', 'name2': 'HseKBMwlBj'}, {'name': 'YmDgfSfKwn', 'name2': 'aTrRbFqfPK'}, {'name': 'KwyynHHBWL', 'name2': 'VBfConTBik'}, {'name': 'ORYIrhKPfH', 'name2': 'sFSPWtxXob'}, {'name': 'PtdrrPLqnc', 'name2': 'WgwqDvnvyz'}, {'name': 'zbykAubNvN', 'name2': 'jwEkfHKlTH'}, {'name': 'qzPDJuHJrh', 'name2': 'qHbGnmPiRb'}]

So I'm wondering - is sqlalchemy trying to send a bulk insert, but the CRDB dialect is switching it into individual statements (which for sure is more expensive)? Do the stmt bundles show something different?

DrewKimball commented 3 weeks ago

You might be right that the CockroachDB dialect is missing some optimizations that others have. That being said, I believe if you use the insert construct mentioned here instead of text, SQLAlchemy will generate a single statement for you.
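
Something like this, roughly (a sketch reusing your GeometryModel; the 1,000-row slices are illustrative):

from sqlalchemy import insert, func

def insert_batch_single_statement(conn, batch):
    # .values() with a list of dicts renders one multi-row INSERT ... VALUES (...), (...), ...
    rows = [{'name': p['name'], 'geom': func.ST_GeomFromGeoJSON(p['geom'])}
            for p in batch]
    conn.execute(insert(GeometryModel.__table__).values(rows))

# Commit per batch, mirroring your current loop.
for i in range(0, len(polygons), 1_000):
    with engine.begin() as conn:
        insert_batch_single_statement(conn, polygons[i:i + 1_000])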

choosehappy commented 3 weeks ago

Interesting - now we're getting somewhere!

In this simple test, the ORM version takes 1.43s vs Core 18.6s

this is the ORM version:

%%time
# ORM

# Start a session
from sqlalchemy.orm import sessionmaker
Session = sessionmaker(bind=engine)
session = Session()

polygons = []
batch_size=1_000
with engine.connect() as conn:
    tran = conn.begin()

    for _ in range(10_000):
        name = random_string()  # Generate a random string for 'name'
        name2 = random_string() # Generate a random string for 'name2'

        polygons.append(GeometryModel(name=name, name2=name2))

        if len(polygons) == batch_size:
            session.bulk_save_objects(polygons)
            session.commit()
            polygons.clear()  # Clear the list for the next batch

session.close()

versus the Core version:

%%time
# Core
polygons = []
batch_size=1_000
with engine.connect() as conn:
    tran = conn.begin()

    for _ in range(10_000):
        name = random_string()  # Generate a random string for 'name'
        name2 = random_string() # Generate a random string for 'name2'

        polygons.append({'name': name, 'name2': name2})

        if len(polygons) == batch_size:
            res = conn.execute(
                text("INSERT INTO geometries (name, name2) VALUES (:name, :name2);"),
                polygons
            )
            polygons = []
            tran.commit()
            tran = conn.begin()

And looking at the commands, they have indeed changed:

The ORM version explicitly shows insertmanyvalues:

2024-08-15 13:38:08,355 INFO sqlalchemy.engine.Engine INSERT INTO geometries (name, name2) VALUES (%(name__0)s, %(name2__0)s), (%(name__1)s, %(name2__1)s), (%(name__2)s, %(name2__2)s), (%(name__3)s, %(name2__3)s), (%(name__4)s, %(name2__4)s), (%(name__5)s, %(name2__5)s), (%(name__6)s, %(name2__6)s), (%(name__7)s, %(name2__7)s), (%(name__8)s, %(name2__8)s), (%(name__9)s, %(name2__9)s)
2024-08-15 13:38:08,356 INFO sqlalchemy.engine.Engine [generated in 0.00025s (insertmanyvalues) 1/1 (unordered)] {'name__0': 'okgtuQUOdV', 'name2__0': 'ZfaLpGFFnL', 'name__1': 'yvoNyOEEoK', 'name2__1': 'BtoKVLVgGe', 'name__2': 'snHSyARQYV', 'name2__2': 'mhknJZnNIu', 'name__3': 'IrNzqaftFz', 'name2__3': 'sbeeCOZzMP', 'name__4': 'rerzyuNLRP', 'name2__4': 'ZfKJJvOavg', 'name__5': 'RZVSsGmLPj', 'name2__5': 'zqGljuxWHz', 'name__6': 'nsUEVOodIy', 'name2__6': 'wiTUyjyxed', 'name__7': 'tUMqaZTAyi', 'name2__7': 'IeGKwOfDFl', 'name__8': 'hwttLtNDuv', 'name2__8': 'bnWEhMcXLj', 'name__9': 'IoJJPxHNjm', 'name2__9': 'wBPxLHwHIk'}

versus the Core version:

2024-08-15 13:38:14,786 INFO sqlalchemy.engine.Engine INSERT INTO geometries (name, name2) VALUES (%(name)s, %(name2)s);
2024-08-15 13:38:14,788 INFO sqlalchemy.engine.Engine [generated in 0.00252s] [{'name': 'CdvPixMuwU', 'name2': 'hLyWZaaQkb'}, {'name': 'EUChSOpFQm', 'name2': 'bNArJNpYWl'}, {'name': 'gzconipVlF', 'name2': 'fDxWyVIcqr'}, {'name': 'TqSaqiwsKj', 'name2': 'aYMtSPkaPh'}, {'name': 'RnQOAPzMOt', 'name2': 'VGDYYmoBji'}, {'name': 'HOWEFOQuRY', 'name2': 'YobOnnwpEe'}, {'name': 'CfECmnyqKZ', 'name2': 'JHrTibPwAX'}, {'name': 'nXYxvWtwjD', 'name2': 'cpGCgULwZE'}, {'name': 'BzZsajTVNf', 'name2': 'drTqmpWoSC'}, {'name': 'AEQhXpLrIw', 'name2': 'wLAXkUbfHf'}]

choosehappy commented 3 weeks ago

Okay, going back to the original use case of inserting 88k rows, I've compared CRDB with PostGIS using both ORM and Core; the results are below.

| Database | Method | Rows | Time |
| --- | --- | --- | --- |
| Cockroach | ORM | 88k | 1m 40s |
| Cockroach | Core | 88k | 6m 37s |
| PostGIS | ORM | 88k | 43s |
| PostGIS | Core | 88k | 1m 9s |

So:

  1. CRDB ORM is indeed ~6x faster than using Core directly, but is ~2.3x slower than PostGIS ORM
  2. CRDB ORM is about the same as PostGIS Core
  3. CRDB Core is also ~6x slower than PostGIS Core

2.3x is pretty reasonable in this situation given all the added functionality + scalability of CRDB

choosehappy commented 3 weeks ago

Reporting multi-processing results, here looking at inserting 1 million rows (basically 12 iterations of the 88k dataset):

| Database | Method | Rows | Time |
| --- | --- | --- | --- |
| Cockroach | ORM | ~1M | 3m 14s |
| Cockroach | Core | ~1M | 2m 43s |
| PostGIS | ORM | ~1M | 53s |
| PostGIS | Core | ~1M | 28.7s |

Interestingly, we see that in this mode it actually takes longer to use the ORM than Core, likely because there is more overhead per work package (e.g., setting up the class + performing the DB mapping).

As expected - noting that here we're using 1 million rows - the throughput is indeed increased: based on previous timings, inserting 1 million rows would have taken ~18 minutes instead of the 3 minutes reported here, so a 6x speed improvement - but lower than I think we were expecting, in particular because CRDB is definitely working hard : )

[screenshot]

Any thoughts?

DrewKimball commented 3 weeks ago

It might be possible to improve things in one or the other case by tuning batch size + concurrency, but this seems like a pretty good state. Returns beyond this point will likely be diminishing. Is it alright with you to close out the issue?

choosehappy commented 3 weeks ago

Sure. I'm putting together a more detailed + organized blog post comparing different DBs for our use case as part of the scalable version of our open-source tool https://github.com/choosehappy/QuickAnnotator. I'll drop that link here when I'm done, for your reference and the posterity of whoever reads this thread next : )
