Open ceymard opened 3 years ago
There are two parts to this:

- the execute call, which indeed causes duckdb to perform the query in another thread, unblocking the event loop
- and the fetch call (which the streaming API uses), which retrieves data from duckdb, and this happens on the main thread

Query execution can be long; I don't think that starving the event loop for such massive amounts of time is a good thing. But also, I don't see how doing so would "vastly" improve performance, since you're scheduling the long-running operation to be processed outside the main thread only once.
The gains come mostly in the case where you exchange lots of successive queries with the database; for instance, hundreds or thousands of consecutive updates that depend on the results of select queries.
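The overhead being described can be sketched with a hypothetical microbenchmark (the "queries" here are stand-ins, not real database calls): when each near-instant local operation is deferred to the event loop, the per-call scheduling cost dominates.

```javascript
// Sketch only: tinyQuerySync / tinyQueryAsync are hypothetical stand-ins for a
// small, local, near-instant query (e.g. an indexed UPDATE or SELECT).
function tinyQuerySync(i) {
  return i * 2; // completes immediately on the calling thread
}

function tinyQueryAsync(i) {
  // Defers completion through the event loop, as an async driver does per call.
  return new Promise((resolve) => setImmediate(() => resolve(i * 2)));
}

async function bench(n) {
  let t = Date.now();
  for (let i = 0; i < n; i++) tinyQuerySync(i);
  const syncMs = Date.now() - t;

  t = Date.now();
  for (let i = 0; i < n; i++) await tinyQueryAsync(i);
  const asyncMs = Date.now() - t;

  // asyncMs is typically far larger: the work is trivial, the scheduling is not.
  console.log({ syncMs, asyncMs });
}

bench(10_000);
```

This is the shape of workload where better-sqlite3's synchronous API wins: thousands of dependent round-trips, each too cheap to justify an event-loop hop.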
Look at the benchmarks (and what is being benchmarked) from the better-sqlite3 library: https://github.com/JoshuaWise/better-sqlite3
> The way I look at it is that the asynchronous API is really only useful […]
I think asynchronous is useful but probably not the main approach I'd be using when dealing with a database meant for processing data and not simply storing it.
I have scripts that handle big loads of data but don't process anything in parallel. In that case, I don't mind blocking the process, since it would just be one async function waiting on the result of execute anyway. I also have scripts that do a bunch of individual updates in quick sequence, in which case I do appreciate the speedup from not deferring to the event loop an operation that is almost instant, since everything is local.
Anyway, I think there is a place for both approaches, and I'm also pretty certain that synchronous access to a local database is much more efficient in many (if not most) of the scenarios this particular database will end up being used in.
On Wed, May 19, 2021 at 6:43 PM Rostislav Provodenko < @.***> wrote:
There are two parts to this:
- the execute call which indeed causes duckdb to perform the query in another thread, unblocking the event loop
- and the fetch call (which the streaming API uses) which retrieves data from duckdb and this happens on the main thread

Query execution can be long, I don't think that starving the event loop for such massive amounts of time is a good thing. But also, I don't see how doing so would "vastly" improve performance, since you're scheduling the long running operation to be processed outside the main thread only once.
-- Christophe Eymard
It would be interesting to see a benchmark, because the situation is a little different for this library compared to the SQLite drivers (sqlite and sqlite3), especially for multi-row fetches. Those drivers make an async call for every row they fetch, whereas node-duckdb only makes an async call for the execute call (as per @rostislavdeepcrawl's comment above). The async call here allows for parallelism, which, in the case of lots of single-row calls as per your suggestion above, might actually be more performant.
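The parallelism point can be sketched as follows (`executeQuery` is a hypothetical stand-in simulated with a timer, not node-duckdb's API): because an async execute hands the work off the main thread, two queries can overlap instead of running back-to-back.

```javascript
// Sketch only: executeQuery simulates a query whose execution happens off the
// main thread (here, a timer standing in for the worker-side computation).
function executeQuery(name, ms) {
  return new Promise((resolve) => setTimeout(() => resolve(name), ms));
}

async function main() {
  const t = Date.now();
  // Both queries are in flight at once, so total time is ~100 ms, not ~200 ms.
  await Promise.all([executeQuery("q1", 100), executeQuery("q2", 100)]);
  console.log(`both done in ~${Date.now() - t} ms`);
}

main();
```

A fully synchronous API gives up exactly this overlap, which is why the trade-off depends on whether the workload is many cheap dependent calls or a few expensive independent ones.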
As the Node.js driver better-sqlite3 shows, when using in-process databases that do little to no IO, performance is vastly improved by using the database synchronously instead of having everything go through the event loop.
Since DuckDB is pretty focused on analytics, I actually doubt that this driver will be used in many settings other than a single user process crunching numbers.
Maybe it would be a good idea to add a synchronous way of speaking with duckdb?