Open ceymard opened 3 years ago
There are two parts to this:

- the execute call, which indeed causes duckdb to perform the query in another thread, unblocking the event loop
- and the fetch call (which the streaming API uses), which retrieves data from duckdb, and this happens on the main thread

Query execution can be long; I don't think that starving the event loop for such massive amounts of time is a good thing. But also, I don't see how doing so would "vastly" improve performance, since you're scheduling the long-running operation to be processed outside the main thread only once.
The gains come mostly in the case where you exchange lots of successive queries with the database; for instance, hundreds or thousands of consecutive updates that depend on the results of select queries.
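The overhead being described can be sketched with a hypothetical microbenchmark (the "queries" here are stand-ins, not real database calls): when each near-instant local operation is deferred to the event loop, the per-call scheduling cost dominates.

```javascript
// Sketch only: tinyQuerySync / tinyQueryAsync are hypothetical stand-ins for a
// small, local, near-instant query (e.g. an indexed UPDATE or SELECT).
function tinyQuerySync(i) {
  return i * 2; // completes immediately on the calling thread
}

function tinyQueryAsync(i) {
  // Defers completion through the event loop, as an async driver does per call.
  return new Promise((resolve) => setImmediate(() => resolve(i * 2)));
}

async function bench(n) {
  let t = Date.now();
  for (let i = 0; i < n; i++) tinyQuerySync(i);
  const syncMs = Date.now() - t;

  t = Date.now();
  for (let i = 0; i < n; i++) await tinyQueryAsync(i);
  const asyncMs = Date.now() - t;

  // asyncMs is typically far larger: the work is trivial, the scheduling is not.
  console.log({ syncMs, asyncMs });
}

bench(10_000);
```

This is the shape of workload where better-sqlite3's synchronous API wins: thousands of dependent round-trips, each too cheap to justify an event-loop hop.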
Look at the benchmarks (and what is being benchmarked) from the better-sqlite3 library: https://github.com/JoshuaWise/better-sqlite3
> The way I look at it is that the asynchronous API is really only useful […]
I think asynchronous is useful but probably not the main approach I'd be using when dealing with a database meant for processing data and not simply storing it.
I have scripts that handle big loads of data but don't process anything in parallel. In that case, I don't mind blocking the process, since it would just be one async function waiting on the result of execute anyway. I also have scripts that do a bunch of individual updates in quick sequence, in which case I do appreciate the speedup from not deferring to the event loop an operation that is almost instant, since everything is local.
Anyway, I think there is a place for both approaches, and I'm also pretty certain that synchronous access to a local database is much more efficient in many (if not most) of the scenarios this particular database will end up being used in.
On Wed, May 19, 2021 at 6:43 PM Rostislav Provodenko < @.***> wrote:
There are two parts to this:
- the execute call which indeed causes duckdb to perform the query in another thread, unblocking the event loop
- and the fetch call (which the streaming API uses) which retrieves data from duckdb and this happens on the main thread

Query execution can be long, I don't think that starving the event loop for such massive amounts of time is a good thing. But also, I don't see how doing so would "vastly" improve performance, since you're scheduling the long running operation to be processed outside the main thread only once.
-- Christophe Eymard
It would be interesting to see a benchmark, because the situation is a little different for this library compared to the SQLite drivers (sqlite and sqlite3), especially for multi-row fetches. Those drivers make an async call for every row they fetch, whereas node-duckdb only makes an async call for the execute call (as per @rostislavdeepcrawl's comment above). The async call here allows for parallelism, which, in the case of lots of single-row calls as per your suggestion above, might actually be more performant.
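The parallelism point can be sketched as follows (`executeQuery` is a hypothetical stand-in simulated with a timer, not node-duckdb's API): because an async execute hands the work off the main thread, two queries can overlap instead of running back-to-back.

```javascript
// Sketch only: executeQuery simulates a query whose execution happens off the
// main thread (here, a timer standing in for the worker-side computation).
function executeQuery(name, ms) {
  return new Promise((resolve) => setTimeout(() => resolve(name), ms));
}

async function main() {
  const t = Date.now();
  // Both queries are in flight at once, so total time is ~100 ms, not ~200 ms.
  await Promise.all([executeQuery("q1", 100), executeQuery("q2", 100)]);
  console.log(`both done in ~${Date.now() - t} ms`);
}

main();
```

A fully synchronous API gives up exactly this overlap, which is why the trade-off depends on whether the workload is many cheap dependent calls or a few expensive independent ones.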
As the Node.js driver better-sqlite3 shows, when using in-process databases that do little to no IO, performance is vastly improved by using the database synchronously instead of having everything go through the event loop.
Since DuckDB is pretty focused on analytics, I actually doubt that this driver will be used in many settings other than a single user process crunching numbers.
Maybe it would be a good idea to add a synchronous way of speaking with duckdb?