Executing other queries while iterating through a SELECT statement

alixaxel commented 5 years ago

I'm using the latest version of better-sqlite3 and I have some simple code that looks like the following:

const statements = {
  fetch: db.prepare(`SELECT "id", "data" FROM "migrations" WHERE "done" = 0;`),
  migrate: db.prepare(`UPDATE "migrations" SET "done" = ? WHERE "id" = ? LIMIT 1;`),
};

db.exec('BEGIN;');

for (let row of statements.fetch.iterate()) {
  statements.migrate.run(1, row.id);
}

db.exec('COMMIT;');

Whenever I try to run it however, I always get the following error:

      statements.migrate.run(1, 42);
                              ^
TypeError: This database connection is busy executing a query

How can I update a row while iterating through it? Surely it must be possible?

JoshuaWise commented 5 years ago

Actually no, SQLite3 can only execute one statement at a time. While iterating through a result set with .iterate(), the database connection is considered busy.

The solution is to just retrieve all rows before updating them.

for (let row of statements.fetch.all()) {
  statements.migrate.run(1, row.id);
}

alixaxel commented 5 years ago

@JoshuaWise Thanks for the follow-up, I'm under the impression that I did the same (acquire cursor, iterate + update) in PHP with pdo-sqlite extension long ago and it worked fine. My memory could be failing me though...

alixaxel commented 5 years ago

@JoshuaWise Confirmed - it works with PHP & SQLite:

DB('sqlite://iterate_update.db');
DB('CREATE TABLE IF NOT EXISTS "migration" ("done" INTEGER, "id" INTEGER PRIMARY KEY);');

for ($i = 0; $i <= 100; $i++) {
    DB('INSERT OR IGNORE INTO "migration" ("done", "id") VALUES (0, ?);', $i);
}

var_dump(DB('SELECT COUNT(1) FROM "migration" WHERE "done" = ?;', 0)->fetch()); // 100

$rows = DB('SELECT * FROM "migration" WHERE "done" = ?;', 0); // acquires cursor

while ($row = $rows->fetch()) { // iterates and updates within cursor
    DB('UPDATE "migration" SET "done" = ? WHERE "id" = ?;', 1, $row->id);
}

var_dump(DB('SELECT COUNT(1) FROM "migration" WHERE "done" = ?;', 0)->fetch()); // 0

So this must be something specific to either better-sqlite3 or just Node.js itself. I can provide you with the DB (wrapper around PDO) function above if you've interest in replicating it yourself.

JoshuaWise commented 5 years ago

@alixaxel Hmmm, I've just reviewed the C API and it seems actually you're right. I seem to remember thinking it would be better to prevent other DB actions while iterating, but I can't remember the reasoning now! I'll think on this, and perhaps remove the restriction in a future version.

My statement from before, "SQLite3 can only execute one statement at a time," does apply to parallel or asynchronous execution, but not to synchronous execution.

JoshuaWise commented 5 years ago

It should be noted that this is already possible by using two separate database connections. However, it's not an ideal workaround because you wouldn't be able to put the two iterations in the same transaction.

titoBouzout commented 5 years ago

Please fix this whenever possible you have time and motivation, this is very basic for a dbs....., thank you!

JoshuaWise commented 5 years ago

As of v5.4.0, you can now execute read-only statements (.get(), .all(), .iterate()) while iterating through a result set (.iterate()). However, SQL that mutates the database (i.e., .run()) is still not allowed.

"Why?", you might ask. Well, according to this official SQLite3 documentation, the behavior is undefined when attempting to do this. If you can do this in PHP and it works as you expect it to, you're only getting lucky—the behavior might change in any new release of SQLite3. This is not a style of operation that any application using SQLite3 should rely on. It's not safe.

better-sqlite3 will prevent you from doing unsafe things.

alixaxel commented 5 years ago

@JoshuaWise Sorry for only following up now on this but I was on holidays with limited internet availability.

Regarding your rationale and after reading the documentation, I would say that undefined != unsafe.

Within a single database connection X, a SELECT statement always sees all changes to the database that are completed prior to the start of the SELECT statement, whether committed or uncommitted. And the SELECT statement obviously does not see any changes that occur after the SELECT statement completes. But what about changes that occur while the SELECT statement is running? What if a SELECT statement is started and the sqlite3_step() interface steps through roughly half of its output, then some UPDATE statements are run by the application that modify the table that the SELECT statement is reading, then more calls to sqlite3_step() are made to finish out the SELECT statement? Will the later steps of the SELECT statement see the changes made by the UPDATE or not? The answer is that this behavior is undefined. In particular, whether or not the SELECT statement sees the concurrent changes depends on which release of SQLite is running, the schema of the database file, whether or not ANALYZE has been run, and the details of the query. In some cases, it might depend on the content of the database file, too. There is no good way to know whether or not a SELECT statement will see changes that were made to the database by the same database connection after the SELECT statement was started. And hence, developers should diligently avoid writing applications that make assumptions about what will occur in that circumstance.

I would prefer being responsible enough to choose when I might need to execute a UPDATE inside a SELECT (and deal with the potential problems that might create) than not be able to do that at all.

A simple, safe example would be updating a table not referenced by the current SELECT query.

Again from the docs:

If an application issues a SELECT statement on a single table like "SELECT rowid, * FROM table WHERE ..." and starts stepping through the output of that statement using sqlite3_step() and examining each row, then it is safe for the application to delete the current row or any prior row using "DELETE FROM table WHERE rowid=?".

So writing to current or previous rows of the cursor is safe and well-defined. This is very useful IMO.

The undefined outcome seems to only refer to rows written ahead of the cursor:

It is also safe (in the sense that it will not harm the database) for the application to delete a row that expected to appear later in the query but has not appeared yet. If a future row is deleted, however, it might happen that the row turns up after a subsequent sqlite3_step(), even after it has allegedly been deleted. Or it might not. That behavior is undefined. The application can also INSERT new rows into the table while the SELECT statement is running, but whether or not the new rows appear in subsequent sqlite3_step()s of the query is undefined. And the application can UPDATE the current row or any prior row, though doing so might cause that row to reappear in a subsequent sqlite3_step().

And the docs finish with the same warning/recommendation to developers:

As long as the application is prepared to deal with these ambiguities, the operations themselves are safe and will not harm the database file. For the purposes of the previous two paragraphs, two database connections that have the same shared cache and which have enabled PRAGMA read_uncommitted are considered to be the same database connection.

It would be great if you would reconsider unblocking this behavior and leave isolation concerns to the users. As it is, it's making too many assumptions that will often times be wrong and limitative.

JoshuaWise commented 5 years ago

Unless your situation is very bottlenecked by performance, there's almost certainly another way of accomplishing the same ultimate result, albeit less conveniently.

Should better-sqlite3 open up the potential for undefined behavior for thousands of users, just so a few experts can do something a little more conveniently? I'm inclined to answer "no". However, supporting some kind of "unsafe mode" might be worthwhile if it can be proven that there are common-enough problem domains that require this functionality, that cannot solved with a less convenient approach.

titoBouzout commented 5 years ago

You can just add an option on the lines of "use_unsafe" :true. with some sentences explaining the trade offs.

Im strongly against that trend that I see more and more from people that creates software, browsers, libraries, editors, that kinda make it impossible for people to get what they want because of "by design" reasons. Please don't take me wrong, I really appreciate a lot you sharing your work, with 1 user or with thousands it does not matter, you taking the effort to share it, which means a lot to me personally. I also understand you concerned for good reasons. I think we should always be giving people the power to decide by themselves, even if in a few cases they don't understand they shooting themselves. The power you give people that understand what they doing by far exceeds these that are learning or make mistakes.

alixaxel commented 5 years ago

@JoshuaWise I'd like to try to change your mind about this, so let me explain where I'm coming from.

I see SQLite has a super-efficient heavy duty DB, specially useful for storing and operating on a large amount of records. I have several multi-GB SQLite DBs and I usually get better performance with it than with a conventional client-server RDBMS due to the lack of network overhead, etc. And better-sqlite3 is the only implementation in the whole JS ecosystem that is not only decent, but also supports UDFs.

Currently I'm working with a SQLite DB that contains a couple dozen million queued migrations: I'm taking a schemaless document-oriented DB that has a very inconsistent structure (with years of GIGO practice), casting that inconsistent structure to a more consistent one, and writing it back to SQLite with done = 0. Then, after reviewing those queued migrations, I'm iterating through them, applying the corrected data to the schemaless DB and marking them as done = 1. You mentioned repetitively, that there's probably another/better way of accomplishing the same. And you're right, there are some other possibilities:

Collect all the rowids that should be marked as done, and write those after iterating.

Which might be unfeasible if the amount of available memory isn't enough to keep a copy of all the rowids to update. And even if that's possible, it makes writing consistent code more complex (what if halfway through, applying one specific migration in the other schemaless DB fails, etc).

Use .all() instead in a while (true) loop with a query like: ... WHERE "done" = 0 LIMIT 128.

Which is what I'm using to overcome this limitation. However, the query performance is not as efficient, since the query has to be constantly re-evaluated and on memory-mapped result set is stale after each execution.

Besides those, I don't see any other viable possibility.

Given that cursor based iteration is the usual way to approach larger-than-RAM result sets, I would argue that neither of those alternatives is really the best one. When dealing with large datasets (again, one of the areas where SQLite really shines) the first option will often not be possible, and the second will not be as fast. And neither of those are as convenient / straightforward / elegant as the idiomatic way of:

db.exec('BEGIN;');

for (let row of statements.fetch.iterate()) {
  statements.migrate.run(1, row.id);
}

db.exec('COMMIT;');

Which, according to the documentation, is not only safe but well-defined too (since we are mutating the data that was already consumed by the cursor) - I would also argue that 99% of the times you mutate data within a cursor, you are mutating the same record you're iterating on, and not prior or ahead records.

If someone is reckless enough to mangle rowid + 1 while iterating forward, they will likely bump into the issue (if there's any; as I assume the undefined behavior will be consistent given the same SQLite version/environment) early on and find another/better way of doing the same.

It seems to me that this design decision is preventing a technique that I see as a best-practice in favor of protecting users that adopt bad practices from their own mistakes. This is very limitative IMO, and since SQLite itself didn't chose to block that behavior, why should this package do that?

To give you another example, in another multi-GB SQLite DB where I have nearly a billion records, I:

iterate on table A
do some heavy calculations
store the results of those computations in table B

Which again, is totally safe and well-defined. I do this currently in Golang, without any problem.

However, with better-sqlite3 and this design decision, it would be a pain in the ass to work around it.

I appreciate the work you put into this package regardless. :)

airs0urce commented 5 years ago

I got same issue. I have two tables:

pages_html // millions records here id - page id content - html content
companies id - numeric if of company url - company url

I want to iterate all records from "pages_html" table, analyse "content" field, get all links to companies from there and insert to "companies" table. But I can't do that, because "pages_html" doesn't fit in memory.

I solved it ugly way, but it works for me - I created two databases and store one table in each db.

airs0urce commented 5 years ago

I also tried to use two connections like suggested above:

It should be noted that this is already possible by using two separate database connections. However, it's not an ideal workaround because you wouldn't be able to put the two iterations in the same transaction.

This is part of code that shows how I tried it with two connections:

const connSearchPagesHtml = new Database(__dirname + '/../database.sqlite');
const connCompanies = new Database(__dirname + '/../database.sqlite');

const stmtSelectPages = dbSearchPagesHtml.prepare('SELECT * FROM search_pages_html');
const stmtInsertCompany = dbCompanies.prepare('INSERT INTO companies(id, url) VALUES(@id, @url)');

for (const row of stmtSelectPages.iterate()) {
    stmtInsertCompany.run({id: '', url: ''});
}

But I'm getting this error:

Error: SqliteError: database is locked

About undefined behavior. If sqlite developers allowed something with undefined behavior and even described it in docs, it's not because it was hard for the developers to restrict it, there is clear reason behind it. And by my opinion database's driver should not decide what features of database should be restricted.

gbouthenot commented 5 years ago

Please reconsider. They are some very good reasons to allow to iterate over a result from a table while updating another table. This is not an anti-pattern, and is the way to go. It is performant and elegant. I don't say it is safe to to a query and update one the same table, but doing it on two different tables is safe. Should your (wonderful) library prevent from unsafe, or undefined operations because ot the API of Sqlite3 ? Maybe, but please let people who need it, and assume they know what they are doing. Please reconsider and add an option to do that. Thank you very much for all your work, you lib is awesome.

missinglink commented 4 years ago

Hi all, I had a crack at this today in https://github.com/JoshuaWise/better-sqlite3/pull/367 It still requires more nodejs tests, so please contribute some test cases!

JoshuaWise commented 4 years ago

This is now possible in v7.0.0 via unsafe mode

tintin10q commented 2 years ago

I don't know if this is already mentioned somewhere but I was hitting this problem as well and to get around it I just created two connections to the database.

const reader = new Database("DB.sqlite3");
const writer = new Database("DB.sqlite3");

Then just prepare the writing statements with the writer connection and the reader statements with the reader connection. Simple solution and it works great! Although I was writing to a different table than I was reading from.

Edit I now see that this was already mentioned. Sorry!

WiseLibs / better-sqlite3

Executing other queries while iterating through a SELECT statement #203