WiseLibs / better-sqlite3

The fastest and simplest library for SQLite3 in Node.js.

iterate(): small but important doc clarification #406

Open lsemprini opened 4 years ago

lsemprini commented 4 years ago

Thanks for your awesome library! The KISS design and lack of async and other zombie borg "you will be assimilated" thinking is truly a breath of fresh air!

The dox for iterate() at

https://github.com/JoshuaWise/better-sqlite3/blob/master/docs/api.md

don't actually say what the memory usage of iterate() is.

If iterate() guarantees to hold only a constant number of rows in memory at once in the worst case (all the way down to the C library level), then this is SUPER useful to know and should be documented.

If iterate() is actually a wrapper around all(), or could otherwise hold the whole record set in memory in the worst case (at any level), then this is ALSO super useful to know and should be documented.

Right now, there is a comment about performance vs. all() which makes me lean towards the second interpretation; whatever is true should be explicitly documented.

JoshuaWise commented 4 years ago

So here's the thing. The iterate() function is supposed to work as you describe in your first example. However, some of the memory that's allocated is managed by the JavaScript garbage collector, which we don't have control over. So theoretically, you should be able to iterate over a huge number of rows without too many being held in memory at a time, but I can't provide such guarantees without also having such a guarantee from the garbage collector, which I don't have because I don't own that code.

But for all practical purposes, it will probably work as you expect in your first example. Any memory that this library allocates (that isn't managed by V8) is de-allocated after each loop iteration. The iterate() function is not a wrapper around all().
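A minimal sketch of that contrast, assuming a hypothetical events table in a hypothetical app.db (the names are illustrative, not from this thread):

const db = require('better-sqlite3')('app.db');

// all(): the entire result set is materialized into one array before
// the loop starts, so memory grows with the number of rows.
for (const row of db.prepare('SELECT * FROM events').all()) {
    console.log(row.id);
}

// iterate(): rows are handed out one at a time; once the loop moves on
// and nothing else references a row, it becomes eligible for GC.
for (const row of db.prepare('SELECT * FROM events').iterate()) {
    console.log(row.id);
}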

Also:

The KISS design and lack of async and other zombie borg "you will be assimilated"

This made me laugh.

lsemprini commented 4 years ago

Interesting. The main question is "when fetching a large result set, what is the worst-case amount of memory that better-sqlite3 will allocate and hold onto during the full iteration, and is that amount proportional to the number of results?" Unreferenced memory held by the JS engine is less of a concern, because only a bounded amount of it can accumulate before the GC kicks in, so it is not proportional to the number of results. So some dox like this would do the trick:

Unlike all(), iterate() lets go of its reference to the memory for a given result each time you iterate forward to the next result. So, when iterating through large result sets with iterate() (and assuming you also let go of your references to the older results), the peak amount of memory referenced at any given moment will be proportional to the size of the current result, but not proportional to the number of results or the total size of all results combined.

This is important if you are using better-sqlite3 with large result sets in strictly memory-limited environments (e.g. if the JavaScript engine is configured with memory limits via --max-old-space-size, or the process is subject to being killed when it exceeds certain memory limits, such as UNIX rlimits imposed on a CGI script by a web server). In such an environment, the JavaScript engine can still hold onto operating system memory that nobody references but that has not yet been garbage-collected, but the worst-case amount of such memory should be constant, not proportional to the number of records in the result.
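A minimal sketch of the caveat in that wording, again using a hypothetical events table: iterate() only keeps peak memory flat if the caller also drops each row.

const db = require('better-sqlite3')('app.db');
const stmt = db.prepare('SELECT * FROM events');

// Flat memory: each row becomes collectable once its iteration ends.
let count = 0;
for (const row of stmt.iterate()) {
    count += 1;
}

// NOT flat memory: retaining every row rebuilds all()'s footprint.
const kept = [];
for (const row of stmt.iterate()) {
    kept.push(row);
}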

lsemprini commented 4 years ago

I think it would be helpful to include dox like the example in the previous message in the official dox for iterate() at:

https://github.com/JoshuaWise/better-sqlite3/blob/master/docs/api.md

pelletier197 commented 1 year ago

I agree it would be helpful to have that in the documentation. It got me a little confused, because what I observe seems to be much more like the second example.

When I execute iterate() on a query that returns 10 million rows, the query first takes a whopping full minute to return the first result, and during that minute you can see the memory increasing steadily, up to 6 GB of RAM 👀. Then, once the iteration begins and I receive the first result, the memory starts decreasing slowly as I'm iterating through the records.

This question is a bit more than three years old, so maybe this is not the best place to ask, but has something changed since then? This should definitely be documented if it is the actual behavior, because I used this function thinking it would be awesome and use little memory, and I was quite disappointed when I saw that 😅.

JoshuaWise commented 1 year ago

@pelletier197 I am not able to reproduce the behavior you described.

I filled a database with 10,000,000 rows (~5GiB of data), using the following script:

const db = require('better-sqlite3')('test.db');
db.pragma('journal_mode = MEMORY'); // Makes it faster
db.pragma('locking_mode = EXCLUSIVE'); // Makes it faster
db.exec('BEGIN EXCLUSIVE; COMMIT'); // Needed for locking_mode = EXCLUSIVE

db.exec(`
    CREATE TABLE data (
        id INTEGER PRIMARY KEY,
        chunk BLOB NOT NULL
    ) STRICT;
`);

const chunk = Buffer.alloc(512);
const stmt = db.prepare('INSERT INTO data (id, chunk) VALUES (?, ?)');
for (let id = 0; id < 10000000; ++id) {
    stmt.run(id, chunk);
}

db.close();

Then I ran the following script to iterate through all the rows:

console.log('opening database...');
const db = require('better-sqlite3')('test.db', { readonly: true });

console.log('preparing statement...');
const stmt = db.prepare('SELECT chunk FROM data ORDER BY id ASC').pluck();

console.log('running query...');
const iterator = stmt.iterate();

let index = 0;
console.log('getting row #0...');
for (const _ of iterator) {
    console.log(`getting row #${++index}...`);
}

console.log('done!');

The process memory usage never went above ~70MiB. Then I tried replacing .iterate() with .all() and the memory usage went to 5GiB as expected.
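One way to reproduce that measurement (a sketch reusing the schema from the scripts above) is to sample process.memoryUsage().rss while iterating:

const db = require('better-sqlite3')('test.db', { readonly: true });
const stmt = db.prepare('SELECT chunk FROM data ORDER BY id ASC').pluck();

let peak = 0;
let index = 0;
for (const _ of stmt.iterate()) {
    if (++index % 1000000 === 0) {
        const rss = process.memoryUsage().rss;
        peak = Math.max(peak, rss);
        console.log(`row #${index}: rss = ${(rss / 1048576).toFixed(1)} MiB`);
    }
}
console.log(`peak sampled rss: ${(peak / 1048576).toFixed(1)} MiB`);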

JoshuaWise commented 1 year ago

Note that if you're using RETURNING in your query, SQLite will buffer all the results internally: https://www.sqlite.org/lang_returning.html#processing_order
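For example (a sketch reusing the data table above, with behavior per the linked SQLite docs): even though the loop looks incremental, SQLite buffers every RETURNING row internally before the first one is handed back.

const db = require('better-sqlite3')('test.db');
const stmt = db.prepare('DELETE FROM data WHERE id < ? RETURNING id');

// The RETURNING rows are buffered inside SQLite before the first
// iteration runs, so iterate() cannot keep memory flat here.
for (const row of stmt.iterate(1000)) {
    console.log(row.id);
}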

pelletier197 commented 1 year ago

@JoshuaWise thank you for your help. I realised there was an ORDER BY in the query that SQLite didn't like much, which caused it to store the whole query result in memory. That makes plenty of sense when you think about it; I just wasn't aware of the cost of that ORDER BY clause. I removed it and now it runs smoothly.
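One way to spot this ahead of time (a sketch using the data table from the scripts above) is to ask SQLite for its query plan: a step reading "USE TEMP B-TREE FOR ORDER BY" means the full result set is sorted before the first row comes back, while an ORDER BY satisfied by an index, such as the integer primary key, lets rows stream out incrementally.

const db = require('better-sqlite3')('test.db', { readonly: true });

// Sorted via the rowid / primary key: rows can stream incrementally.
const byId = db.prepare('EXPLAIN QUERY PLAN SELECT chunk FROM data ORDER BY id').all();
byId.forEach(step => console.log(step.detail));

// No index on `chunk`: expect "USE TEMP B-TREE FOR ORDER BY".
const byChunk = db.prepare('EXPLAIN QUERY PLAN SELECT chunk FROM data ORDER BY chunk').all();
byChunk.forEach(step => console.log(step.detail));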

LRagji commented 1 week ago

@pelletier197, that's an interesting find about the ORDER BY clause. Where did you get that clue? And does it also happen for numeric primary key columns? I assume those become the clustered index that table rows are stored in. Also, doesn't SQLite implement an on-disk merge sort for large data sets? Every database needs one, since it can never know in advance how large a result will be... or am I missing something?