duckdb / duckdb-node

Memory leak while reading parquet files from disk and web #22

Open joaomonteiro234 opened 10 months ago

joaomonteiro234 commented 10 months ago

What happens? I have a service that reads parquet files from disk and from the web (over HTTP). I believe there is a memory leak that eventually causes Node.js to run out of memory.

To Reproduce The following code repeatedly reads a small parquet file (a single row of data) from disk and logs the memory usage. We can observe an increase in both rss and heap over time. I also created a graph covering the first 54,000 iterations. From what I have seen, reading bigger parquet files leads to a bigger leak.

const duckdb = require('duckdb');

const db = new duckdb.Database(':memory:');

let i = 0;
function test() {
  db.all("SELECT * FROM READ_PARQUET('file.parquet')", () => {
    i += 1;
    const memory = process.memoryUsage();
    console.log('Iteration: ', i, memory);

    // Issue the next read as soon as this one completes.
    test();
  });
}

test();

[Graph: rss and heap usage rising steadily over the first 54,000 iterations]

OS: Both Windows and Ubuntu

DuckDB Version: 0.9.1

davidflanagan commented 10 months ago

Do you still have the leak if you move the recursive call to test() out of the callback? This doesn't seem like a convincing test case.

If you can use setTimeout() to schedule the next call to test and still get a memory leak, that would be more convincing.

joaomonteiro234 commented 10 months ago

I conducted a new test as follows:

const duckdb = require('duckdb');

const db = new duckdb.Database(':memory:');

function test() {
  db.run("SELECT * FROM READ_PARQUET('file.parquet')");
  setTimeout(test, 100);
}
setTimeout(test, 100);

// Log memory statistics once a minute.
setInterval(() => {
  const memory = process.memoryUsage();
  console.log(`rss: ${memory.rss / (1024 * 1024)} heapTotal: ${memory.heapTotal / (1024 * 1024)} heapUsed: ${memory.heapUsed / (1024 * 1024)}`);
}, 60000);

[Graphs: rss and heap usage over the ~40-minute run, still increasing steadily]

Using a timeout and moving the recursive call reduced the leak, but there is still a steady increase in both rss and heap. I also tried using a file-backed database instead of :memory:, but the leak persisted.

I'll let the code run for a few hours to see whether the memory stabilizes at some point, since it has only run for 40 minutes so far.

If you need any more information/support, I will gladly provide it. Thanks in advance.

joaomonteiro234 commented 10 months ago

I left it running for 12+ hours and these are the results:

[Graph: memory usage over the 12+ hour run]

davidflanagan commented 10 months ago

I'm a complete newbie with DuckDB, so take my comments with a grain of salt....

Maybe try your test case with a prepared statement? Right now you don't know if the memory leak is coming from the SQL statement parser or the result set of the parquet reader. Trying again with a prepared statement will help to narrow things down.
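Something along these lines is what I have in mind (untested, and reusing the file.parquet from your original snippet):

const duckdb = require('duckdb');

const db = new duckdb.Database(':memory:');
const con = db.connect();
// The query is parsed and planned once, up front.
const stmt = con.prepare("SELECT * FROM READ_PARQUET('file.parquet')");

function test() {
  // Re-running the prepared statement exercises only the parquet
  // reader and the result path, not the SQL parser.
  stmt.all((err, rows) => {
    if (err) throw err;
  });
  setTimeout(test, 100);
}
setTimeout(test, 100);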

davidflanagan commented 10 months ago

Oh, also: if it ever takes more than 100ms to read the parquet file, then your test will end up doing more than one read at a time. I don't know how big your file is, but the test would probably be better structured if you move the setTimeout() call into a callback that you pass to db.run(). That way you can use a timeout of 0 and it will schedule the next call to test as fast as possible without ever having two reads running at the same time.
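An untested sketch of that restructuring:

const duckdb = require('duckdb');

const db = new duckdb.Database(':memory:');

function test() {
  db.run("SELECT * FROM READ_PARQUET('file.parquet')", (err) => {
    if (err) throw err;
    // The previous read has finished by the time this runs, so a
    // 0ms timeout can never produce two overlapping reads.
    setTimeout(test, 0);
  });
}
test();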

joaomonteiro234 commented 10 months ago

It was taking around 20ms to read the file, so there were no concurrent reads. Nonetheless, I changed it and also tested with and without a prepared statement, and the leak was still present in both scenarios. I also removed the READ_PARQUET call entirely:

const duckdb = require('duckdb');

const db = new duckdb.Database(':memory:');
const con = db.connect();
const stmt = con.prepare('SELECT 42 AS fortytwo');

function test() {
  // Schedule the next run only after the result has been delivered.
  stmt.all(() => setTimeout(test));
}
setTimeout(test, 100);

nfadili commented 5 months ago

I think I might also be running into a memory leak issue using duckdb within a long-running process (an API web server). We have a health endpoint, /readyz, that issues the query db.all('SELECT 1'); it is called every 15 seconds or so by our Kubernetes readiness probe. Our duckdb client is configured to use :memory:.
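Stripped down, the setup looks roughly like this (Express is shown only for illustration; the rest of our server is omitted):

const express = require('express');
const duckdb = require('duckdb');

const db = new duckdb.Database(':memory:');
const app = express();

// The Kubernetes readiness probe hits this every ~15 seconds.
app.get('/readyz', (req, res) => {
  db.all('SELECT 1', (err) => {
    // Report ready only if the trivial query succeeds.
    res.status(err ? 500 : 200).end();
  });
});

app.listen(3000);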

[Screenshot (2024-04-02): memory usage of the server building up over time]

With a query this simple and essentially no data being pulled, I'm curious how/what could be building up in memory. Even if this kind of use case isn't supported, I assume the regular queries our API server makes would build up memory over time as well. @joaomonteiro234 did you ever figure anything out here?

drecchia commented 5 months ago

I notice that memory usage stabilizes above 1 GB of RAM on Node.js, even after closing the database. So it's not a major issue here, but I would certainly like to get it down closer to the ~20 MB a Python instance uses.

nfadili commented 5 months ago

I just came across bug report #55 and the subsequent fix in #65 - might this fix the issue? I can't test until the duckdb-async package updates its version 🥲

joaomonteiro234 commented 5 months ago

@nfadili, unfortunately no. I was on a tight schedule, so I decided to stop using the library.

DrewScatterday commented 1 month ago

Still experiencing memory leak issues with a very large parquet file, even after #65.

DrewScatterday commented 1 month ago

Ended up solving it. I think it's a leak in a GeoJSON function I'm using in the spatial extension. More details here: https://github.com/duckdb/duckdb_spatial/issues/371