duckdb / duckdb_spatial


DuckDB process killed when creating an R-Tree index #410

Closed CGenie closed 1 month ago

CGenie commented 1 month ago

I downloaded the OSM PBF file for Poland from here: https://download.geofabrik.de/europe.html

I created a duckdb table like this:

INSTALL spatial;
LOAD spatial;

CREATE TABLE nodes AS (SELECT * FROM ST_ReadOSM('poland.osm.pbf') WHERE kind = 'node');
ALTER TABLE nodes ADD COLUMN pt GEOMETRY;
UPDATE nodes SET pt = ST_Point2D(lat, lon);

My DB size is around 6.5 GB. Now I try to do:

CREATE INDEX nodes_pt_idx ON nodes USING RTREE (pt);

I see a progress bar; it goes up to around 28%, then I get a killed message, probably because Linux decided the process was using too much memory. Indeed, memory usage rises to 18 GB when I watch htop at the same time (I have 32 GB of RAM). I thought DuckDB was able to handle tasks larger than memory?

I'm using DuckDB v1.1.1.

BTW, it seems I get the same thing when creating a full-text-search index (https://duckdb.org/docs/extensions/full_text_search.html), so maybe it's an issue with index creation in general?
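
(For reference, building an FTS index with that extension looks roughly like this; the table and column names below are placeholders, not my actual schema:)

INSTALL fts;
LOAD fts;
-- hypothetical example following the linked FTS documentation:
-- index the 'body' column of a 'documents' table, keyed by 'doc_id'
PRAGMA create_fts_index('documents', 'doc_id', 'body');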

Maxxen commented 1 month ago

Hello! Have you tried setting a memory limit? All memory allocated by the R-Tree should be tracked by DuckDB and respect the memory limit parameter.

Are you running with a disk-backed database file or entirely in memory? You may want to have a look at the spilling-to-disk section as well.
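
For example, something along these lines (the limit and path are placeholder values):

-- cap DuckDB's tracked memory and point it at a scratch directory it can spill to
SET memory_limit = '10GB';
SET temp_directory = '/path/to/scratch';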

CGenie commented 1 month ago

Well, with memory_limit set to 10G I get:

D set memory_limit = '10G';
D load spatial;
D CREATE INDEX nodes_pt_idx ON nodes USING RTREE (pt);
 33% ▕███████████████████▊                                        ▏ INTERNAL Error: Attempted to dereference unique_ptr that is NULL!
This error signals an assertion failure within DuckDB. This usually occurs due to unexpected conditions or errors in the program's logic.
For more information, see https://duckdb.org/docs/dev/internal_errors

I'll try disk spilling. The DB is a file, not in memory.

CGenie commented 1 month ago

OK, I get the same error with disk spilling (i.e. `SET tmp_dir = '...'`).

Maxxen commented 1 month ago

I'm unable to reproduce this on DuckDB v1.1.0 on macOS in disk-backed mode (although without a memory limit).

D load spatial;
D CREATE TABLE nodes AS SELECT * FROM ST_ReadOSM('poland-latest.osm.pbf') WHERE kind = 'node';
100% ▕████████████████████████████████████████████████████████████▏
D ALTER TABLE nodes ADD COLUMN pt GEOMETRY;
D UPDATE nodes SET pt = ST_Point2D(lat, lon);
100% ▕████████████████████████████████████████████████████████████▏
D CREATE INDEX nodes_pt_idx ON nodes USING RTREE (pt);
100% ▕████████████████████████████████████████████████████████████▏
D

I'll see if I can mimic your setup and reproduce the error.

CGenie commented 1 month ago

Hm, OK, it seems to have finished just fine with the memory limit set to 15G.

CGenie commented 1 month ago

I'm on Linux btw.

Maxxen commented 1 month ago

If I set a 10gb limit I get a proper out-of-memory error before I even get to the index creation.

D SET memory_limit = '10gb';
D load spatial;
D CREATE TABLE nodes AS SELECT * FROM ST_ReadOSM('poland-latest.osm.pbf') WHERE kind = 'node';
100% ▕████████████████████████████████████████████████████████████▏
D ALTER TABLE nodes ADD COLUMN pt GEOMETRY;
D UPDATE nodes SET pt = ST_Point2D(lat, lon);
 73% ▕███████████████████████████████████████████▊                ▏ Out of Memory Error: failed to pin block of size 256.0 KiB (9.3 GiB/9.3 GiB used)

Nonetheless, an INTERNAL Error is never good; I'll look into setting up a Linux box to dive into this further tomorrow.

CGenie commented 1 month ago

Yes, I guess even with memory_limit = '1G' it should still finish, though it might just take more time.

Maxxen commented 1 month ago

Not necessarily: the index itself must still be able to fit in memory. With the 217 708 491 nodes in the Poland extract, you're looking at 217,708,491 * 40 bytes ≈ 8.7 GB of raw index buffers, not including any additional bookkeeping structures in DuckDB.
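
A quick way to sanity-check that from the table itself (treating ~40 bytes per indexed row as a rough figure, not an exact one):

-- rough estimate of the raw R-Tree buffer size in GB, assuming ~40 bytes per row
SELECT count(*) * 40 / 1e9 AS approx_index_size_gb FROM nodes;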

CGenie commented 1 month ago

Hm, OK, but shouldn't it handle data larger than RAM by using temp disk space?

Maxxen commented 1 month ago

Unfortunately, this is currently a limitation of all indexes in DuckDB. While they are lazily loaded from disk once you open a database, they will not be unloaded from memory again, which is why you need to be able to keep the entire index in memory when you first create it. That said, their memory is still tracked by DuckDB, meaning that they should respect the SET memory_limit option, and DuckDB will attempt to unload other temporary memory to make space for them.
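
In practice that means raising the limit to at least the estimated index size before building it, e.g. (15G is just the value that worked for you above):

-- give the index enough room to stay fully in memory while it is built
SET memory_limit = '15GB';
CREATE INDEX nodes_pt_idx ON nodes USING RTREE (pt);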

I've pushed a fix for the INTERNAL Error: Attempted to dereference unique_ptr that is NULL error in #420.