duckdb / duckdb_spatial

MIT License
489 stars 40 forks source link

Issues writing GeoParquet output with various compression algorithms #398

Closed cholmes closed 2 months ago

cholmes commented 2 months ago

I've been working with a ~10 million row dataset, trying to get it to write out a GeoParquet, and it's been having a number of hard crashes. I think I narrowed it down to when it's writing out the Parquet, and different compression formats perform differently:

zstd fails:

D COPY ps TO 'output/duckdb-out2.parquet' (FORMAT 'parquet', COMPRESSION 'zstd');
duckdb(66080,0x1e8422100) malloc: Heap corruption detected, free list is damaged at 0x600001dc9400
*** Incorrect guard value: 4394346680
duckdb(66080,0x1e8422100) malloc: *** set a breakpoint in malloc_error_break to debug
zsh: abort      duckdb test.duck

snappy fails:

D COPY ps TO 'output/duckdb-out2.parquet' (FORMAT 'parquet', COMPRESSION 'snappy');
zsh: segmentation fault  duckdb test.duck

brotli works, but takes 10+ minutes. And gzip works perfectly, and I think was under a minute. Scratch that, my second attempt to do gzip and measure the time (as it was more like 10 seconds) failed halfway through:

D COPY ps TO 'output/duckdb-out2.parquet' (FORMAT 'parquet', COMPRESSION 'gzip');
 64% ▕██████████████████████████████████████▍                     ▏ duckdb(71353,0x16bf77000) malloc: Heap corruption detected, free list is damaged at 0x600003e02300
*** Incorrect guard value: 4393740472
duckdb(71353,0x16bf77000) malloc: *** set a breakpoint in malloc_error_break to debug
zsh: abort      duckdb test.duck

And then future ones didn't go anywhere:

D COPY ps TO 'output/duckdb-out2.parquet' (FORMAT 'parquet', COMPRESSION 'gzip');
zsh: segmentation fault  duckdb test.duck
(geo3) cholmes@Chriss-MacBook-Pro ps-footprints % duckdb test.duck
v1.1.0 fa5c2fe15f
Enter ".help" for usage hints.
D load spatial;
D COPY ps TO 'output/duckdb-out2.parquet' (FORMAT 'parquet', COMPRESSION 'gzip');
zsh: segmentation fault  duckdb test.duck

If I exclude the geometry all works and is 5-6 seconds:

COPY (select * EXCLUDE geometry from ps) to 'output/duckdb-no-geom-out.parquet' (FORMAT 'parquet');

Also tested uncompressed (failed):

D COPY ps TO 'output/duckdb-out2.parquet' (FORMAT 'parquet', COMPRESSION 'uncompressed');
duckdb(70131,0x1702b7000) malloc: Heap corruption detected, free list is damaged at 0x600000fe5ae0
*** Incorrect guard value: 4323223736
duckdb(70131,0x1702b7000) malloc: *** set a breakpoint in malloc_error_break to debug
zsh: abort      duckdb test.duck

And lz4 almost works - super fast to 99%, then failed:

D COPY ps TO 'output/duckdb-out2.parquet' (FORMAT 'parquet', COMPRESSION 'lz4');
 99% ▕███████████████████████████████████████████████████████████▍▏ duckdb(69856,0x1e8422100) malloc: Heap corruption detected, free list is damaged at 0x600001602d00
*** Incorrect guard value: 4357761208
duckdb(69856,0x1e8422100) malloc: *** set a breakpoint in malloc_error_break to debug
zsh: abort      duckdb test.duck

I tried out the GDAL outputs, but none of them work as this is STAC GeoParquet and has a lot of nested structs:

D copy ps to 'output/out.fgb' WITH (FORMAT GDAL, Driver 'FlatGeobuf');
Not implemented Error: Unsupported type for OGR: STRUCT(thumbnail STRUCT("auth:refs" VARCHAR[], href VARCHAR, roles VARCHAR[], title VARCHAR, "type" VARCHAR))

The schema is:

CREATE TABLE ps(assets STRUCT(thumbnail STRUCT("auth:refs" VARCHAR[], href VARCHAR, roles VARCHAR[], title VARCHAR, "type" VARCHAR)), bbox STRUCT(xmin DOUBLE, ymin DOUBLE, xmax DOUBLE, ymax DOUBLE), id VARCHAR, updated TIMESTAMP WITH TIME ZONE, geometry GEOMETRY, collection VARCHAR, "type" VARCHAR, links STRUCT("auth:refs" VARCHAR[], href VARCHAR, rel VARCHAR, title VARCHAR, "type" VARCHAR)[], "auth:schemes" STRUCT(openid STRUCT("type" VARCHAR, description VARCHAR, openIdConnectUrl VARCHAR)), stac_extensions VARCHAR[], stac_version VARCHAR, constellation VARCHAR, datetime TIMESTAMP, "eo:cloud_cover" INTEGER, "eo:snow_cover" INTEGER, gsd DOUBLE, instruments VARCHAR[], "pl:clear_percent" BIGINT, "pl:ground_control" BOOLEAN, "pl:item_type" VARCHAR, "pl:pixel_resolution" BIGINT, "pl:publishing_stage" VARCHAR, "pl:quality_category" VARCHAR, "pl:strip_id" VARCHAR, platform VARCHAR, updated_1 TIMESTAMP, published TIMESTAMP, "view:azimuth" DOUBLE, "view:off_nadir" DOUBLE, "view:sun_azimuth" DOUBLE, "view:sun_elevation" DOUBLE, "proj:epsg" BIGINT, "proj:shape" BIGINT[]);

I'm on OS/X on M2, with DuckDB 1.1. Happy to share more info if needed.

Maxxen commented 2 months ago

Hi! This seems like a duplicate of https://github.com/duckdb/duckdb/issues/13914 which was fixed yesterday. Bugfix release is scheduled for monday.

cholmes commented 2 months ago

Oh great! I'll try it out.

cholmes commented 2 months ago

Just tried the nightly on os/x - v1.0.1-dev4917 45787e5f9f and I see the same problem, though maybe it wasn't in yet? Or maybe the nightlies aren't on 1.1 yet? Happy to try anything out, or to wait for monday.

Maxxen commented 2 months ago

Hmm, that seems like an old nightly build? The fix should be included in the 1.1.1 nightlies, e.g. v1.1.1-dev238 f0f78913a5 (installed from https://duckdb.org/docs/installation/?version=main&environment=cli&platform=macos&download_method=direct)

cholmes commented 2 months ago

Yeah, I was surprised to get v1.0.1 when I clicked download - I'm pretty sure I clicked the link just like the one you shared. But just did it again and got v1.1.1, so must have been user error or some weird quirk. Will try to try it today. Thanks!

cholmes commented 2 months ago

Worked great! Thanks!