duckdb / duckdb-wasm

WebAssembly version of DuckDB
https://shell.duckdb.org
MIT License

Reading a remote parquet file with a simple WHERE clause results in loading more than twice its size. #1577

Open ericemc3 opened 6 months ago

ericemc3 commented 6 months ago

What happens?

My remote Parquet file is 7.2 MB. If I read it with a simple WHERE clause, more than 15 MB pass through the network.

To Reproduce

CREATE OR REPLACE TABLE t AS FROM 'https://static.data.gouv.fr/resources/tables-aufilduboamp-2024/20240113-061700/boamp-panorama-2024-parquet-integral.parquet';

=> 7.2 MB (Chrome DevTools network inspector)

CREATE OR REPLACE TABLE t AS FROM 'https://static.data.gouv.fr/resources/tables-aufilduboamp-2024/20240113-061700/boamp-panorama-2024-parquet-integral.parquet'
WHERE P_35_Typemarche = 'SERVICES'  ;

=> 15.6 MB

OS:

Win11

DuckDB Version:

9.2

DuckDB Client:

shell wasm or cli

Full Name:

eric mauviere

Affiliation:

icem7

Have you tried this on the latest main branch?

I have tested with a main build

Have you tried the steps to reproduce? Do they include all relevant data and configuration? Does the issue you report still appear there?

carlopi commented 6 months ago

Thanks a lot for the bug report.

This does not reproduce in the DuckDB CLI, where in both cases 7.5 MB go through the network according to EXPLAIN ANALYZE.
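For anyone who wants to repeat the CLI measurement, a minimal sketch could look like the following. It assumes the native DuckDB CLI (not duckdb-wasm) with the httpfs extension available, and it uses a plain SELECT instead of the CREATE OR REPLACE TABLE from the report, on the assumption that both read the remote file the same way. For remote files, the EXPLAIN ANALYZE profile should include HTTP statistics such as bytes received and the number of HEAD and GET requests.

```sql
-- Sketch for the native DuckDB CLI; assumes the httpfs extension can be installed.
INSTALL httpfs;
LOAD httpfs;

-- EXPLAIN ANALYZE executes the query and prints its profile.
-- Assumption: a SELECT with the same filter reads the file like the CTAS in the report.
EXPLAIN ANALYZE
SELECT *
FROM 'https://static.data.gouv.fr/resources/tables-aufilduboamp-2024/20240113-061700/boamp-panorama-2024-parquet-integral.parquet'
WHERE P_35_Typemarche = 'SERVICES';
```

Running the same statement without the WHERE clause gives the baseline to compare against.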

This is a problem specific to the duckdb-wasm implementation of GET requests and needs to be solved there. It's pretty bad, since the multiplier can be even worse.

@szarnyasg: can you move it to duckdb-wasm repository?

szarnyasg commented 6 months ago

Thanks @carlopi for chiming in. I moved the issue.

ryan-williams commented 5 months ago

Related "discussions" about fetched data amount: