Arrow table from query result much larger than equivalent inserted table

duckdb / duckdb-wasm

WebAssembly version of DuckDB

https://shell.duckdb.org

MIT License

1.02k stars 110 forks source link

Arrow table from query result much larger than equivalent inserted table #1741

Open keller-mark opened 1 month ago

keller-mark commented 1 month ago

What happens?

Arrow tables returned by conn.query are much larger than expected due to lack of usage of Dictionary encoding. In the web browser, large (>1 million row) tables that can be inserted into the database successfully cannot be subsequently queried in their entirety due to the memory footprint of the returned Arrow table.

To Reproduce

This Observable notebook contains a minimal reproduction: https://observablehq.com/d/56ceaa780133858a

Browser/Environment:

Firefox Developer Edition 127.0b4

Device:

Macbook

DuckDB-Wasm Version:

1.24.0

DuckDB-Wasm Deployment:

Observable Notebook

Full Name:

Mark Keller

Affiliation:

Harvard Medical School

keller-mark commented 1 month ago

I think this comes down to these lines:

const buffer = await this._bindings.runQuery(this._conn, text);
const reader = arrow.RecordBatchReader.from<T>(buffer);

On the JS side, the arrow.RecordBatchReader could potentially be modified to return dictionary-encoded columns. However if the buffer passed is not already dictionary-encoded, then the memory issue may persist due to the buffer being very large.

domoritz commented 1 month ago

Hmm, but if it's a buffer, doesn't that mean that JS here is just instantiating an arrow record batch from IPC and the IPC dictates the schema and so we can't just dictionary encode after the fact?

keller-mark commented 1 month ago

Yes that is what i was trying to convey with

the memory issue may persist due to the buffer being very large

i.e., that the fix might need to be on the C++ side

keller-mark commented 1 month ago

~That aside, on the JS side, couldn't the dictionary encoding be performed in incremental fashion as the buffer is parsed? For example you could keep a set of values that have been "seen" already and if an unseen value is encountered, add a mapping to the dictionary~

EDIT

I suppose the above statement does not make sense if there is no "parsing" step https://github.com/apache/arrow/blob/ff9921ffa89585be69ae85674bb365d03cb22ba4/js/src/ipc/reader.ts#L357 - then the only option would be on the C++ side

domoritz commented 1 month ago

Yep, that's what I'm thinking as well but I may be wrong.