duckdb / duckdb-wasm

WebAssembly version of DuckDB
https://shell.duckdb.org
MIT License
1.18k stars 125 forks source link

insertArrowTable broken #1640

Closed dioptre closed 7 months ago

dioptre commented 7 months ago

Preamble

Seems the versions are very sensitive and you cant be using a newer version of arrow on an older release. While it appears to work with the newest version, there are still bugs while using insertArrowTable etc. I'm leaving all this here for others.

Part 2

1.28.1-dev132.0

Getting "Type metadata cannot be none" while passing to insertArrowTable or insertArrowFromIPCStream. A valid IPC / arrow table was used.

Edit: WORKAROUND / SOLVED

This works in duck:

    await _dbc.query("CREATE TABLE test_insert_2(g struct(t int)[]);");
    await _dbc.query("insert into test_insert_2 (g) values([{t:5}]);");
    tt2 = await _dbc.query("select * from test_insert_2;");
    await _dbc.insertArrowTable(tt2, {
       name: "test_insert_3",
       //create: false,
       schema: "main"
     });

As it introduces the list as a field with a name and metadata (that seems to be required) even though in arrow its not. So it seems a bug in duck but easy to workaround.

Just need to fill in the details here:

Screenshot 2024-02-22 at 4 03 30 PM

Part 1

What happens?

nothing - no error - nothing to report - no insertion

To Reproduce

with duck-wasm 1.28, main and others js-arrow 14.02

const somedata = [
      { key1: "a", key2: 1 },
      { key1: "b", key2: 2 }
    ];
    await c.query("CREATE TABLE someData(key1 VARCHAR, key2 integer)");
    const arrowTable = ar.tableFromJSON(somedata);
    await c.insertArrowTable(arrowTable, {
      name: "somedata",
      //create: true,
      schema: "main"
    });
    await c.insertArrowTable(EOS, { name:  "somedata"});
    console.log("somedata", (await c.query("select * from somedata")).toArray());

returns an empty array

Browser/Environment:

Mac FF and Chrome

Device:

M2

DuckDB-Wasm Version:

1.28

DuckDB-Wasm Deployment:

npm

Full Name:

Andrew Grosser

Affiliation:

Sourcetable

acuitymd-filip commented 7 months ago

I think you need to insert an EOS buffer (new Uint8Array([255, 255, 255, 255, 0, 0, 0, 0])) after the actual data, see this example: https://duckdb.org/docs/api/wasm/data_ingestion.html#apache-arrow

dioptre commented 7 months ago

I tried that and no luck too. Not sure that example is relevant anymore as the function it calls converts it to an ipc object before passing it on.

Also, I tried building from source, and replace everything as is, and seems the newest main version at least isn't loading the wasm files correctly. Is there some weird cross compilation thing I'm missing? Building on a mac.

Also this test appears broken. https://github.com/duckdb/duckdb-wasm/blob/7704412541e560b62c9d35bc4f72b66e188faa06/packages/duckdb-wasm/test/insert_arrow.test.ts#L229

This is in contradiction to the discussion here: https://github.com/duckdb/duckdb-wasm/discussions/1545

cjuko commented 7 months ago

I had the same issue. On a fresh @duckdb/wasm npm project, apache-arrow@11.0.0 is installed automatically. In this case, inserting using arrow works. Once you upgrade apache-arrow to 15.0.0, it fails.

jonathanswenson commented 7 months ago

I was also running into the same issue. TL;DR with careful selection of which arrow version was in my own package.json I was able to get things working.

Documented a little more here: https://github.com/duckdb/duckdb-wasm/discussions/1545#discussioncomment-8550373

I'd been experimenting with updating to 1.28.1-dev106.0 as it appears as though a new release might be coming and have noticed that I cannot load data (with or without the additional EOS buffer)

I was playing with different versions of apache-arrow in my own package.json and found that matching exactly (14.0.1 for dev106 and 15.0.0 for dev132) would allow me to successfully insert data WITHOUT having to add the EOS buffer.

dioptre commented 7 months ago

Ah legend thanks - moving to dev132 and 15 fixes it!

dioptre commented 7 months ago

Ah seems like there's still errors, with a valid arrow table I get:

Error: Type metadata cannot be none
    at _class.insertArrowFromIPCStream (bindings_base.ts:342:19)
    at _callee$ (worker_dispatcher.ts:370:36)
    at tryCatch (runtime.js:64:40)
    at Generator.invoke (runtime.js:299:22)
    at Generator.next (runtime.js:124:21)
    at asyncGeneratorStep (_async_to_generator.js:3:28)
    at _next (_async_to_generator.js:20:17)
    at _async_to_generator.js:27:13
    at new Promise (<anonymous>)
    at _async_to_generator.js:16:16
async_bindings.ts:150 Uncaught (in promise) Error: Type metadata cannot be none
    at _class.onMessage (async_bindings.ts:150:23)
jonathanswenson commented 7 months ago

I just tried this on my current working version (1.28.0 -- with arrow 13.0.0):

const byteArray = Uint8Array.from([
  255,255,255,255,224,0,0,0,16,0,0,0,0,0,10,0,16,0,14,0,7,0,8,
  0,10,0,0,0,0,0,0,1,16,0,0,0,0,0,4,0,8,0,8,0,0,0,4,0,8,0,0,0,
  4,0,0,0,1,0,0,0,4,0,0,0,176,255,255,255,20,0,0,0,0,0,0,1,20,
  0,0,0,0,0,0,12,140,0,0,0,1,0,0,0,103,0,0,0,1,0,0,0,20,0,0,0,
  16,0,16,0,0,0,0,0,11,0,12,0,0,0,4,0,16,0,0,0,12,0,0,0,0,0,0,
  0,84,0,0,0,1,0,0,0,20,0,0,0,16,0,24,0,4,0,11,0,19,0,20,0,0,0,
  12,0,16,0,0,0,20,0,0,0,0,0,0,1,20,0,0,0,0,0,0,2,24,0,0,0,1,0,
  0,0,116,0,0,0,0,0,0,0,8,0,12,0,8,0,7,0,8,0,0,0,0,0,0,1,32,0,
  0,0,252,255,255,255,4,0,4,0,4,0,0,0,0,0,0,0,255,255,255,255,
  216,0,0,0,20,0,0,0,0,0,0,0,12,0,22,0,20,0,15,0,16,0,4,0,12,0,
  0,0,16,0,0,0,0,0,0,0,0,0,0,3,16,0,0,0,4,0,10,0,24,0,12,0,8,0,
  4,0,10,0,0,0,20,0,0,0,104,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,5,0,
  0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,8,0,0,0,
  0,0,0,0,8,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,8,0,0,0,0,0,0,0,0,0,
  0,0,0,0,0,0,8,0,0,0,0,0,0,0,8,0,0,0,0,0,0,0,0,0,0,0,3,0,0,0,
  1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,
  0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,5,0,0,0,
  0,0,0,0,255,255,255,255,0,0,0,0
])

const dataTable = tableFromIPC(byteArray)
console.log(JSON.stringify(dataTable.toArray()))
await connection.insertArrowTable(dataTable, {
  name: "table_name"
})

const results = await connection.query("select * from \"table_name\"")
console.log(JSON.stringify(results.toArray()))

This seems to work just fine, but the original inserted table (dataTable) and the query result (result) seem to not successfully be capturing the nested struct. output of the above for me is:

[{"g":[null]}]

Wondering if this is a problem with the nested struct (or duckdb's compatibility with loading in arrow structs).

dioptre commented 7 months ago

The newest version works for me, but I had to prefix the "new List" like so:

new List(Field.new({ name: "l", metadata: new Map(), {...other details you need in both arrow and duck}

That matches up with the schema that duck produces in 1.28.1-dev132.0

@carlopi ever changing this will be tragic for people - would love to get a thought on this

Thanks for you help @jonathanswenson

Think the older version may not have support structs as well?