duckdb / duckdb-wasm

WebAssembly version of DuckDB
https://shell.duckdb.org
MIT License
1.03k stars 113 forks source link

insertArrowTable swaps column values #1565

Open maryhmhhu opened 6 months ago

maryhmhhu commented 6 months ago

What happens?

When using insertArrowTable, the values for columns id and key are swapped.

For below code example, logs of input to insertArrowTable

input 1 {id: '3', key: 0}
input 2 {id: '4', key: 1}

arrowTable 1 [
  {"id": "3", "key": 0}
]
arrowTable 2 [
  {"id": "4", "key": 1}
]

expected

row 1 {key: 0, id: 3}
row 2 {key: 1, id: 4}

actual

row 1 {key: 3, id: 0}
row 2 {key: 4, id: 1}

To Reproduce

With c being an AsyncDuckdbConnection:

      const rowNumCol = 'key';
      const someData = { id: '3', [rowNumCol]: 0 };
      const someData2 = { id: '4', [rowNumCol]: 1 };
      console.log('input 1', someData);
      console.log('input 2', someData2);

      await c.query(
        `CREATE TABLE someData("${rowNumCol}" INTEGER PRIMARY KEY, id INTEGER)`,
      );

      const arrowTable = arrow.tableFromJSON([someData]);
      console.log('arrowTable 1', arrowTable.toString());
      await c.insertArrowTable(arrowTable, {
        create: false,
        name: 'someData',
      });
      const arrowTable2 = arrow.tableFromJSON([someData2]);
      console.log('arrowTable 2', arrowTable2.toString());
      await c.insertArrowTable(arrowTable2, {
        create: false,
        name: 'someData',
      });

      for await (const batch of await c.send(`SELECT * FROM 'someData'`)) {
        let rowCount = 0;
        for (const row of batch) {
          rowCount += 1;
          const r = {};
          for (const [field, val] of row) {
            r[field] = val;
          }
          console.log(`row ${rowCount}`, r);
        }
      }

Browser/Environment:

Chrome Version 120.0.6099.129 (Official Build) (arm64)

Device:

Macbook

DuckDB-Wasm Version:

1.28.0

DuckDB-Wasm Deployment:

local

Full Name:

Mary Hu

Affiliation:

PopSQL

maryhmhhu commented 6 months ago

It seems the order of columns (in the array of objects as an input to https://arrow.apache.org/docs/js/functions/Arrow_dom.tableFromJSON.html) matters; the values are added based on the position in the object, and not based on the names of the columns.

If I add the column values in the same order as the table definiton, then the output is as expected. E.g. if I change the first few lines to

      const rowNumCol = 'key';
      const idCol = 'id';
      const someData = { [rowNumCol]: 0 };
      const someData2 = { [rowNumCol]: 1 };
      someData[idCol] = '3';
      someData2[idCol] = '4';

Then the logs of input is

input 1 {key: 0, id: '3'}
input 2 {key: 1, id: '4'}
arrowTable 1 [
  {"key": 0, "id": "3"}
]
arrowTable 2 [
  {"key": 1, "id": "4"}
]

and the output is

row 1 {key: 0, id: 3}
row 2 {key: 1, id: 4}
domoritz commented 6 months ago

It seems the order of columns (in the array of objects as an input to arrow.apache.org/docs/js/functions/Arrow_dom.tableFromJSON.html) matters; the values are added based on the position in the object, and not based on the names of the columns.

I don't follow. If this is an issue with Arrow, I can look into it.

const table = tableFromJSON([{
    a: 1,
    b: 2
}, {
    b: 2,
    a: 1,
}]);

console.log(table.toArray())

outputs

[ {"a": 1, "b": 2}, {"a": 1, "b": 2} ]
maryhmhhu commented 5 months ago

Hi @domoritz , the issue is only reproduced when the order is not the same in the first entry, but the issue doesn't manifest in the arrow table that's created, rather the reading of the duckdb table.

      const table = arrow.tableFromJSON([
        {
          a: 1,
          b: 2,
        },
      ]);

      console.log('table', table.toString());

outputs a correct arrow table

table [
  {"a": 1, "b": 2}
]

but when reading from duckdb after inserting the table (note - the order of columns in the table created is b then a)

      await c.query(`CREATE TABLE test(b INTEGER PRIMARY KEY, a INTEGER)`);

      await c.insertArrowTable(table, {
        create: false,
        name: 'test',
      });

      for await (const batch of await c.send(`SELECT * FROM 'test'`)) {
        for (const row of batch) {
          const r = {};
          for (const [field, val] of row) {
            r[field] = val;
          }
          console.log(`row`, r);
        }
      }

the output is not correct

{b: 1, a: 2}

I suspect somewhere within insertArrowTable the logic uses the insertion order of entries, instead of strictly looking at key value pairs, which would result in this swap.

domoritz commented 5 months ago

Thanks for the explanation. So it's probably a bug in DuckDB wasm, not arrow.