duckdb / duckdb-wasm

WebAssembly version of DuckDB
https://shell.duckdb.org
MIT License
1.14k stars 120 forks source link

INTERVAL processing seems broken in latest DuckDb WASM #1274

Closed antonycourtney closed 2 months ago

antonycourtney commented 1 year ago

What happens?

In a number of simple queries involving INTERVAL values, DuckDb WASM produces what appears to be a wrong result; it is at least inconsistent with the DuckDb CLI. Several basic examples:

Here are a few queries I tried on shell.duckdb.org and their results:

Database: v0.7.2-dev3783
Package:  @duckdb/duckdb-wasm@1.25.1-dev13.0

Connected to a local transient in-memory database.
Enter .help for usage hints.

duckdb> select to_days(5) as days;
┌────────────────────────────────────────────────────────┐
│ days                                                   │
╞════════════════════════════════════════════════════════╡
│ 0 years 0 mons 0 days 0 hours 0 mins 21.474836480 secs │
└────────────────────────────────────────────────────────┘
Elapsed: 19 ms

duckdb> select interval '1' day as days;
┌───────────────────────────────────────────────────────┐
│ days                                                  │
╞═══════════════════════════════════════════════════════╡
│ 0 years 0 mons 0 days 0 hours 0 mins 4.294967296 secs │
└───────────────────────────────────────────────────────┘
Elapsed: 5 ms

duckdb> SELECT (CURRENT_TIMESTAMP + INTERVAL '1' DAY) - CURRENT_TIMESTAMP as result;
Binder Error: No function matches the given name and argument types '+(TIMESTAMP WITH TIME ZONE, INTERVAL)'. You might need to add explicit type casts.
...

and here are the same queries executed with the DuckDb 0.8.0 CLI:

$ duckdb
v0.8.0 e8e4cea5ec
Enter ".help" for usage hints.
Connected to a transient in-memory database.
Use ".open FILENAME" to reopen on a persistent database.
D select to_days(5) as days;
┌──────────┐
│   days   │
│ interval │
├──────────┤
│ 5 days   │
└──────────┘
D select interval '1' day as days;
┌──────────┐
│   days   │
│ interval │
├──────────┤
│ 1 day    │
└──────────┘
D SELECT (CURRENT_TIMESTAMP + INTERVAL '1' DAY) - CURRENT_TIMESTAMP as result;
┌──────────┐
│  result  │
│ interval │
├──────────┤
│ 1 day    │
└──────────┘
D

To Reproduce

Try any of the simple queries from above:

select to_days(5) as days;
select interval '1' day as days;
SELECT (CURRENT_TIMESTAMP + INTERVAL '1' DAY) - CURRENT_TIMESTAMP as result;

Browser/Environment:

Chrome Version 113.0.5672.92 (Official Build) (arm64)

Device:

M2 MacBook Pro

DuckDB-Wasm Version:

1.25.1-dev13.0

DuckDB-Wasm Deployment:

shell.duckdb.org

Full Name:

Antony Courtney

Affiliation:

MotherDuck

carlopi commented 1 year ago

Thanks for the report.

There are two separate problems here:

select to_days(5) as days;
select interval '1' day as days;

This is the main one, somehow related to arrow types, I am trying to find where it goes wrong and fix this bug. Note to self: SELECT 1::INTERVAL also fails in a funny way, without showing any result.

SELECT (CURRENT_TIMESTAMP + INTERVAL '1' DAY) - CURRENT_TIMESTAMP as result;

This is funny enough a "work as intended", as ICU is not included in the duckdb-wasm bundle. Building duckdb CLI with just make without explicitly enabling any extension and running the query shows the same behaviour.

This is more deep, and the proper approach either bundling ICU or moving to experimental extension loading.

I will update when I have a fix for the bug.

antonycourtney commented 1 year ago

Thanks for looking into this and the update @carlopi! Will monitor this issue for updates.

spren9er commented 1 year ago

Are there any updates on this?

I also ran into issues working with INTERVAL data in DuckDB-WASM.

Could it be that the mapping from INTERVAL DuckDB type to INTERVAL type of Arrow tables does not work properly?

When I load an .arrow file (created from a Parquet file) into an Arrow table in JS, for each INTERVAL data entry I get a Uint8Array(12), a representation of month, days, and milliseconds.

import { tableFromIPC } from 'apache-arrow';

const table = await tableFromIPC(fetch('interval.arrow'));
const values = table.getChild('interval').get(0);

I can convert this array into a human readable format via

getIntervalData(values);

with

function getIntervalData(values: Uint8Array) {
  const intervalMonths = base256ToDecimal(values.slice(0, 4));
  const intervalDays = base256ToDecimal(values.slice(4, 8));
  const intervalMilliSeconds = base256ToDecimal(values.slice(8, 12));

  const seconds = Math.floor(intervalMilliSeconds / 1000);
  const minutes = Math.floor(seconds / 60);
  const hours = Math.floor(minutes / 60);
  const days = Math.floor(hours / 24);

  const remainingMilliseconds = intervalMilliSeconds % 1000;
  const remainingSeconds = seconds % 60;
  const remainingMinutes = minutes % 60;
  const remainingHours = hours % 24;

  return {
    years: Math.floor(intervalMonths / 12),
    months: intervalMonths % 12,
    days: intervalDays + days,
    hours: remainingHours,
    minutes: remainingMinutes,
    seconds: remainingSeconds,
    milliseconds: remainingMilliseconds
  };
}

function base256ToDecimal(values: Uint8Array) {
  return values.reduce((agg: number, value: number, idx: number) => {
    return agg + value * Math.pow(256, idx);
  }, 0)
}

When I load the corresponding Parquet file with same data into DuckDB-WASM, data seems to be messed up. For each data entry I receive a Uint32Array(2) (via getChild('interval').get), which only holds 1/4 of information (months; rest is in subsequent three parts). In other words, vector length is four times longer than number of records. It is also not clear, how to get milliseconds from third and fourth part. I would have expected a similar structure like in Arrow table above.

Same Parquet file works flawless in DuckDB CLI.

duckdb_interval_data.zip

Y-- commented 6 months ago

Following-up on this issue: it seems that icu is not being auto-loaded (as other extensions such as json are)

Here with the latest shell:

DuckDB Web Shell
Database: v0.10.0
Package:  @duckdb/duckdb-wasm@1.28.1-dev153.0

Connected to a local transient in-memory database.
Enter .help for usage hints.

duckdb> select cast(to_timestamp(100000) at time zone 'PST' as date) as event_date
   ...> ;
Binder Error: No function matches the given name and argument types 'timezone(STRING_LITERAL, TIMESTAMP WITH TIME ZONE)'. You might need to add explicit type casts.
        Candidate functions:
        timezone(DATE) -> BIGINT
        timezone(TIMESTAMP) -> BIGINT
        timezone(INTERVAL) -> BIGINT
        timezone(INTERVAL, TIME WITH TIME ZONE) -> TIME WITH TIME ZONE

LINE 1: select cast(to_timestamp(100000) at time zone 'PST' as date) as event_da...
                                         ^

duckdb> load icu;
┌┐
└┘

duckdb> select cast(to_timestamp(100000) at time zone 'PST' as date) as event_date;
┌────────────┐
│ event_date │
╞════════════╡
│ 1970-01-01 │
└────────────┘

Is that expected?

carlopi commented 6 months ago

Hi @Y--, as you already noticed in the meantime, ICU is not tagged as autoloadable by duckdb (yet). I will look into sending a PR for that, preparatory work was already done in https://github.com/duckdb/duckdb/pull/9065, see in particular @Mytherin comment on what was missing back then.

Given ICU comes as loaded by default on most cases, I can see a reason to backport that PR also to duckdb-wasm on version v0.10.0.

matys1 commented 3 months ago

@carlopi Is there an update on this? I think I'm running into an error that is related to this issue. I'm using duckdb-wasm@1.28.1-dev221.0.

await connection.query(`
  BEGIN TRANSACTION;

  ALTER TABLE testData ADD COLUMN TripCreatedTz TIMESTAMPTZ;
  UPDATE testData SET TripCreatedTz = CAST(TripCreated AS TIMESTAMP) AT TIME ZONE 'UTC';

  COMMIT TRANSACTION;
`);

In chrome dev tools console:

Error: Binder Error: No function matches the given name and argument types 'timezone(STRING_LITERAL, TIMESTAMP)'. You might need to add explicit type casts.
    Candidate functions:
    timezone(DATE) -> BIGINT
    timezone(TIMESTAMP) -> BIGINT
    timezone(INTERVAL) -> BIGINT
    timezone(INTERVAL, TIME WITH TIME ZONE) -> TIME WITH TIME ZONE

LINE 5: ... TABLE testData ADD COLUMN TripCreatedTz TIMESTAMPTZ;
  UPDATE testData SET TripCreatedTz = CAST(TripCreated AS TIMESTAMP) AT TIME ZONE 'UTC';
                                                  ^
    at fa.runQuery (bindings_base.ts:168:19)
    at Ao.onMessage (worker_dispatcher.ts:202:51)
    at Hc.globalThis.onmessage (duckdb-browser-eh.worker.ts:29:19)

I'm new to DuckDB so either I'm doing something wrong or there's some issue with the time zone extension that DuckDB-WASM is using internally?

carlopi commented 2 months ago

There are a few independent problems here in the thread, trying to separate them:

@antonycourtney: intervals computations should now make more sense: duckdb> SELECT '2 minutes'::INTERVAL; ┌───────────────────────────────────────────────────────┐ │ CAST('2 minutes' AS INTERVAL) │ ╞═══════════════════════════════════════════════════════╡ │ 0 years 0 mons 0 days 0 hours 2 mins 0.000000000 secs │ └───────────────────────────────────────────────────────┘ Output is not 1-1 with DuckDB CLI, due to the fact that the console is handled Rust side with a different logic. Important part is that result does make sense. Problem was a weird problem with interpreting the Arrow protocol at the Arrow-Rust level. It has recently been solved in arrow-rust version 52, released last week, and now duckdb-wasm reaps the benefit of having this properly sorted out.

@Y--: There is yet another autoloading problem connected with ICU. I will need to open a bug to duckdb/duckdb on this. Problem is that even though ICU is autoloadable, in this particular case the autoloading mechanism is not working properly. Workaround, a bit heavy handed, is loading ICU explicitly. Problem is like: https://shell.duckdb.org/#queries=v0,select-cast(to_timestamp(100000)-at-time-zone-'PST'-as-date)-as-event_date~,LOAD-icu~,select-cast(to_timestamp(100000)-at-time-zone-'PST'-as-date)-as-event_date~

@matys1: I believe your problem to be similar to the previous one, but hard to be sure since there is not a complete reproduction. Could you please open another bug with a complete (= including table creation or so) repro? Thanks

I will close this for now, due to the main issue being solved. ICU-related problems can likely go to a separate one.

Thanks everyone.

matys1 commented 2 months ago

@carlopi actually I just figured that I can fix the issue by explicitly loading the icu extension by adding load icu; in the query transaction like below and it works.

await connection.query(`
  BEGIN TRANSACTION;

  load icu;

  ALTER TABLE testData ADD COLUMN TripCreatedTz TIMESTAMPTZ;
  UPDATE testData SET TripCreatedTz = CAST(TripCreated AS TIMESTAMP) AT TIME ZONE 'UTC';

  COMMIT TRANSACTION;
`);

I'm not sure if it's worth opening a new bug report since it seems you're already aware of the auto loading issue from other comments. Let me know though I have an example of this in Vite just in case you need.

carlopi commented 2 months ago

@matys1: I moved the issue here: https://github.com/duckdb/duckdb-wasm/issues/1770, this needs solving at the DuckDB level, but given the workaround is relevant is useful also to others I think it's more useful here.

And indeed your problem is the same as the one above, but added to the repro just in case to be able to double check.