duckdb / duckdb-wasm

WebAssembly version of DuckDB
https://shell.duckdb.org
MIT License
1.02k stars 110 forks source link

INTERVAL bug (wrong endian?) #1696

Closed david542542 closed 2 weeks ago

david542542 commented 2 months ago

What happens?

Running SELECT INTERVAL '2 year' (or really any other INTERVAL) produces the wrong output in the wasm shell. It seems like it has to do with endianness: 2 years is 24 months and here 24 is put into fractional seconds, i.e at the other end of the number.

image

To Reproduce

You can run this query here (click to open in wasm shell).

SELECT INTERVAL '2 day'

OS:

Mac > Chrome > DuckDB Wasm Shell

DuckDB Version:

v0.10.1

DuckDB Client:

Wasm

Full Name:

David Litwin

Affiliation:

None

Have you tried this on the latest nightly build?

I have tested with a nightly build

Have you tried the steps to reproduce? Do they include all relevant data and configuration? Does the issue you report still appear there?

Mause commented 2 months ago

@Mytherin I think this is actually an issue with DuckDB core's arrow layer, as I see the same issue with duckdb-rs. I did briefly discuss this with @Maxxen as well

Mytherin commented 2 months ago

Why does the problem then not occur in Python?

>>> import duckdb
>>> duckdb.sql("SELECT INTERVAL '2 year'").arrow()
# pyarrow.Table
# CAST('2 year' AS INTERVAL): month_day_nano_interval
# ----
# CAST('2 year' AS INTERVAL): [[24M0d0ns]]
Mytherin commented 2 months ago

Actually since both duckdb-rs and duckdb-wasm use the Rust arrow library, is it perhaps possible the problem is in the Rust Arrow library?

Mause commented 2 months ago

That's possible yeah, I'd assumed that wasm used arrow-js, but it does seem to use arrow-rs yes

Max commented in Discord:

Sounds like a lifetime issue with the arrow return values then, might be some arrow array destructor we set too early or something. Will have a look

Mause commented 2 months ago

Looks like they did have issues here at one point too: https://github.com/apache/arrow-rs/pull/2235/files

Mytherin commented 2 months ago

While possible a lifetime issue seems unlikely to me here - the problem here seems to be that 24 months are interpreted as 24 seconds, and the issue appears to be consistent.

david542542 commented 2 months ago

Seems like it is wrong for all time units. See here.

image
Mytherin commented 2 months ago

Looks like wasm uses a very old version of arrow-rs. @carlopi perhaps we can try to upgrade this?

carlopi commented 2 months ago

Using arrow-js (basically when not using the shell infrastructure but passing arrow results directly to JS) the results are correctly parsed generated.

I will look at bumping arrow-rs.

Maxxen commented 2 months ago

Sorry, I said I was going to look into this but then went on vacation and been distracted with other things to get in before the release since I got back. Ill investigate more tomorrow.

Maxxen commented 2 months ago

This occurs in the rust client as well, even with latest arrow.

Mytherin commented 2 months ago

My guess at this point is that this is a bug within arrow-rs related to ingestion of intervals through the Arrow C API. Perhaps we can try to create a minimal example and file an issue in their repo?

carlopi commented 2 months ago

If @Maxxen would be up to it, I think given duckdb-rs is easier to test in isolation (also for the arrow people) and it's simpler to trust since there are less layers of things going wrong wrt duckdb-wasm's shell, I think that would work better as reproducer (and it's not on me!).

Maxxen commented 2 months ago

Actually, I think we might be in the wrong here. Arrow stores the interval in a 128bit integer, interpreted as multiple fields;

┌──────────────────────────────┬─────────────┬──────────────┐
│            Nanos             │    Days     │    Months    │
│          (64 bits)           │ (32 bits)   │  (32 bits)   │
└──────────────────────────────┴─────────────┴──────────────┘
  0                            63            95           127 bit offset

But our ArrowInterval type looks like this:

struct ArrowInterval {
    int32_t months;
    int32_t days;
    int64_t nanoseconds;
}

Which we just reinterpret straight up when writing out the arrow buffer.

   68                                   continue;
   69                           }
   70                           result_data[result_idx] = OP::template Operation<TGT, SRC>(data[source_idx]);
   71                   }
   72                   append_data.row_count += size;
   73           }
   74   };
(lldb) p result_data[result_idx]
(duckdb::ArrowInterval)  (months = 24, days = 0, nanoseconds = 0)

But thats not the right order when reinterpreting as a i128 (which is what rust is doing on the other side)

(lldb) p *(__int128*)&result_data[result_idx]
(__int128) 24
    pub fn to_parts(
        i: <IntervalMonthDayNanoType as ArrowPrimitiveType>::Native,
    ) -> (i32, i32, i64) {
        let months = (i >> 96) as i32;
        let days = (i >> 64) as i32;
        let nanos = i as i64;
        (months, days, nanos)
    }

Im not sure why this works in python though...

Mytherin commented 2 months ago

The rust code looks wrong to me. The definition is in little-endian. i >> 96 means we take the most significant bits for the month, when it should be the least-significant bits. Here are some other definitions in the official Arrow project:

It also doesn't really make sense for the Rust driver to be correct when it is the one that disagrees with everything else. Either all DuckDB, Arrow-C++, Arrow-Python, Arrow-Go, Arrow-R and Arrow-JS are wrong, or Arrow-RS is wrong. The latter seems more logical to me.

Mause commented 2 months ago

Are you saying it was correct before this change? https://github.com/apache/arrow-rs/pull/2235

Mause commented 2 months ago

Where's this diagram from @Maxxen ?

┌──────────────────────────────┬────────────┬──────────────┐
│            Nanos             │    Days     │    Months    │
│          (64 bits)           │ (32 bits)   │  (32 bits)   │
└──────────────────────────────┴────────────┴──────────────┘
  0                            63            95           127 bit offset

EDIT: Disregard, it's the same as https://github.com/apache/arrow-rs/pull/2235/files#diff-c47b2e57ea72cdad6427c3149de134be3437e403984d367e3604cd0943a9a963R288, just backwards

All integers in the types below are stored in the endianness indicated by the schema.

Maxxen commented 2 months ago

@Mause The diagram is from here: https://github.com/apache/arrow-rs/blob/36a6e515f99866eae5332dfc887c6cb5f8135064/arrow-array/src/types.rs#L265-L269

I agree seems like the rust def is wrong. So what now?

Mytherin commented 2 months ago

I would say we should double-check that the rust version is wrong and try to create a reproducer separate from DuckDB, then file a bug report in the arrow-rs repository.

Mytherin commented 2 months ago

Here's a stand-alone C++ program showcasing the problem:

#include <stdint.h>
#include <stdio.h>

using uint128_t = __uint128_t;

struct MonthDayNanos {
    int32_t months;
    int32_t days;
    int64_t nanoseconds;
};

int main() {
    MonthDayNanos ival;
    ival.months = 32;
    ival.days = 64;
    ival.nanoseconds = 128;

    uint128_t u128val = *reinterpret_cast<uint128_t *>(&ival);

    printf("Rust code (incorrect)\n");
    printf("months %u\n", uint32_t(u128val >> 96));
    printf("days %u\n", uint32_t(u128val >> 64));
    printf("nanos %llu\n", uint64_t(u128val));

    printf("\n");

    printf("Correct shifts\n");
    printf("months %u\n", uint32_t(u128val));
    printf("days %u\n", uint32_t(u128val >> 32));
    printf("nanos %llu\n", uint64_t(u128val >> 64));
}

Prints:

Rust code (incorrect)
months 0
days 128
nanos 274877906976

Correct shifts
months 32
days 64
nanos 128
david542542 commented 2 months ago

Hi, is there any update on this? Should I file a bug report in arrow-rs, or what is necessary in getting this resolved?

pitrou commented 2 months ago

Arrow C++ has https://github.com/apache/arrow/blob/7003e90c113cec58a620deddf56af71eb305af2a/cpp/src/arrow/type.h#L1775-L1784

Arrow Java uses the same layout: https://github.com/apache/arrow/blob/7003e90c113cec58a620deddf56af71eb305af2a/java/vector/src/main/java/org/apache/arrow/vector/IntervalMonthDayNanoVector.java#L171-L181

Mytherin commented 2 months ago

Arrow C++ has https://github.com/apache/arrow/blob/7003e90c113cec58a620deddf56af71eb305af2a/cpp/src/arrow/type.h#L1775-L1784

Arrow Java uses the same layout: https://github.com/apache/arrow/blob/7003e90c113cec58a620deddf56af71eb305af2a/java/vector/src/main/java/org/apache/arrow/vector/IntervalMonthDayNanoVector.java#L171-L181

That's mentioned above, the Arrow C++ layout is:

struct MonthDayNanos {
    int32_t months;
    int32_t days;
    int64_t nanoseconds;
};

The Rust code for parsing that as a uint128_t is as follows:

    pub fn to_parts(
        i: <IntervalMonthDayNanoType as ArrowPrimitiveType>::Native,
    ) -> (i32, i32, i64) {
        let months = (i >> 96) as i32;
        let days = (i >> 64) as i32;
        let nanos = i as i64;
        (months, days, nanos)
    }

As shown in my stand-alone program here - https://github.com/duckdb/duckdb-wasm/issues/1696#issuecomment-2047272977 - those shifts produce the wrong results given the Arrow C++ layout.

Mytherin commented 2 months ago

I noticed the integration tests linked involve files - perhaps the byte order for Arrow files is different from the in-memory byte order (big-endian vs little-endian)?

pitrou commented 2 months ago

Yes, sorry, I was just posting this for reference.

pitrou commented 2 months ago

I noticed the integration tests linked involve files - perhaps the byte order for Arrow files is different from the in-memory byte order (big-endian vs little-endian)?

No, it's probably just that arrow-rs has the same misinterpretation bug in all code paths. So when you ask it to roundtrip the data from JSON to IPC, for example, it will produce the intended IPC results.

Mytherin commented 2 months ago

Makes sense, thanks for picking this up!

pitrou commented 2 months ago

Well, you'll have to thank @tustvold for that :-)

david542542 commented 1 month ago

@Mytherin brief update, about ~two weeks away: https://github.com/apache/arrow-rs/issues/5654#issuecomment-2108732883

Maxxen commented 1 month ago

Thanks for driving this @david542542!

david542542 commented 1 month ago

Hi @Mytherin + @Maxxen,

It looks like this has been released (or awaiting release): https://github.com/apache/arrow-rs/issues/5688.

Are we able to add that item into the upcoming DuckDB bug-fix release?

carlopi commented 1 month ago

Thanks everyone for pushing forward on this.

I have a local branch that bumps arrow-rs version, will need to double check a few things, but ideally this can be soon fixed uptream.

david542542 commented 2 weeks ago

@carlopi cool, will this be in the next DuckDB release (1.1.0), or when will this fix be added in?