Closed: david542542 closed this 2 weeks ago
@Mytherin I think this is actually an issue with DuckDB core's arrow layer, as I see the same issue with duckdb-rs. I did briefly discuss this with @Maxxen as well
Why does the problem then not occur in Python?
>>> import duckdb
>>> duckdb.sql("SELECT INTERVAL '2 year'").arrow()
# pyarrow.Table
# CAST('2 year' AS INTERVAL): month_day_nano_interval
# ----
# CAST('2 year' AS INTERVAL): [[24M0d0ns]]
Actually, since both duckdb-rs and duckdb-wasm use the Rust arrow library, is it perhaps possible the problem is in the Rust Arrow library?
That's possible, yeah. I'd assumed that wasm used arrow-js, but it does indeed seem to use arrow-rs.
Max commented in Discord:
Sounds like a lifetime issue with the arrow return values then, might be some arrow array destructor we set too early or something. Will have a look
Looks like they did have issues here at one point too: https://github.com/apache/arrow-rs/pull/2235/files
While possible, a lifetime issue seems unlikely to me here: the problem seems to be that 24 months are interpreted as 24 seconds, and the issue appears to be consistent.
Seems like it is wrong for all time units. See here.
Looks like wasm uses a very old version of arrow-rs. @carlopi perhaps we can try to upgrade this?
Using arrow-js (basically when not using the shell infrastructure but passing arrow results directly to JS), the results are correctly generated and parsed.
I will look at bumping arrow-rs.
Sorry, I said I was going to look into this but then went on vacation, and since I got back I've been distracted with other things that need to land before the release. I'll investigate more tomorrow.
This occurs in the rust client as well, even with the latest arrow.
My guess at this point is that this is a bug within arrow-rs related to ingestion of intervals through the Arrow C API. Perhaps we can try to create a minimal example and file an issue in their repo?
If @Maxxen would be up to it, I think duckdb-rs would work better as the reproducer: it's easier to test in isolation (also for the arrow people), and it's simpler to trust since there are fewer layers that can go wrong compared to duckdb-wasm's shell (and it's not on me!).
Actually, I think we might be in the wrong here. Arrow stores the interval in a 128-bit integer, interpreted as multiple fields:
┌──────────────────────────────┬─────────────┬──────────────┐
│ Nanos │ Days │ Months │
│ (64 bits) │ (32 bits) │ (32 bits) │
└──────────────────────────────┴─────────────┴──────────────┘
0 63 95 127 bit offset
But our ArrowInterval type looks like this:
struct ArrowInterval {
int32_t months;
int32_t days;
int64_t nanoseconds;
};
Which we just reinterpret straight up when writing out the arrow buffer.
      continue;
    }
    result_data[result_idx] = OP::template Operation<TGT, SRC>(data[source_idx]);
  }
  append_data.row_count += size;
}
};
(lldb) p result_data[result_idx]
(duckdb::ArrowInterval) (months = 24, days = 0, nanoseconds = 0)
But that's not the right order when reinterpreting as an i128 (which is what Rust is doing on the other side):
(lldb) p *(__int128*)&result_data[result_idx]
(__int128) 24
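The same reinterpretation can be sketched in stand-alone Rust. This is a hypothetical struct mirroring duckdb::ArrowInterval, not DuckDB's actual code, and it assumes a little-endian target: reinterpreting the struct bytes as an i128 puts `months` in the least significant 32 bits.

```rust
// Hypothetical mirror of duckdb::ArrowInterval (not the actual DuckDB code).
#[repr(C)]
struct ArrowInterval {
    months: i32,
    days: i32,
    nanoseconds: i64,
}

// Reinterpret the struct bytes as an i128, as a consumer of the Arrow C
// interface would. Illustration only; assumes a little-endian target.
fn reinterpret_as_i128(ival: ArrowInterval) -> i128 {
    // Both types are 16 bytes, so a by-value transmute is well-defined here.
    unsafe { std::mem::transmute::<ArrowInterval, i128>(ival) }
}

fn main() {
    // INTERVAL '2 year' = 24 months, 0 days, 0 nanoseconds.
    let v = reinterpret_as_i128(ArrowInterval { months: 24, days: 0, nanoseconds: 0 });
    println!("{v}"); // 24: the months land in the least significant bits
}
```

This matches the lldb output above: the whole 128-bit value reads as 24 because `months` occupies the low-order bytes.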
pub fn to_parts(
i: <IntervalMonthDayNanoType as ArrowPrimitiveType>::Native,
) -> (i32, i32, i64) {
let months = (i >> 96) as i32;
let days = (i >> 64) as i32;
let nanos = i as i64;
(months, days, nanos)
}
I'm not sure why this works in Python, though...
The Rust code looks wrong to me. The definition is in little-endian: i >> 96 means we take the most significant bits for the months, when it should be the least significant bits. Here are some other definitions in the official Arrow project:
It also doesn't really make sense for the Rust driver to be correct when it is the one that disagrees with everything else. Either DuckDB, Arrow-C++, Arrow-Python, Arrow-Go, Arrow-R, and Arrow-JS are all wrong, or Arrow-RS is wrong. The latter seems more logical to me.
Are you saying it was correct before this change? https://github.com/apache/arrow-rs/pull/2235
Where's this diagram from, @Maxxen?
┌──────────────────────────────┬────────────┬──────────────┐
│ Nanos │ Days │ Months │
│ (64 bits) │ (32 bits) │ (32 bits) │
└──────────────────────────────┴────────────┴──────────────┘
0 63 95 127 bit offset
EDIT: Disregard, it's the same as https://github.com/apache/arrow-rs/pull/2235/files#diff-c47b2e57ea72cdad6427c3149de134be3437e403984d367e3604cd0943a9a963R288, just backwards
All integers in the types below are stored in the endianness indicated by the schema.
@Mause The diagram is from here: https://github.com/apache/arrow-rs/blob/36a6e515f99866eae5332dfc887c6cb5f8135064/arrow-array/src/types.rs#L265-L269
I agree, it seems like the Rust definition is wrong. So what now?
I would say we should double-check that the rust version is wrong and try to create a reproducer separate from DuckDB, then file a bug report in the arrow-rs repository.
Here's a stand-alone C++ program showcasing the problem:
#include <stdint.h>
#include <stdio.h>
using uint128_t = __uint128_t;
struct MonthDayNanos {
int32_t months;
int32_t days;
int64_t nanoseconds;
};
int main() {
MonthDayNanos ival;
ival.months = 32;
ival.days = 64;
ival.nanoseconds = 128;
uint128_t u128val = *reinterpret_cast<uint128_t *>(&ival);
printf("Rust code (incorrect)\n");
printf("months %u\n", uint32_t(u128val >> 96));
printf("days %u\n", uint32_t(u128val >> 64));
printf("nanos %llu\n", uint64_t(u128val));
printf("\n");
printf("Correct shifts\n");
printf("months %u\n", uint32_t(u128val));
printf("days %u\n", uint32_t(u128val >> 32));
printf("nanos %llu\n", uint64_t(u128val >> 64));
}
Prints:
Rust code (incorrect)
months 0
days 128
nanos 274877906976
Correct shifts
months 32
days 64
nanos 128
Hi, is there any update on this? Should I file a bug report in arrow-rs, or what is necessary in getting this resolved?
Arrow C++ has https://github.com/apache/arrow/blob/7003e90c113cec58a620deddf56af71eb305af2a/cpp/src/arrow/type.h#L1775-L1784
Arrow Java uses the same layout: https://github.com/apache/arrow/blob/7003e90c113cec58a620deddf56af71eb305af2a/java/vector/src/main/java/org/apache/arrow/vector/IntervalMonthDayNanoVector.java#L171-L181
> Arrow C++ has https://github.com/apache/arrow/blob/7003e90c113cec58a620deddf56af71eb305af2a/cpp/src/arrow/type.h#L1775-L1784
> Arrow Java uses the same layout: https://github.com/apache/arrow/blob/7003e90c113cec58a620deddf56af71eb305af2a/java/vector/src/main/java/org/apache/arrow/vector/IntervalMonthDayNanoVector.java#L171-L181
That's mentioned above, the Arrow C++ layout is:
struct MonthDayNanos {
int32_t months;
int32_t days;
int64_t nanoseconds;
};
The Rust code for parsing that as a uint128_t is as follows:
pub fn to_parts(
i: <IntervalMonthDayNanoType as ArrowPrimitiveType>::Native,
) -> (i32, i32, i64) {
let months = (i >> 96) as i32;
let days = (i >> 64) as i32;
let nanos = i as i64;
(months, days, nanos)
}
As shown in my stand-alone program here - https://github.com/duckdb/duckdb-wasm/issues/1696#issuecomment-2047272977 - those shifts produce the wrong results given the Arrow C++ layout.
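For comparison, here is what corrected shifts could look like in Rust, matching the little-endian C++ struct layout (months in bits 0..32, days in 32..64, nanos in 64..128). This is a sketch, not the literal arrow-rs patch; `to_parts_fixed` is a hypothetical name.

```rust
// Hypothetical corrected decoding for the Arrow C++ MonthDayNanos layout on a
// little-endian machine. Not the actual arrow-rs fix, just an illustration.
fn to_parts_fixed(i: i128) -> (i32, i32, i64) {
    let months = i as i32;        // least significant 32 bits
    let days = (i >> 32) as i32;  // next 32 bits
    let nanos = (i >> 64) as i64; // most significant 64 bits
    (months, days, nanos)
}

fn main() {
    // Pack months=32, days=64, nanos=128 the way the C struct lays them out.
    let packed = (32u32 as i128) | ((64u32 as i128) << 32) | ((128u64 as i128) << 64);
    println!("{:?}", to_parts_fixed(packed)); // (32, 64, 128)
}
```

With these shifts, the test values from the stand-alone C++ program above decode correctly.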
I noticed the integration tests linked involve files - perhaps the byte order for Arrow files is different from the in-memory byte order (big-endian vs little-endian)?
Yes, sorry, I was just posting this for reference.
> I noticed the integration tests linked involve files - perhaps the byte order for Arrow files is different from the in-memory byte order (big-endian vs little-endian)?
No, it's probably just that arrow-rs has the same misinterpretation bug in all code paths. So when you ask it to roundtrip the data from JSON to IPC, for example, it will produce the intended IPC results.
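That self-consistency can be sketched as follows: an encode/decode pair that both use the swapped layout still roundtrips perfectly, which is why tests that never cross an implementation boundary don't catch the bug. The helper names are hypothetical, not arrow-rs APIs.

```rust
// Sketch: encode and decode that both use the SWAPPED (incorrect) layout,
// i.e. months in the most significant bits. Hypothetical helpers.
fn swapped_encode(months: i32, days: i32, nanos: i64) -> i128 {
    ((months as u32 as i128) << 96) | ((days as u32 as i128) << 64) | (nanos as u64 as i128)
}

fn swapped_decode(i: i128) -> (i32, i32, i64) {
    ((i >> 96) as i32, (i >> 64) as i32, i as i64)
}

fn main() {
    // Roundtripping through a consistently wrong layout returns the original
    // parts; the mismatch only surfaces when another implementation (e.g.
    // Arrow C++ via the C API) produced the bits.
    let (m, d, n) = swapped_decode(swapped_encode(24, 0, 0));
    println!("roundtrip ok: {m} months, {d} days, {n} ns");
}
```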
Makes sense, thanks for picking this up!
Well, you'll have to thank @tustvold for that :-)
@Mytherin brief update, about ~two weeks away: https://github.com/apache/arrow-rs/issues/5654#issuecomment-2108732883
Thanks for driving this @david542542!
Hi @Mytherin + @Maxxen,
It looks like this has been released (or awaiting release): https://github.com/apache/arrow-rs/issues/5688.
Are we able to add that item into the upcoming DuckDB bug-fix release?
Thanks everyone for pushing forward on this.
I have a local branch that bumps the arrow-rs version. I will need to double-check a few things, but ideally this can soon be fixed upstream.
@carlopi cool, will this be in the next DuckDB release (1.1.0), or when will this fix be added in?
What happens?
Running SELECT INTERVAL '2 year' (or really any other INTERVAL) produces the wrong output in the wasm shell. It seems like it has to do with endianness: 2 years is 24 months, and here 24 is put into the fractional seconds, i.e. at the other end of the number.
To Reproduce
You can run this query here (click to open in wasm shell).
SELECT INTERVAL '2 day'
OS:
Mac > Chrome > DuckDB Wasm Shell
DuckDB Version:
v0.10.1
DuckDB Client:
Wasm
Full Name:
David Litwin
Affiliation:
None
Have you tried this on the latest nightly build?
I have tested with a nightly build
Have you tried the steps to reproduce? Do they include all relevant data and configuration? Does the issue you report still appear there?