Open Wainberg opened 7 months ago
It looks like R maps difftime
objects as an Arrow DurationType
, but DurationType
stores the values as an int64, and so the R values are being casted to integers. This is happening on the C++ side:
One workaround would be to convert the values to numerics before creating the Array:
> delta <- as.difftime(c(0.000, 0.001, 0.002, 1, 1.5), units = "secs")
> arrow::Array$create(as.numeric(delta))
Array
<double>
[
0,
0.001,
0.002,
1,
1.5
]
Maybe the r package could provide a warning about the casting of difftime to integers to the user.
difftime doesn't get any smaller than seconds, so anything smaller than a second will always be truncated. That needs a fix, not a warning!
It looks like the actual truncation happens here:
The cast to the c_type
as David noted, is a cast to an int64. On this line, one could check that you're not doing any truncation (I think you can use std::modf()
for that). You would probably have to do something like count the number of lossy casts (e.g., this->n_lossy_casts_++
) and issue the warning at the very end of the conversion.
Perhaps the underlying cause is that we infer the unit of "seconds" by default. We could infer "milliseconds" or "microseconds" which would avoid truncation (or would limit it to thousandths or millionths of a second). I don't know why "seconds" is the default but a good fix for this might be to change it to "ms" or "us" (or add an options()
to do so, perhaps migrating to a safer default over several versions with some warnings).
arrow::infer_type(as.difftime(double(), units = "secs"))
#> DurationType
#> duration[s]
A workaround could be to specify the type explicitly:
delta <- as.difftime(c(0.000, 0.001, 0.002, 1, 1.5), units = "secs")
delta |>
arrow::as_arrow_array(type = arrow::duration("ms"))
#> Array
#> <duration[ms]>
#> [
#> 0,
#> 1,
#> 2,
#> 1000,
#> 1500
#> ]
It looks like I inferred "microseconds" by default in nanoarrow although I forget the reasoning:
library(nanoarrow)
delta <- as.difftime(c(0.000, 0.001, 0.002, 1, 1.5), units = "secs")
delta |>
as_nanoarrow_array() |>
arrow::as_arrow_array()
#> Array
#> <duration[us]>
#> [
#> 0,
#> 1000,
#> 2000,
#> 1000000,
#> 1500000
#> ]
Not sure where this landed but if it's not going to warn about the type casting it would be helpful to mention it in the documentation. I can submit a PR if you'd like.
Truncating times to the nearest second isn't ok! That's not reasonable behavior!
The default should be to convert to nanoseconds, like arrow::as_arrow_array(type = arrow::duration("ns"))
. 2^64 ns is ~585 years, so you don't generally have to worry about overflow.
Related issue #32693
Describe the bug, including details regarding any error messages, version, and platform.
This should be [0, 0.001, 0.002, 1, 1.5], not [0, 0, 0, 1, 1]. Same issue with
arrow::Array$create(hms::hms(c(0.000, 0.001, 0.002, 1, 1.5)))
, which becomestime32[s]
.R version 4.3.2, R Arrow version 14.0.1.
Component(s)
R