apache / arrow

Apache Arrow is a multi-language toolbox for accelerated data interchange and in-memory processing
https://arrow.apache.org/
Apache License 2.0
14.35k stars 3.49k forks source link

[R] Initializing an Arrow array from an R difftime or hms::hms rounds down to the nearest time unit #40109

Open Wainberg opened 7 months ago

Wainberg commented 7 months ago

Describe the bug, including details regarding any error messages, version, and platform.

> delta <- as.difftime(c(0.000, 0.001, 0.002, 1, 1.5), units = "secs")
> delta
Time differences in secs
[1] 0.000 0.001 0.002 1.000 1.500
> arrow::Array$create(delta)
Array
<duration[s]>
[
  0,
  0,
  0,
  1,
  1
]

This should be [0, 0.001, 0.002, 1, 1.5], not [0, 0, 0, 1, 1]. Same issue with arrow::Array$create(hms::hms(c(0.000, 0.001, 0.002, 1, 1.5))), which becomes time32[s].

R version 4.3.2, R Arrow version 14.0.1.

Component(s)

R

dgreiss commented 7 months ago

It looks like R maps difftime objects as an Arrow DurationType, but DurationType stores the values as an int64, and so the R values are being casted to integers. This is happening on the C++ side:

https://github.com/apache/arrow/blob/909f6f90ebdbb2e16b223d768ee10c78e0b37bfb/cpp/src/arrow/type.h#L1811-L1817

One workaround would be to convert the values to numerics before creating the Array:

> delta <- as.difftime(c(0.000, 0.001, 0.002, 1, 1.5), units = "secs")
> arrow::Array$create(as.numeric(delta))
Array
<double>
[
  0,
  0.001,
  0.002,
  1,
  1.5
]

Maybe the r package could provide a warning about the casting of difftime to integers to the user.

Wainberg commented 7 months ago

difftime doesn't get any smaller than seconds, so anything smaller than a second will always be truncated. That needs a fix, not a warning!

paleolimbot commented 7 months ago

It looks like the actual truncation happens here:

https://github.com/apache/arrow/blob/214378b522a36fbf6010e3d4f5470abaca7bf92e/r/src/r_to_arrow.cpp#L926

The cast to the c_type as David noted, is a cast to an int64. On this line, one could check that you're not doing any truncation (I think you can use std::modf() for that). You would probably have to do something like count the number of lossy casts (e.g., this->n_lossy_casts_++) and issue the warning at the very end of the conversion.

Perhaps the underlying cause is that we infer the unit of "seconds" by default. We could infer "milliseconds" or "microseconds" which would avoid truncation (or would limit it to thousandths or millionths of a second). I don't know why "seconds" is the default but a good fix for this might be to change it to "ms" or "us" (or add an options() to do so, perhaps migrating to a safer default over several versions with some warnings).

arrow::infer_type(as.difftime(double(), units = "secs"))
#> DurationType
#> duration[s]

A workaround could be to specify the type explicitly:

delta <- as.difftime(c(0.000, 0.001, 0.002, 1, 1.5), units = "secs")
delta |> 
  arrow::as_arrow_array(type = arrow::duration("ms"))
#> Array
#> <duration[ms]>
#> [
#>   0,
#>   1,
#>   2,
#>   1000,
#>   1500
#> ]

It looks like I inferred "microseconds" by default in nanoarrow although I forget the reasoning:

library(nanoarrow)

delta <- as.difftime(c(0.000, 0.001, 0.002, 1, 1.5), units = "secs")
delta |> 
  as_nanoarrow_array() |> 
  arrow::as_arrow_array()
#> Array
#> <duration[us]>
#> [
#>   0,
#>   1000,
#>   2000,
#>   1000000,
#>   1500000
#> ]
dgreiss commented 7 months ago

Not sure where this landed but if it's not going to warn about the type casting it would be helpful to mention it in the documentation. I can submit a PR if you'd like.

Wainberg commented 7 months ago

Truncating times to the nearest second isn't ok! That's not reasonable behavior!

The default should be to convert to nanoseconds, like arrow::as_arrow_array(type = arrow::duration("ns")). 2^64 ns is ~585 years, so you don't generally have to worry about overflow.

eitsupi commented 6 months ago

Related issue #32693