elixir-explorer / explorer

Series (one-dimensional) and dataframes (two-dimensional) for fast and elegant data exploration in Elixir
https://hexdocs.pm/explorer
MIT License
1.12k stars 123 forks source link

Fix datetime encoding overflow #1020

Closed billylanchantin closed 12 hours ago

billylanchantin commented 2 days ago

Fixes one of the #1014 issues with printing dataframes.

This one stems from how we encode datetimes. Example:

dtypes = [{"col_j", {:naive_datetime, :millisecond}}]
rows = [[{"col_j", ~N[8332-06-03 16:03:40.489446]}]]
df = DF.new(rows, dtypes: dtypes)
DF.print(df)

Panics with:

thread '<unnamed>' panicked at src/encoding.rs:167:35:
attempt to multiply with overflow
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
thread '<unnamed>' panicked at /Users/billy/.cargo/registry/src/index.crates.io-6f17d22bba15001f/polars-arrow-0.43.1/src/temporal_conversions.rs:159:37:
invalid or out-of-range datetime

  1) test specific property failures 1 (Explorer.DataFrameTest)
     test/explorer/data_frame_test.exs:4850
     ** (ErlangError) Erlang error: :nif_panicked
     code: DF.print(df)
     stacktrace:
       (explorer 0.11.0-dev) Explorer.PolarsBackend.Native.s_to_list(#Explorer.PolarsBackend.Series<
  #Reference<0.2007251331.1252655135.131382>
>)
       (explorer 0.11.0-dev) lib/explorer/polars_backend/shared.ex:24: Explorer.PolarsBackend.Shared.apply_series/3
       (explorer 0.11.0-dev) lib/explorer/data_frame.ex:5981: anonymous fn/2 in Explorer.DataFrame.non_empty_table_string/2
       (elixir 1.16.0) lib/enum.ex:1700: Enum."-map/2-lists^map/1-1-"/2
       (explorer 0.11.0-dev) lib/explorer/data_frame.ex:5979: Explorer.DataFrame.non_empty_table_string/2
       (explorer 0.11.0-dev) lib/explorer/data_frame.ex:5945: Explorer.DataFrame.print/2
       test/explorer/data_frame_test.exs:4854: (test)

The culprit is how we encode datetimes: we always turn them into microseconds first. This is usually fine. But with certain operations like printing, especially when the datetime is far from the unix epoch, it can result in overflow.

My proposed fix is to build the i64 representation directly from the (%DateTime{}, time_unit) pair using (mostly) Polars functions. This works -- and I added some properties to be confident -- but it has the drawback of us no longer being able to support impl From<ExNaiveDateTime> for i64 since we need to know which time unit the user wants to represent the datetime with. We didn't use that impl much so I think the drawback is acceptable.

We're also handling the case where nanosecond precision datetimes can't be represented more explicitly.

billylanchantin commented 1 day ago

Turns out this branch fixes more than I thought. main is not encoding :nanosecond datetimes correctly:

MIX_ENV=test iex -S mix # test env to get access to non-UTC timezones
iex> dt = %DateTime{year: 2017, month: 11, day: 7, zone_abbr: "CET",
...>                hour: 11, minute: 45, second: 18, microsecond: {123456, 6},
...>                utc_offset: 3600, std_offset: 0, time_zone: "Europe/Paris"}
#DateTime<2017-11-07 11:45:18.123456+01:00 CET Europe/Paris>

# This branch (correct)
iex> [dt] |> Explorer.Series.from_list(dtype: {:datetime, :nanosecond, dt.time_zone})
#Explorer.Series<
  Polars[1]
  datetime[ns, Europe/Paris] [2017-11-07 11:45:18.123456+01:00 CET Europe/Paris]
>

# main (way off)
iex> [dt] |> Explorer.Series.from_list(dtype: {:datetime, :nanosecond, dt.time_zone})
#Explorer.Series<
  Polars[1]
  datetime[ns, Europe/Paris] [1970-01-18 11:27:35.118123+01:00 CET Europe/Paris]
>