elixir-explorer / explorer

Series (one-dimensional) and dataframes (two-dimensional) for fast and elegant data exploration in Elixir
https://hexdocs.pm/explorer
MIT License

cannot encode value Binary to term (when Binary is nested in struct) #994

Closed aymanosman closed 1 month ago

aymanosman commented 1 month ago

I'm getting this error with a dataset that contains a Binary value nested in a struct:

thread '<unnamed>' panicked at src/encoding.rs:786:15: cannot encode value Binary([1, 2, 3]) to term

Steps to reproduce.

import polars as pl
df = pl.DataFrame({"image": [b'\x01\x02\x03']})
df.write_parquet("ok.parquet")
df = pl.DataFrame({"image": [{"bytes": b'\x01\x02\x03'}]})
df.write_parquet("bad.parquet")
iex(41)> Explorer.DataFrame.from_parquet!("ok.parquet")
#Explorer.DataFrame<
  Polars[1 x 1]
  image binary [<<1, 2, 3>>]
>

iex(42)> Explorer.DataFrame.from_parquet!("bad.parquet")
thread '<unnamed>' panicked at src/encoding.rs:783:15:
cannot encode value Binary([1, 2, 3]) to term
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
  #Inspect.Error<
  got ErlangError with message:

      """
      Erlang error: :nif_panicked
      """

  while inspecting:

      %{
        data: %Explorer.PolarsBackend.DataFrame{
          resource: #Reference<0.315878801.3185704982.174156>
        },
        remote: nil,
        names: ["image"],
        __struct__: Explorer.DataFrame,
        dtypes: %{"image" => {:struct, [{"bytes", :binary}]}},
        groups: []
      }

  Stacktrace:

    (explorer 0.10.0-dev) Explorer.PolarsBackend.Native.s_to_list(#Explorer.PolarsBackend.Series<
  shape: (1,)
  Series: 'image' [struct[1]]
  [
    {b"\x01\x02\x03"}
  ]
>)
    (explorer 0.10.0-dev) lib/explorer/polars_backend/shared.ex:24: Explorer.PolarsBackend.Shared.apply_series/3
    (explorer 0.10.0-dev) lib/explorer/backend/data_frame.ex:324: anonymous fn/3 in Explorer.Backend.DataFrame.build_cols_algebra/3
    (elixir 1.17.2) lib/enum.ex:1703: Enum."-map/2-lists^map/1-1-"/2
    (explorer 0.10.0-dev) lib/explorer/backend/data_frame.ex:283: Explorer.Backend.DataFrame.inspect/5
    (explorer 0.10.0-dev) lib/explorer/data_frame.ex:6269: Inspect.Explorer.DataFrame.inspect/2
    (elixir 1.17.2) lib/inspect/algebra.ex:347: Inspect.Algebra.to_doc/2
    (elixir 1.17.2) lib/kernel.ex:2381: Kernel.inspect/2

>

FYI, the original dataset that causes the issue is https://huggingface.co/datasets/microsoft/cats_vs_dogs/resolve/main/data/train-00000-of-00002.parquet.

philss commented 1 month ago

I can confirm this is a bug on our side (in the Rust code). When encoding structs, we recursively call the term_from_value function, which does not implement the encoding of AnyValue::Binary(bytes). That encoding is only implemented by the resource_term_from_value function, which requires a ResourceArc as a parameter.

I will investigate how to fix this issue. Thanks for the report!
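
To illustrate the failure mode described above, here is a minimal, self-contained sketch. It is a toy model, not Explorer's code: `Value`, `Term`, `encode_top_level`, `encode_nested`, and `encode_nested_fixed` are hypothetical names standing in for Polars' `AnyValue`, the Erlang terms, and the two encoding paths. It only shows why a recursive encoder that lacks a Binary arm fails once a binary appears inside a struct, even though the top-level path handles binaries fine.

```rust
// Toy stand-ins (assumptions): not Polars' AnyValue and not Rustler's Term.
#[derive(Debug, Clone, PartialEq)]
enum Value {
    Int(i64),
    Binary(Vec<u8>),
    Struct(Vec<(String, Value)>),
}

#[derive(Debug, PartialEq)]
enum Term {
    Int(i64),
    Binary(Vec<u8>),
    Map(Vec<(String, Term)>),
}

// Recursive path used for struct fields. It has no Binary arm, so a binary
// nested in a struct falls through to the error case.
fn encode_nested(value: &Value) -> Result<Term, String> {
    match value {
        Value::Int(i) => Ok(Term::Int(*i)),
        Value::Struct(fields) => {
            let mut out = Vec::with_capacity(fields.len());
            for (name, v) in fields {
                out.push((name.clone(), encode_nested(v)?));
            }
            Ok(Term::Map(out))
        }
        other => Err(format!("cannot encode value {:?} to term", other)),
    }
}

// Top-level path: handles Binary itself, then delegates everything else.
fn encode_top_level(value: &Value) -> Result<Term, String> {
    match value {
        Value::Binary(bytes) => Ok(Term::Binary(bytes.clone())),
        other => encode_nested(other),
    }
}

// One possible repair in this toy model: give the recursive path its own
// Binary arm so that struct fields can carry binaries.
fn encode_nested_fixed(value: &Value) -> Result<Term, String> {
    match value {
        Value::Int(i) => Ok(Term::Int(*i)),
        Value::Binary(bytes) => Ok(Term::Binary(bytes.clone())),
        Value::Struct(fields) => {
            let mut out = Vec::with_capacity(fields.len());
            for (name, v) in fields {
                out.push((name.clone(), encode_nested_fixed(v)?));
            }
            Ok(Term::Map(out))
        }
    }
}
```

In this sketch, a top-level `Value::Binary` encodes successfully, while the same bytes wrapped in a `Value::Struct` produce an error from `encode_nested` and a map containing a binary from `encode_nested_fixed`. The real code panics instead of returning an error, and the real fix may need to look different because of the ResourceArc requirement mentioned above.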

ceyhunkerti commented 1 month ago

@philss https://github.com/elixir-explorer/explorer/pull/995 do you think this is OK as a fix?

philss commented 1 month ago

@ceyhunkerti this partially fixes it. It builds a List term rather than a Binary term, so maybe we need a wider change there. But I can confirm that the parquet file will load with your change.
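
The difference between a List term and a Binary term can be sketched with a toy model. The `Encoded` enum and the two helper functions below are hypothetical and are not Rustler's or Explorer's API. They only show that the same bytes can be encoded in two distinct shapes, which on the Elixir side would roughly correspond to `[1, 2, 3]` versus `<<1, 2, 3>>`.

```rust
// Toy stand-in (assumption) for the terms handed back to Elixir.
#[derive(Debug, PartialEq)]
enum Encoded {
    Int(u8),
    List(Vec<Encoded>), // would surface in Elixir as a list such as [1, 2, 3]
    Binary(Vec<u8>),    // would surface in Elixir as a binary such as <<1, 2, 3>>
}

// Encodes each byte as a separate integer inside a list.
fn bytes_as_list(bytes: &[u8]) -> Encoded {
    Encoded::List(bytes.iter().map(|b| Encoded::Int(*b)).collect())
}

// Keeps the bytes together as one binary value.
fn bytes_as_binary(bytes: &[u8]) -> Encoded {
    Encoded::Binary(bytes.to_vec())
}
```

Both encodings carry the same bytes, but they are different values, so a column declared with the `:binary` dtype would presumably expect the second shape.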