elixir-explorer / explorer

Series (one-dimensional) and dataframes (two-dimensional) for fast and elegant data exploration in Elixir
https://hexdocs.pm/explorer
MIT License

LazyFrame not being able to cast dtypes #959

Closed halian-vilela closed 3 months ago

halian-vilela commented 3 months ago

Hey guys!

I'm doing some initial tests on streaming Parquet files from S3, but something is going wrong in the step where Polars tries to cast the dtypes of the Parquet file's columns.

The call was pretty straightforward:

s3_url = "s3://<all_relevant_paths>/id-part-0.parquet"
opts = [
  access_key_id: <hidden>,
  secret_access_key: <hidden>,
  token: <hidden>,
  region: <my_region>
]

Explorer.DataFrame.from_parquet(s3_url, config: opts, lazy: true)

But I'm getting this error:

** (MatchError) no match of right hand side value: {:error, "Generic Error: cannot cast to dtype: i32"}
    (explorer 0.7.2) lib/explorer/polars_backend/shared.ex:107: Explorer.PolarsBackend.Shared.df_dtypes/1
    (explorer 0.7.2) lib/explorer/polars_backend/shared.ex:87: Explorer.PolarsBackend.Shared.create_dataframe/1
    (explorer 0.7.2) lib/explorer/polars_backend/lazy_frame.ex:163: Explorer.PolarsBackend.LazyFrame.from_parquet/3
    iex:6: (file)

The content of the column in question is a simple int32 timestamp (I've double-checked it, and there are no artifacts, strings, empty values, or other bad data):

_prefix__date integer [1646697600, 1646697600, 1646697600, 1646697600, 1646697600, ...]

Some important information:

  1. I could confirm that after downloading this same Parquet file and loading it directly from the local file system, it worked flawlessly, so it must be something specific to lazy frames:
    Explorer.DataFrame.from_parquet("./id-part-0.parquet")
    {:ok, #Explorer.DataFrame< Polars[37131 x 16] ... >}

    1.1 Indeed, if I just add the lazy: true option, it gives me the error again:

    Explorer.DataFrame.from_parquet("./id-part-0.parquet", lazy: true)
    ** (MatchError) no match of right hand side value: {:error, "Generic Error: cannot cast to dtype: i32"}
    (explorer 0.7.2) lib/explorer/polars_backend/shared.ex:107: Explorer.PolarsBackend.Shared.df_dtypes/1
    (explorer 0.7.2) lib/explorer/polars_backend/shared.ex:87: Explorer.PolarsBackend.Shared.create_dataframe/1
    (explorer 0.7.2) lib/explorer/polars_backend/lazy_frame.ex:171: Explorer.PolarsBackend.LazyFrame.from_parquet/3
    iex:1: (file)
  2. Following the stack trace and adding a dbg call inside Explorer.PolarsBackend.Shared.df_dtypes/1, we can see that the LazyFrame was already available at that point:
    defp df_dtypes(%PolarsLazyFrame{} = polars_df) do
      dbg(polars_df)
      {:ok, dtypes} = Native.lf_dtypes(polars_df)
      dtypes
    end
    -----------------------
    [(explorer 0.7.2) lib/explorer/polars_backend/shared.ex:106: Explorer.PolarsBackend.Shared.df_dtypes/1]
    polars_df #=> %Explorer.PolarsBackend.LazyFrame{
      resource: #Reference<0.138873274.1073086465.28960>
    }

    So the following step (the call to Native.lf_dtypes/1) is the root of all evil... 😢

I even took a look at its Rust source but, skill issues aside (🤣), I can't identify anything obvious that could be failing:

#[rustler::nif]
pub fn lf_dtypes(data: ExLazyFrame) -> Result<Vec<ExSeriesDtype>, ExplorerError> {
    let mut dtypes: Vec<ExSeriesDtype> = vec![];
    let schema = data.clone_inner().schema()?;

    for dtype in schema.iter_dtypes() {
        dtypes.push(ExSeriesDtype::try_from(dtype)?)
    }

    Ok(dtypes)
}
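
Reading the function above, the `?` inside the loop means that the first dtype that fails to convert aborts the whole call. That would be consistent with an `{:error, ...}` tuple reaching the strict `{:ok, dtypes} =` match in `df_dtypes/1`. Here is a minimal, std-only Rust sketch of that control flow. `EngineDtype`, `ExposedDtype`, `ConvertError` and `schema_dtypes` are hypothetical stand-ins, not Explorer's or Polars' real types, and the assumption that a 32-bit integer has no mapping is only for illustration; nothing in this thread confirms the actual cause.

```rust
use std::convert::TryFrom;

// Hypothetical stand-in for the engine-side column dtype (not the Polars type).
#[derive(Debug, Clone, Copy, PartialEq)]
enum EngineDtype {
    Int32,
    Int64,
    Float64,
}

// Hypothetical stand-in for the dtype handed back to Elixir.
#[derive(Debug, PartialEq)]
enum ExposedDtype {
    Integer,
    Float,
}

// Hypothetical error type carrying a plain message.
#[derive(Debug, PartialEq)]
struct ConvertError(String);

impl TryFrom<EngineDtype> for ExposedDtype {
    type Error = ConvertError;

    fn try_from(dtype: EngineDtype) -> Result<Self, Self::Error> {
        match dtype {
            EngineDtype::Int64 => Ok(ExposedDtype::Integer),
            EngineDtype::Float64 => Ok(ExposedDtype::Float),
            // Assumption for illustration only: this variant has no mapping.
            EngineDtype::Int32 => {
                Err(ConvertError("cannot cast to dtype: i32".to_string()))
            }
        }
    }
}

// Same shape as the loop in lf_dtypes: `?` returns early on the first failure.
fn schema_dtypes(schema: &[EngineDtype]) -> Result<Vec<ExposedDtype>, ConvertError> {
    let mut dtypes = Vec::new();
    for dtype in schema {
        dtypes.push(ExposedDtype::try_from(*dtype)?);
    }
    Ok(dtypes)
}
```

In this sketch, calling `schema_dtypes(&[EngineDtype::Int64, EngineDtype::Int32, EngineDtype::Float64])` returns the `Err` produced by the second element and never inspects the third, so a single unmapped column is enough to fail the whole schema lookup.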

Is there anything we can do about it? Thanks!

halian-vilela commented 3 months ago

Solved!

It seems this was an issue in version 0.7.1, but it is already fixed in 0.9.1!

Closing!