elixir-explorer / explorer

Series (one-dimensional) and dataframes (two-dimensional) for fast and elegant data exploration in Elixir
https://hexdocs.pm/explorer
MIT License

:nif_panicked "Chunk require all its arrays to have an equal number of rows" #914

Closed mlineen closed 5 months ago

mlineen commented 5 months ago

I have a large (~771M, 7,421,520 row, 78 column) parquet file from a vendor. I can read it in with Explorer.DataFrame.from_parquet, but I cannot write the loaded data frame back out with either Explorer.DataFrame.dump_parquet or Explorer.DataFrame.to_parquet.

When I try, I get (ErlangError) Erlang error: :nif_panicked on polars-arrow-0.38.3/src/chunk.rs:20:31

If I call as_single_chunk on the data frame in Rust code, I am able to dump/write the file.

If I use set_rechunk in Rust code when reading the file, I am also able to dump/write the file.
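To make the failure mode concrete, here is a toy model in plain Python. It is not Polars or Explorer code: the names iter_record_batches and rechunk are hypothetical, and it assumes (without confirmation from the Polars source) that the writer pairs up the i-th chunk of every column and panics when those chunks differ in length. Under that assumption, it shows why collapsing each column into a single chunk makes the write succeed.

```python
def rechunk(column):
    """Concatenate all chunks of a column into one contiguous chunk."""
    return [[value for chunk in column for value in chunk]]


def iter_record_batches(frame):
    """Yield one batch per chunk index, pairing the i-th chunk of every column.

    Raises ValueError when the paired chunks differ in length, which is the
    condition the panic message describes (hypothetical stand-in for it).
    """
    n_chunks = max(len(column) for column in frame.values())
    for i in range(n_chunks):
        # A column with fewer chunks contributes an empty chunk here.
        batch = {
            name: (column[i] if i < len(column) else [])
            for name, column in frame.items()
        }
        lengths = {len(values) for values in batch.values()}
        if len(lengths) != 1:
            raise ValueError("all arrays in a chunk must have an equal number of rows")
        yield batch


# Two columns holding the same 6 rows, but split at different boundaries.
frame = {
    "a": [[1, 2, 3], [4, 5, 6]],
    "b": [["u", "v"], ["w", "x", "y", "z"]],
}

try:
    list(iter_record_batches(frame))
    misaligned_ok = True
except ValueError:
    misaligned_ok = False

# After rechunking every column there is exactly one aligned batch.
rechunked = {name: rechunk(column) for name, column in frame.items()}
batches = list(iter_record_batches(rechunked))
```

In this model the misaligned frame fails while the rechunked one yields a single batch. If the assumption holds, synthetic data for a reproduction would need columns whose chunk boundaries disagree; how to build such a frame through Explorer is not established here.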

Would the project be open to adding a DataFrame.as_single_chunk function and/or adding set_rechunk as an option to DataFrame.from_parquet? What would adding either of these mean in the context of the LazyFrame backend?

Would anyone have a good idea of how to generate synthetic data that would exhibit this issue, as I cannot pass along the file I have?

josevalim commented 5 months ago

I think we should:

  1. Automatically rechunk the data frame on to_parquet/dump_parquet
  2. Also add a rechunk: true | false option on from_parquet, since rechunking may affect performance

PRs are definitely welcome, even though I am not sure we could test this trivially. :(