mlineen closed this issue 5 months ago
I think we should add a `rechunk: true | false` option on `read_parquet`, as that may affect performance.

PRs are definitely welcome, although I am not sure we could test this trivially. :( PRs would be welcome regardless.

I have a large (~771M, 7,421,520 rows, 78 columns) parquet file from a vendor that I'm able to read in with `Explorer.DataFrame.from_parquet`, but I am unable to take the loaded data frame and dump or write it with either `Explorer.DataFrame.dump_parquet` or `Explorer.DataFrame.to_parquet`.

When I try, I get `(ErlangError) Erlang error: :nif_panicked` on `polars-arrow-0.38.3/src/chunk.rs:20:31`.

If I run `as_single_chunk` in Rust code, I am able to dump/write the file. If I run `set_rechunk` in Rust code when reading the file, I am able to dump/write the file.

Would the project be open to adding a `DataFrame.as_single_chunk` method and/or adding `set_rechunk` as an option to `DataFrame.read_parquet`? What does adding either of these methods mean in the context of the LazyFrame backend?

Would anyone have a good idea of how to generate synthetic data that would exhibit this issue, as I cannot pass along the file I have?
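
For anyone unfamiliar with the terminology, here is a rough sketch of what "rechunking" means conceptually. This is a toy model using only the Rust standard library, not the Polars implementation: `ChunkedColumn` and its methods are hypothetical names made up for illustration, and the real `as_single_chunk` / `set_rechunk` work on Arrow arrays rather than `Vec<i64>`.

```rust
// Toy model only: this is NOT how Polars stores data internally.
// A column is represented as an ordered list of contiguous chunks.
#[derive(Debug, Clone, PartialEq)]
struct ChunkedColumn {
    chunks: Vec<Vec<i64>>,
}

impl ChunkedColumn {
    /// Total number of values across all chunks.
    fn len(&self) -> usize {
        self.chunks.iter().map(|c| c.len()).sum()
    }

    /// Number of chunks backing this column.
    fn n_chunks(&self) -> usize {
        self.chunks.len()
    }

    /// Hypothetical analogue of rechunking: copy every chunk, in order,
    /// into one contiguous buffer so the column ends up with one chunk.
    fn as_single_chunk(&self) -> ChunkedColumn {
        let mut merged = Vec::with_capacity(self.len());
        for chunk in &self.chunks {
            merged.extend_from_slice(chunk);
        }
        ChunkedColumn { chunks: vec![merged] }
    }
}
```

In this model the values and their order are unchanged by `as_single_chunk`; only the layout differs, at the cost of one copy of the data. That copy is presumably why a rechunk option could affect performance.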