Eventual-Inc / Daft

Distributed data engine for Python/SQL designed for the cloud, powered by Rust
https://getdaft.io
Apache License 2.0

Parquet reader support for RLE-encoded boolean columns #3329

Open uditrana opened 3 days ago

uditrana commented 3 days ago

Describe the bug

Daft fails to read Parquet files in which a nullable boolean column is RLE-encoded; the reader raises a "Not yet implemented" error.

To Reproduce

import polars as pl
import daft

df = pl.DataFrame(
    {"a": [1, 2, 3, 4, 5], "b": [5, 4, 3, 2, 1], "c": [True, False, None, False, None]},
)
df.write_parquet("data/tmp_dataset/tmp.parquet")
display(df)
daft.read_parquet("data/tmp_dataset/tmp.parquet").collect()

shape: (5, 3)
┌─────┬─────┬───────┐
│ a   ┆ b   ┆ c     │
│ --- ┆ --- ┆ ---   │
│ i64 ┆ i64 ┆ bool  │
╞═════╪═════╪═══════╡
│ 1   ┆ 5   ┆ true  │
│ 2   ┆ 4   ┆ false │
│ 3   ┆ 3   ┆ null  │
│ 4   ┆ 2   ┆ false │
│ 5   ┆ 1   ┆ null  │
└─────┴─────┴───────┘

DaftCoreException: DaftError::External Unable to create arrow chunk from streaming file readerdata/tmp_dataset/tmp.parquet: Not yet implemented: Decoding Boolean "Rle"-encoded optional  parquet pages

Expected behavior

Daft should read these boolean columns without raising an error.

Component(s)

Parquet

Additional context

No response

uditrana commented 3 days ago

Building the same DataFrame in memory through Arrow, without going through Parquet, seems to work though:

import polars as pl
import daft

df = pl.DataFrame(
    {"a": [1, 2, 3, 4, 5], "b": [5, 4, 3, 2, 1], "c": [True, False, None, False, None]},
)
display(df)

display(
    daft.from_arrow(df.to_arrow())
)

desmondcheongzx commented 3 days ago

Ah, seems we haven't added RLE decoding support for booleans. Leaving a note for myself to implement this in src/arrow2/src/io/parquet/read/deserialize/boolean/basic.rs
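
For reference, here is a rough Python sketch of what decoding booleans from Parquet's RLE/bit-packed hybrid encoding at bit width 1 involves. It is not the arrow2 code and not necessarily how the fix will look; the function names `read_uvarint` and `decode_rle_booleans` are made up for illustration. It assumes any length prefix on the page has already been stripped, and it ignores definition levels: for an optional column only the non-null values are stored in the value stream, and the nulls have to be re-inserted from the definition levels separately.

```python
def read_uvarint(buf: bytes, pos: int) -> tuple[int, int]:
    """Read an unsigned LEB128 varint; return (value, next position)."""
    result = 0
    shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, pos
        shift += 7


def decode_rle_booleans(buf: bytes, num_values: int) -> list[bool]:
    """Decode `num_values` booleans from a hybrid RLE/bit-packed stream (bit width 1)."""
    values: list[bool] = []
    pos = 0
    while len(values) < num_values:
        header, pos = read_uvarint(buf, pos)
        if header & 1:
            # Bit-packed run: (header >> 1) groups of 8 values.
            # At bit width 1, each group occupies exactly one byte, LSB first.
            num_bytes = header >> 1
            for byte in buf[pos:pos + num_bytes]:
                values.extend(bool((byte >> bit) & 1) for bit in range(8))
            pos += num_bytes
        else:
            # RLE run: (header >> 1) repetitions of a single value,
            # stored in one byte at bit width 1.
            run_length = header >> 1
            values.extend([bool(buf[pos] & 1)] * run_length)
            pos += 1
    # The final bit-packed group may be padded past num_values.
    return values[:num_values]


# An RLE run of three `True` values followed by one bit-packed group.
payload = bytes([0x06, 0x01, 0x03, 0b00000101])
decoded = decode_rle_booleans(payload, 6)
print(decoded)  # [True, True, True, True, False, True]
```

The header's low bit selects the run type, so a single loop handles both cases; trimming at the end covers the padding in the last bit-packed group.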