dask / dask-expr

BSD 3-Clause "New" or "Revised" License

Predicate validation in parquet could happen before compute #753

Open phofl opened 8 months ago

phofl commented 8 months ago
    import pytest
    import dask.dataframe as dd

    # `tmp_path` and `engine` are pytest fixtures/parameters in the original test
    ddf = dd.from_dict(
        {"A": range(8), "B": [1, 1, 2, 2, 3, 3, 4, 4]},
        npartitions=4,
    )
    ddf.to_parquet(tmp_path, engine=engine)

    with pytest.raises(ValueError, match="not a valid operator in predicates"):
        unsupported_op = [[("B", "not eq", 1)]]
        dd.read_parquet(tmp_path, engine=engine, filters=unsupported_op)

Ideally, this would raise before we trigger compute.
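For context, an eager validation step could look roughly like the sketch below. This is a hypothetical helper, not dask-expr's actual implementation; the operator set mirrors what pyarrow accepts in DNF-style parquet filters.

```python
# Sketch: validate parquet filter predicates eagerly, before any compute
# is triggered. VALID_OPERATORS follows pyarrow's accepted filter operators.
VALID_OPERATORS = {"=", "==", "!=", "<", ">", "<=", ">=", "in", "not in"}


def validate_filters(filters):
    """Raise ValueError early if a predicate uses an unsupported operator.

    Accepts filters in DNF form (a list of conjunctions, each a list of
    ``(column, op, value)`` tuples) or as a flat list of tuples.
    """
    if not filters:
        return
    # Normalize the flat form [("B", "==", 1)] to [[("B", "==", 1)]]
    if isinstance(filters[0], tuple):
        filters = [filters]
    for conjunction in filters:
        for col, op, value in conjunction:
            if op not in VALID_OPERATORS:
                raise ValueError(
                    f"{op!r} is not a valid operator in predicates."
                )
```

A check like this could run inside `read_parquet` at expression-construction time, so the error surfaces immediately rather than at compute.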

phofl commented 8 months ago

cc @rjzamora if you have thoughts where this belongs

rjzamora commented 8 months ago

It should raise once dask-expr actually passes the filters to pyarrow, but this won't happen until we need to know the number of partitions. So, dd.read_parquet(...).compute() should raise. If this is not the case, then there is indeed a bug somewhere.

rjzamora commented 8 months ago

Just to clarify: We can still add an extra/earlier validation step to dask-expr. I'll take a look to see where that would be easiest.

phofl commented 8 months ago

Sorry, you are completely correct; it does raise when I call compute.

Ideally we would fail earlier, but this is by no means urgent. I'll change the title.