Open · thinkharderdev opened this issue 2 years ago
Thanks @thinkharderdev for the report, this is due to: https://github.com/apache/arrow-datafusion/blob/68db579181bd826e6ab6cd659f52d443b950eaa5/datafusion/src/datasource/listing/table.rs#L128-L135

The fix should be pretty straightforward: we need a new constructor method on `ListingTable` for plan ser/de, one that just takes the field arguments and stores them in the new struct as-is, without the extra schema logic. Then we just need to call that new constructor in https://github.com/apache/arrow-datafusion/blob/68db579181bd826e6ab6cd659f52d443b950eaa5/ballista/rust/core/src/serde/logical_plan/from_proto.rs#L200.
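In outline, and with self-contained stand-in types rather than the real DataFusion structs (the `new_for_serde` name and all field names below are illustrative assumptions, not the actual API):

```rust
// Stand-in types for illustration; the real ListingTable lives in
// datafusion/src/datasource/listing/table.rs.
#[derive(Clone, Debug)]
struct Field {
    name: String,
    data_type: String, // stands in for arrow::datatypes::DataType
}

#[derive(Clone, Debug)]
struct Schema {
    fields: Vec<Field>,
}

struct ListingTable {
    table_path: String,
    schema: Schema,
    partition_cols: Vec<String>,
}

impl ListingTable {
    /// Mirrors the current constructor: builds the table schema by appending
    /// the partition columns to the file schema.
    fn new(table_path: String, file_schema: Schema, partition_cols: Vec<String>) -> Self {
        let mut schema = file_schema;
        for col in &partition_cols {
            // The partition column type is hard-coded, as the report notes.
            schema.fields.push(Field {
                name: col.clone(),
                data_type: "Utf8".into(),
            });
        }
        Self { table_path, schema, partition_cols }
    }

    /// Proposed ser/de constructor: the deserialized schema already contains
    /// the partition columns, so store the fields as-is with no extra logic.
    fn new_for_serde(table_path: String, schema: Schema, partition_cols: Vec<String>) -> Self {
        Self { table_path, schema, partition_cols }
    }
}
```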
Cool. I can take a crack at that. Thanks!
**Describe the bug**
To read a partitioned parquet dataset while still allowing predicate pushdown on partition columns, I am manually constructing a table scan `LogicalPlan` on a manually constructed `ListingTable` which specifies the partition column(s). The `ListingTable` constructor adds the partition columns to the `Schema`. The plan is then serialized and sent to the Ballista scheduler, which deserializes it and constructs a new `ListingTable`; this again adds the partition columns to the schema, resulting in a duplicate-field error when constructing the `DFSchema`.
**To Reproduce**
Steps to reproduce the behavior:
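The exact reproduction steps aren't preserved in this extract, but the failure mode described above can be sketched with the stand-in types from the earlier snippet:

```rust
fn main() {
    // Round-tripping through plan ser/de calls the appending constructor
    // twice, so the partition column ends up in the schema twice.
    let file_schema = Schema {
        fields: vec![Field { name: "value".into(), data_type: "Int64".into() }],
    };
    let original = ListingTable::new(
        "/data/tbl".into(),
        file_schema,
        vec!["my-parition-column".into()],
    );
    // The deserializer reconstructs the table from the *table* schema, which
    // already includes the partition column, and appends it again:
    let roundtripped = ListingTable::new(
        original.table_path.clone(),
        original.schema.clone(),
        original.partition_cols.clone(),
    );
    let dupes = roundtripped
        .schema
        .fields
        .iter()
        .filter(|f| f.name == "my-parition-column")
        .count();
    assert_eq!(dupes, 2); // duplicate field: DFSchema construction fails here
}
```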
**Expected behavior**
This should work, and the planner should push a filter on `my-parition-column` down to the physical scan so that we only read parquet files from the requested partitions.
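For concreteness, the intended usage looks roughly like the following (a sketch assuming the DataFusion dataframe API, which has shifted across versions; the table argument and the literal `2022-01-01` are stand-ins):

```rust
use std::sync::Arc;
use datafusion::datasource::TableProvider;
use datafusion::prelude::{col, lit, SessionContext};

// `table` is assumed to be the manually constructed ListingTable from the
// bug description, with `my-parition-column` declared as a partition column.
async fn read_one_partition(
    ctx: &SessionContext,
    table: Arc<dyn TableProvider>,
) -> datafusion::error::Result<()> {
    let df = ctx
        .read_table(table)?
        // Expectation: this predicate is pushed down to the physical scan,
        // so only files under my-parition-column=2022-01-01/ are read.
        .filter(col("my-parition-column").eq(lit("2022-01-01")))?;
    df.show().await?;
    Ok(())
}
```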
**Additional context**
A simple way to fix this would be to check in the `ListingTable` constructor whether the partition columns are already included in the schema; see the sketch below. When I try this locally, it works in the sense that I no longer get an error for duplicate fields, but I do get another error downstream. My guess is that this is because the partition column datatype is hard-coded, but I haven't debugged it fully.
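The check itself, again on the stand-in types rather than the real structs, might look roughly like this; the hard-coded `"Utf8"` mirrors the hard-coded partition column type mentioned above:

```rust
impl ListingTable {
    /// The suggested guard: only append a partition column when the incoming
    /// schema does not already contain a field with that name.
    fn new_dedup(table_path: String, file_schema: Schema, partition_cols: Vec<String>) -> Self {
        let mut schema = file_schema;
        for col in &partition_cols {
            let already_present = schema.fields.iter().any(|f| &f.name == col);
            if !already_present {
                // "Utf8" stands in for the hard-coded partition column type;
                // per the report, that hard-coding is the likely cause of the
                // remaining downstream error.
                schema.fields.push(Field {
                    name: col.clone(),
                    data_type: "Utf8".into(),
                });
            }
        }
        Self { table_path, schema, partition_cols }
    }
}
```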