**Open** · dakenblack opened this issue 2 years ago
Duplicate of https://github.com/dask/dask/issues/8666?
Hi @martindurant, thanks for your response, but I don't think it is a duplicate. I've also further minimised my example.
I've also noticed that only the partitioned columns with a single value are missing. If, for example, I change my dataset to the one below, my code succeeds.
```python
df = pd.DataFrame([
    ('A', 'B', 1),
    ('A', 'C', 3),
    ('B', 'C', 3),
], columns=['group1', 'group2', 'value'])
```
@rjzamora, the fastparquet (fp) engine allows a base_path argument so that it can infer the top of the parquet dataset tree correctly in this case, but I don't see how to pass it via dd.read_parquet.
The situation is that the root of the parquet dataset is not obvious when there is no _metadata file and the top level of partitioning has only one value.
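To illustrate the ambiguity, here is a stdlib-only sketch (directory and file names are made up for illustration; the files are empty placeholders). When the top-level partition directory has only one value, a reader walking up from the data files cannot tell which directory is the dataset root without a _metadata file:

```python
import pathlib
import tempfile

# Build a fake hive-style partitioned layout where the top-level
# partition column 'group1' has a single value ('A').
root = pathlib.Path(tempfile.mkdtemp()) / "data"
for group2 in ("B", "C"):
    part_dir = root / "group1=A" / f"group2={group2}"
    part_dir.mkdir(parents=True)
    (part_dir / "part.0.parquet").touch()  # contents irrelevant here

# Without _metadata, both of these directories look like plausible roots:
#   data/           -> implies partition columns: group1, group2
#   data/group1=A/  -> implies partition columns: group2 only
for candidate in (root, root / "group1=A"):
    print(candidate.name, "->", sorted(p.name for p in candidate.iterdir()))
```

Picking the wrong root drops `group1` from the inferred schema, which matches the single-valued-column behaviour described above.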
What happened:
dask.dataframe.read_parquet is not able to find partitioned column names when _metadata is not available and the fastparquet engine is used. pandas (with the fastparquet engine) and dask (with the pyarrow engine) are able to find them, so the problem appears to lie with dask's fastparquet engine. I've removed _metadata because, for my use case, I find it is faster to update the parquet dataset without it (i.e. periodically updating the metadata is too expensive).
What you expected to happen: All columns to be found by dask.
Minimal Complete Verifiable Example:
Anything else we need to know?: I've also tried specifying the columns to load via the `columns` argument, but that returns an error:
Environment: