dask / dask-tutorial

Dask tutorial
https://tutorial.dask.org
BSD 3-Clause "New" or "Revised" License

07_dataframe_storage example fails to read parquet data back in #234

Closed johncoxon closed 2 years ago

johncoxon commented 2 years ago

The code in the seventh chapter fails to read the data back in from parquet after it has been written to disk.

What happened: The attempt to read the data back in from the parquet files fails.

What you expected to happen: The read to succeed so I could continue with the chapter examples.

MWE: (I've written this to illustrate the issue in a single block, but for me just following the cells in 07 reproduces the issue, too.)

import os
import dask.dataframe as dd

target = os.path.join('data', 'accounts.parquet')

print(os.listdir(target))

df_csv = dd.read_csv(filename)
print(df_csv.head())

df_csv.categorize(columns=['names']).to_parquet(
    target, storage_options={"has_nulls": True}, engine="fastparquet"
)

print(os.listdir(target))

df_p = dd.read_parquet(target)
print(df_p.head())

yields the following output:

[]
    id     names  amount
0   12   Michael    4675
1   95  Patricia      17
2  258    Xavier     612
3  435    Hannah     411
4  319       Tim     447
['_common_metadata', '_metadata', 'part.2.parquet', 'part.0.parquet', 'part.1.parquet']
Empty DataFrame
Columns: [id, names, amount]
Index: []

Environment:

bryanwweber commented 2 years ago

Thanks for reporting this! If I switch the engine to pyarrow on the to_parquet() step, the problem is resolved. The read_parquet() engine can be either fastparquet or pyarrow. I'm not sure why fastparquet is doing this; we'll need to investigate more, but at least that should get you unblocked!

martindurant commented 2 years ago

I have repeated your exact code (assuming filename = "data/accounts*.csv"), and encountered no problem. What version of fastparquet do you have?
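One stdlib-only way to answer the version question (works for any installed distribution, not just fastparquet):

```python
from importlib import metadata


def package_version(name):
    """Return the installed version string for *name*, or None if absent."""
    try:
        return metadata.version(name)
    except metadata.PackageNotFoundError:
        return None


print(package_version("fastparquet"))
```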

Do the files written to the target directory have any size? Is there any error or warning?

storage_options={"has_nulls": True},

This does not look like a storage/filesystem option, but something to pass to the parquet writer; also, it should not be necessary, all columns are assumed to be nullable.

bryanwweber commented 2 years ago

@martindurant I also found the same thing, and got no errors or warnings... However, in the context of the tutorial, the next cell suggests implementing some code that requires the DataFrame not be empty, which I suspect led to this issue.

johncoxon commented 2 years ago

Thanks both! The parquet directory contains three 25 MB files (part.0.parquet, part.1.parquet and part.2.parquet), so 75 MB in total.

jsignell commented 2 years ago

We have removed this chapter so I'll go ahead and close this issue.