dask / fastparquet

python implementation of the parquet columnar file format.
Apache License 2.0

Upcoming pandas (>2.2.0) raises "read-only" errors #919

Open martindurant opened 9 months ago

martindurant commented 9 months ago

No longer allows setting series values in-place. Thanks pandas.

jorisvandenbossche commented 9 months ago

You're welcome!

Returning read-only numpy arrays is one of the parts of the large CoW change (https://pandas.pydata.org/pdeps/0007-copy-on-write.html) that we are least certain about, so feedback from downstream developers is very welcome.

I assume the issue here is because you allocate an empty dataframe first, and then get "view" arrays to write into. For the index, in one of the code paths that happens here:

https://github.com/dask/fastparquet/blob/eec9e614603f9be3cb495409ccb263caff53fe9d/fastparquet/dataframe.py#L156

The return value of .values is now a read-only numpy array (https://pandas.pydata.org/docs/user_guide/copy_on_write.html#read-only-numpy-arrays). You know you just created this data yourself, so you can safely change its writeable flag to True as a workaround.
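A minimal sketch of that workaround, for illustration only. The helper name `writable_values` is hypothetical and is not part of fastparquet or pandas. It assumes the caller has just allocated the index itself, so nothing else shares the buffer:

```python
import numpy as np
import pandas as pd


def writable_values(index):
    """Return index.values, re-enabling writes if pandas marked it read-only.

    Hypothetical helper: only safe when the caller created the index data
    itself and no other object depends on it staying unchanged.
    """
    arr = index.values
    if isinstance(arr, np.ndarray) and not arr.flags.writeable:
        # The read-only array is a view on a writeable buffer, so numpy
        # allows the flag to be switched back on.
        arr.flags.writeable = True
    return arr


# Allocate a placeholder index, then fill it in place through the view.
index = pd.Index(np.zeros(3, dtype="int64"))
buf = writable_values(index)
buf[:] = [10, 20, 30]
```

On pandas versions that do not return read-only arrays, the flag is already set and the helper simply returns `index.values` unchanged.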

And I suppose this only happens for the Index, because for columns you rely on the Block.values, where we didn't add this protection as this is regarded as internal anyway.


It's probably already covered by the failing tests in fastparquet's own test suite, but here are some tests that fail on the pandas side (they have been skipped with CoW enabled for some time; we should have reported that earlier):

import datetime

import pandas as pd

# dataframe with a non-default (i.e. non-RangeIndex) index
df = pd.DataFrame({"A": [1, 2, 3]}, index=list("abc"))
df.to_parquet("test.parquet", engine="fastparquet")
pd.read_parquet("test.parquet", engine="fastparquet")

# probably same underlying issue; tz-aware datetime index
idx = [datetime.datetime.now(datetime.timezone.utc)] * 5
df = pd.DataFrame(index=idx, data={"index_as_col": idx})
df.to_parquet("test.parquet", engine="fastparquet")
pd.read_parquet("test.parquet", engine="fastparquet")

martindurant commented 9 months ago

Thanks for the info, @jorisvandenbossche. Any idea of the release timeline?

jorisvandenbossche commented 8 months ago

The current goal is April