dask / fastparquet

Python implementation of the Parquet columnar file format.
Apache License 2.0

Support upcoming default pandas string dtype (pandas >= 3) #930

Closed jorisvandenbossche closed 4 weeks ago

jorisvandenbossche commented 2 months ago

Pandas decided to introduce a default string dtype (which will be used by default instead of object-dtype when inferring values to be strings), see https://pandas.pydata.org/pdeps/0014-string-dtype.html for the details (and https://github.com/pandas-dev/pandas/issues/54792 for progress of implementation).

This is already available on the main branch of pandas (and will also be in an upcoming 2.3 release) behind a feature flag, pd.options.future.infer_string = True.

Right now, if you enable this flag (with a nightly version of pandas) and use fastparquet to write a dataframe with a string column, it errors as follows (because fastparquet is not yet aware of the new dtype):

In [1]: pd.options.future.infer_string = True

In [2]: df = pd.DataFrame({"a": ["some", "strings"]})

In [3]: df.dtypes
Out[3]: 
a    str
dtype: object

In [4]: df.to_parquet("test_new_string_dtype.parquet", engine="fastparquet")
...
File ~/conda/envs/dev/lib/python3.11/site-packages/fastparquet/writer.py:904, in make_metadata(data, has_nulls, ignore_columns, fixed_text, object_encoding, times, index_cols, partition_cols, cols_dtype)
    902     se.name = column
    903 else:
--> 904     se, type = find_type(data[column], fixed_text=fixed,
    905                          object_encoding=oencoding, times=times,
    906                          is_index=is_index)
    907 col_has_nulls = has_nulls
    908 if has_nulls is None:

File ~/conda/envs/dev/lib/python3.11/site-packages/fastparquet/writer.py:222, in find_type(data, fixed_text, object_encoding, times, is_index)
    218     type, converted_type, width = (parquet_thrift.Type.BYTE_ARRAY,
    219                                    parquet_thrift.ConvertedType.UTF8,
    220                                    None)
    221 else:
--> 222     raise ValueError("Don't know how to convert data type: %s" % dtype)
    223 se = parquet_thrift.SchemaElement(
    224     name=norm_col_name(data.name, is_index), type_length=width,
    225     converted_type=converted_type, type=type,
   (...)
    228     i32=True
    229 )
    230 return se, type

ValueError: Don't know how to convert data type: str
martindurant commented 2 months ago

With this type, are the values still python strings?

jorisvandenbossche commented 2 months ago

The values are stored either as an object-dtype array of Python strings (with np.nan for missing values) or as a pyarrow array, depending on the .storage attribute of the dtype (and we will default to pyarrow if it is installed).
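For illustration, a minimal sketch of inspecting that .storage attribute. It constructs the string dtype explicitly (rather than via the future inference flag, which needs a pandas nightly), so it runs on current pandas releases:

```python
import pandas as pd

# Build a Series with the string dtype explicitly; under the future flag,
# plain string data would infer to this dtype automatically.
s = pd.Series(["some", "strings", None], dtype=pd.StringDtype("python"))

# .storage names the backend holding the values: "python" (an object array
# of Python str) or "pyarrow" (a pyarrow array).
print(s.dtype.storage)  # -> python
```

With pd.StringDtype("pyarrow") (pyarrow installed), the same Series would be arrow-backed instead.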

jorisvandenbossche commented 2 months ago

But regardless of the exact storage, if you just want Python strings you can always do something like to_numpy(dtype=object); then you don't have to care about the exact storage.
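A sketch of that suggestion (again constructing the dtype explicitly so it runs without the nightly-only inference flag):

```python
import pandas as pd

s = pd.Series(["some", "strings"], dtype=pd.StringDtype("python"))

# Regardless of whether the backing storage is "python" or "pyarrow",
# to_numpy(dtype=object) hands back a plain object array of Python str.
arr = s.to_numpy(dtype=object)
print(arr.dtype)       # -> object
print(type(arr[0]))    # -> <class 'str'>
```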

martindurant commented 2 months ago

if you just want to have Python strings

I want to pre-allocate a dataframe and fill in the values as they are read. That model probably doesn't work anymore for arrow-backed data more complex than the equivalent numpy array.

https://github.com/dask/fastparquet/pull/931 shows the possible future evolution of fastparquet where we no longer use pandas at all...

jorisvandenbossche commented 2 months ago

(FWIW, pandas is not going to hard require pyarrow for pandas 3.0, that decision is postponed until a later release. But regardless of that, having less pandas-specific code here sounds certainly worthwhile)

Preallocating probably won't work for the arrow-backed data, indeed. But I would say you can always read the strings as you do now (preallocating an object-dtype array, I assume?) and do any conversion afterwards (or leave that to pandas).
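A rough sketch of that read path, with a hypothetical reader that fills a preallocated object array and leaves the final dtype conversion to pandas:

```python
import numpy as np
import pandas as pd

# Hypothetical: pre-allocate an object-dtype column as the reader does now...
values = np.empty(2, dtype=object)
values[0] = "some"       # ...fill in decoded values as they are read...
values[1] = "strings"

# ...then let pandas perform the conversion to the string dtype afterwards.
s = pd.Series(values).astype(pd.StringDtype("python"))
print(s.dtype.storage)  # -> python
```

The conversion step would work the same way for pyarrow-backed storage; only the astype target changes.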

martindurant commented 2 months ago

you can always read the strings as you do now

Probably we'll continue to produce numpy object columns while we can, but we still have to deal with the str type when writing.
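One plausible shape for the writing-side fix (a hypothetical helper, not the actual fastparquet patch) is to detect pandas' string dtype and route it through the existing object/UTF8 BYTE_ARRAY path:

```python
import pandas as pd

def needs_utf8_path(series):
    # Hypothetical helper: treat pandas' string dtype like an object
    # column of Python strings for parquet-writing purposes.
    if isinstance(series.dtype, pd.StringDtype):
        return True
    return series.dtype == object

print(needs_utf8_path(pd.Series(["a"], dtype=pd.StringDtype("python"))))  # -> True
print(needs_utf8_path(pd.Series([1])))                                    # -> False
```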

I'll get back to you on the two issues, thanks for letting me know.