Closed jorisvandenbossche closed 4 weeks ago
With this type, are the values still python strings?
The values are stored either as an object-dtype array of Python strings (with np.nan for missing values) or as a pyarrow array, depending on the .storage attribute of the dtype.
(and we will default to pyarrow if it is installed)
But, regardless of the exact storage, if you just want to have Python strings you can always do something like to_numpy(dtype=object)
and then you don't have to care about the exact storage
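A minimal sketch of that suggestion, assuming a pandas version that provides pd.StringDtype with a storage argument. The "python" storage is chosen here only so the example runs without pyarrow installed; the same to_numpy call is meant to work for either storage.

```python
import pandas as pd

# Build a small string column; storage="python" avoids needing pyarrow here.
ser = pd.Series(["a", None, "c"], dtype=pd.StringDtype(storage="python"))

# The dtype records which backing storage is in use.
print(ser.dtype.storage)

# Regardless of storage, ask for a plain object-dtype NumPy array
# whose non-missing elements are Python str objects.
values = ser.to_numpy(dtype=object)
print(values.dtype, type(values[0]))
```

The missing entry comes back as whatever missing-value marker the dtype uses, so code consuming the array should check it with pd.isna rather than comparing against a specific sentinel.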
if you just want to have Python strings
I want to pre-allocate a dataframe and fill in the values as they are read. That model probably doesn't work anymore for arrow-backed data more complex than the equivalent numpy array.
https://github.com/dask/fastparquet/pull/931 shows the possible future evolution of fastparquet where we no longer use pandas at all...
(FWIW, pandas is not going to hard require pyarrow for pandas 3.0, that decision is postponed until a later release. But regardless of that, having less pandas-specific code here sounds certainly worthwhile)
Preallocating indeed probably won't work for arrow-backed data. But I would say you can always read the strings as you do now (preallocating an object-dtype array, I assume?) and do any conversion afterwards (or leave the conversion to pandas).
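A rough sketch of that approach, not fastparquet's actual reader: the buffer name, the loop, and the sample values are hypothetical stand-ins for values arriving as they are decoded. It preallocates an object-dtype NumPy array, fills it in place, and only converts to a string dtype at the end.

```python
import numpy as np
import pandas as pd

# Hypothetical decoded values; a real reader would produce these incrementally.
decoded = ["x", "y", None]

# Preallocate an object-dtype buffer and fill it as values are read.
buf = np.empty(len(decoded), dtype=object)
for i, value in enumerate(decoded):
    buf[i] = value

# Hand the filled buffer to pandas, then convert afterwards if desired.
df = pd.DataFrame({"col": buf})
converted = df["col"].astype("string")
print(converted.dtype)
```

Skipping the final astype call and leaving the column as handed over is the "leave the conversion to pandas" variant; which dtype pandas then picks depends on the pandas version and on the infer_string option.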
you can always read the strings as you do now
Probably we'll continue to produce numpy object columns while we can, but we still have to deal with the str type when writing.
I'll get back to you on the two issues, thanks for letting me know.
Pandas decided to introduce a default string dtype (which will be used by default instead of object-dtype when inferring values to be strings), see https://pandas.pydata.org/pdeps/0014-string-dtype.html for the details (and https://github.com/pandas-dev/pandas/issues/54792 for progress of implementation).
This is already available in the main branch of pandas (and will also be in an upcoming 2.3 release) behind a feature flag: pd.options.future.infer_string = True.
Right now, if you enable this flag (with a nightly version of pandas) and use fastparquet to write a dataframe with a string column, this errors as follows (because fastparquet is not yet aware of the new dtype):