Open r-stiller opened 2 years ago
I would say that "corrupted" is not correct here, since you do not get incorrect data but rather a workflow that doesn't complete due to an exception (sorry, I know this is just semantics).
I would say that the behaviour in this case is reasonable, if not exactly what you were expecting. Loading a partitioned dataset does indeed cast the partition columns to categorical - I'm not sure if this is explicitly mentioned anywhere in the docs. For the (common) case of a string field, this is of great advantage for the memory footprint on read. It is, perhaps, surprising that pandas will not concatenate these dataframes.
On write, the original dtype of the columns used for partitioning is saved in the metadata, so it would be a reasonable request that the loaded dataset should have exactly the same types as the original.
May I suggest that the following workaround is a reasonable way to cope? This should not do any unnecessary copying.
`pd.concat([df1_load.astype(df1.dtypes), df2])`
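To illustrate the workaround, here is a pandas-only sketch; the fastparquet round trip is simulated by an explicit cast of the partition column to categorical, and the frame names (`df1`, `df1_load`, `df2`) follow the discussion but are otherwise illustrative.

```python
import pandas as pd

# df1 is the original frame; df1_load stands in for the version read
# back from a partitioned dataset, where the partition column "B"
# has come back as categorical instead of int64.
df1 = pd.DataFrame({"A": [1.0, 2.0], "B": [0, 1]})
df1_load = df1.astype({"B": "category"})
df2 = pd.DataFrame({"A": [3.0], "B": [2]})

# Restore the original dtypes before concatenating; astype accepts a
# dict-like mapping of column name -> dtype, so df1.dtypes works directly.
result = pd.concat([df1_load.astype(df1.dtypes), df2])
print(result["B"].dtype)  # int64
```

Columns whose dtype already matches are passed through by `astype`, so this should indeed avoid unnecessary copying of unaffected columns.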
Thank you for the reply. You're right, "corrupted" is the wrong term here.
I do understand that casting to categorical has its benefits, and I wouldn't have noticed if the CategoricalBlock had the correct size. I think that point should be fixed (the dtype of the copy is still categorical, but concatenation works).
Maybe the type casting should be added to the docs, too.
I had to change your workaround a little to `pd.concat([df1_load[:], df2])`, since my df1 is not always available (e.g. when starting my program and only loading data from disk).
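A small pandas-only sketch of the adjusted workaround (the frames here are illustrative; in the report, `df1_load` came from a partitioned parquet file): slicing with `[:]` produces a new DataFrame without needing the original `df1` or its dtypes around, which in the issue was enough to rebuild the faulty block.

```python
import pandas as pd

# df1_load stands in for a frame read from disk when the original
# df1 (and hence df1.dtypes) is no longer available; the partition
# column "B" is categorical after loading.
df1_load = pd.DataFrame({"A": [1.0, 2.0], "B": pd.Categorical([0, 1])})
df2 = pd.DataFrame({"A": [3.0], "B": [2]})

# Slicing the whole frame yields a distinct DataFrame with the same
# contents and dtypes (note "B" stays categorical, unlike the astype
# variant, so the dtype mismatch with df2 remains).
copy = df1_load[:]
print(copy is df1_load)       # False
print(copy.equals(df1_load))  # True

result = pd.concat([copy, df2])
print(len(result))  # 3
```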
> I do understand that casting to categorical has its benefits and I wouldn't have noticed if the CategoricalBlock had the correct size.
Oh, I see. That would be something in `fastparquet.df.empty`, but probably we'd need to pull in someone from pandas to get the incantation correct. As you can see in that code, categoricals are explicitly handled in a couple of different places, and I'm really not sure where this behaviour might be coming from.
cc @jbrockmendel
that size change definitely looks wonky. I can reproduce it locally with pandas 1.3.4 but not with master (we're looking at 1.3.5 in the next few days and 1.4 late Dec).
Is fp constructing the CategoricalBlock directly, or maybe pinning its `.values` attribute after `__init__`?
If it's not an issue on upcoming pandas, and we have a workaround here, I am tempted not to take any action.
What happened: Error when working with a pandas.DataFrame that has been loaded from a partitioned parquet file.
What you expected to happen: A non-corrupted DataFrame.
Minimal Complete Verifiable Example:
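(The original snippet is not reproduced here; the following is a pandas-only sketch of the setup described in the thread, with the fastparquet write/load round trip simulated by an explicit cast of the partition column. Names like `df1` and `df1_loaded` follow the text; everything else is illustrative.)

```python
import pandas as pd

# Original frame; "B" is the column used for partitioning and is int64.
df1 = pd.DataFrame({"A": [1.0, 2.0], "B": [0, 1]})

# Simulate what loading a partitioned dataset does: the partition
# column comes back as categorical rather than its original int64.
df1_loaded = df1.astype({"B": "category"})

print(df1["B"].dtype)         # int64
print(df1_loaded["B"].dtype)  # category
```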
Anything else we need to know?: The DataFrame works again after making a copy of it: `df1_loaded = df1_loaded.copy()`. When you compare the `__dict__`s of the original and the copy, you can see that the CategoricalBlock size has changed from 0 to 2. I guess this error is a direct consequence of #653, since the dtype of column 'B' is changed from int64 to categorical.
Environment: