onlyjsmith opened this issue 1 day ago
hi @onlyjsmith - thank you for the issue! interesting, will take a look
just to document the process, i'm noticing that the hash of multiple objects is stable within the same process, but not between processes
@onlyjsmith ok!
The issue appears to occur because DataFrame objects can't be directly JSON serialized, causing the hash to fall back to cloudpickle, which includes non-deterministic elements between runs.
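here's a minimal sketch of the failure mode as I understand it (the DataFrame here is hypothetical) - JSON serialization raises, and hashing the cloudpickle bytes is only stable within a single interpreter session:

```python
# sketch: why the cloudpickle fallback is unstable between runs
# (run this script twice and compare the output)
import hashlib
import json

import cloudpickle
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3]})

try:
    json.dumps(df)  # DataFrames are not JSON serializable
except TypeError:
    # fallback path: pickle bytes can embed session-specific state,
    # so this digest may change from one process to the next
    print(hashlib.md5(cloudpickle.dumps(df)).hexdigest())
```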
in the short term, you can use the to_dict method to dump the df:
data_dict = data.to_dict(orient="split")  # plain dict of lists is JSON-serializable
combined_hash = hash_objects(data_dict, config)
I'll look more into what exactly is changing between runs here
these docs may be useful to you
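for reference, here's a sketch of wiring that workaround into a task's cache key - this assumes Prefect 2.x's cache_key_fn hook, and compute_values / the parameter names are hypothetical:

```python
from prefect import task
from prefect.utilities.hashing import hash_objects


def df_cache_key(context, parameters):
    # serialize the DataFrame to a plain dict first so hash_objects
    # takes the stable JSON path instead of the cloudpickle fallback
    data_dict = parameters["data"].to_dict(orient="split")
    return hash_objects(data_dict, parameters["config"])


@task(cache_key_fn=df_cache_key)
def compute_values(data, config):
    ...
```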
@zzstoatzz thanks for the rapid reply and for digging in. The docs link is helpful, but can I check if using a DataFrame as a cache key would be viewed as a good approach (if it was reliable)?
We take as input a geospatial polygon (as a GeoPandas GeoDataFrame) and use it to calculate a deterministic set of values for that area. The simplest approach would be to just pass the GeoDataFrame into the function and have it hit the cache if it can.
> can I check if using a DataFrame as a cache key would be viewed as a good approach (if it was reliable)?
I think it depends on when you want to invalidate the cache. The simplest strategy in my mind would be something like the to_dict snippet above, such that if the dataframe coming in was different than last time, that data_dict would also be different, i.e. dataframes with novel values would invalidate your cache.
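for the GeoDataFrame case specifically, a sketch of the same idea - this assumes the polygon's GeoJSON text (via GeoDataFrame.to_json) is an acceptable stand-in for its values, and the names here are hypothetical:

```python
import geopandas as gpd
from prefect.utilities.hashing import hash_objects


def area_cache_key(context, parameters):
    gdf: gpd.GeoDataFrame = parameters["area"]
    # GeoJSON is plain text, so hashing it avoids the pickle fallback;
    # any change to geometry or attributes produces a new cache key
    return hash_objects(gdf.to_json(), parameters["config"])
```

identical polygons would then hit the cache, and novel values would invalidate it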
Bug summary
I get a consistent hash when hashing the data and config objects alone, but an inconsistent hash when putting them together.

EDIT: should point out that I got to this after finding the Prefect cache wasn't being hit when I thought it should.
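A minimal sketch of the behavior described (the data and config values here are hypothetical):

```python
import pandas as pd
from prefect.utilities.hashing import hash_objects

data = pd.DataFrame({"a": [1, 2, 3]})
config = {"resolution": 10}

print(hash_objects(data))          # reported consistent on its own
print(hash_objects(config))        # reported consistent on its own
print(hash_objects(data, config))  # reported to change between runs
```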
Version info
Additional context