Hamilton helps data scientists and engineers define testable, modular, self-documenting dataflows that encode lineage/tracing and metadata. It runs and scales everywhere Python does.
Problem
With caching enabled, the two functions first() and second() collide: on the 2nd execution (i.e., when values are retrieved from the cache), both return pd.Timestamp("2021-01-01").
Solution
The bug stems from an oddity of pandas: some objects, such as pd.Timestamp, have an empty __dict__ attached. A one-line condition check now catches this case and properly displays the intended warning message.
When it encounters this case, Hamilton assigns a random UUID as the data version instead of hashing the value. Caching still works because the cache key is NODE_NAME-CODE_VERSION-DATA_VERSION, where the data version is now the random UUID.
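To make the key layout concrete, here is a toy model of why a random UUID still allows cache hits. It is only a sketch, not Hamilton's implementation: `cache_key`, `retrieve`, the code-version strings, and the stored values are all hypothetical, and it assumes the UUID is recorded on the first execution and reused for later lookups.

```python
import uuid


def cache_key(node_name: str, code_version: str, data_version: str) -> str:
    """Join the parts in the NODE_NAME-CODE_VERSION-DATA_VERSION layout."""
    return f"{node_name}-{code_version}-{data_version}"


# Hypothetical code versions; the real ones are hashes of each node's source.
code_versions = {"first": "code-a", "second": "code-b"}

# Unhashable values get a random UUID, recorded once on the first execution.
data_versions = {node: str(uuid.uuid4()) for node in code_versions}

# First execution: store each (placeholder) result under its own key.
placeholder_results = {"first": "2021-01-01", "second": "2021-01-02"}
result_store = {}
for node, value in placeholder_results.items():
    key = cache_key(node, code_versions[node], data_versions[node])
    result_store[key] = value


def retrieve(node: str) -> str:
    """Second execution: rebuild the key from the recorded versions."""
    key = cache_key(node, code_versions[node], data_versions[node])
    return result_store[key]
```

Because each node's data version is unique, the two keys can no longer collide, and each node reads back its own value.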
Investigation
The reproduction worked on my machine.
dr.cache.code_versions shows that first() and second() produce different code_version hashes.
dr.cache.data_versions shows that they produce the same data_version hash (the source of the bug).
Let's look at the per-type hashing functions in hamilton.caching.fingerprinting:
hash_pandas_obj() expects a pd.Series or pd.DataFrame and doesn't handle pd.Timestamp (it would handle a series of pd.Timestamp values without collisions, though).
The data_versions match the result of the default implementation, hash_value(), which falls through to hash_value(pd.Timestamp(...).__dict__).
It turns out that pd.Timestamp(...).__dict__ == {}, which is odd behavior on the part of pandas (the attribute should be undefined rather than empty).
Currently, hash_value() handles objects without a __dict__ (i.e., those that use slots), but doesn't account for an empty __dict__ attribute.
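The failure mode traced above can be sketched with the standard library alone. This is a simplified stand-in, not the code in hamilton.caching.fingerprinting: `naive_hash_value`, `fixed_hash_value`, and the `Opaque` class are hypothetical names, and `Opaque` only mimics the reported pd.Timestamp behavior (state stored outside an empty `__dict__`).

```python
import hashlib
import uuid
import warnings


class Opaque:
    """Stand-in for a type whose state lives outside its (empty) __dict__."""

    __slots__ = ("_value", "__dict__")

    def __init__(self, value: str):
        self._value = value  # stored in the slot, so __dict__ stays empty


def _hash_state(state: dict) -> str:
    return hashlib.sha256(repr(sorted(state.items())).encode()).hexdigest()


def naive_hash_value(obj) -> str:
    """Buggy fallback: hashes __dict__ whenever the attribute exists."""
    if hasattr(obj, "__dict__"):
        return _hash_state(obj.__dict__)
    return str(uuid.uuid4())


def fixed_hash_value(obj) -> str:
    """Treats an empty __dict__ the same as a missing one."""
    state = getattr(obj, "__dict__", None)
    if not state:  # missing (slots only) or empty
        warnings.warn(f"Cannot hash {type(obj).__name__}; using a random UUID.")
        return str(uuid.uuid4())
    return _hash_state(state)


a, b = Opaque("2021-01-01"), Opaque("2021-01-02")
collides = naive_hash_value(a) == naive_hash_value(b)      # both hash {}
distinct = fixed_hash_value(a) != fixed_hash_value(b)      # two random UUIDs
```

In the naive version every object with an empty `__dict__` hashes to the same value, which is the collision described above. The extra emptiness check is the one-line condition in spirit.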
Side notes
Warnings are only shown when hashing the value, meaning they are typically displayed only on the first execution (the value is retrieved from the cache on subsequent executions). This is desirable behavior.
It's important to clear the on-disk cache when debugging this; in-memory caching can be used for simplicity.
Fixes #1242; copying reply from the issues thread