Will investigate today! The issue is more likely related to the handling of pd.Timestamp objects than to the functions first() and second() per se.
- dr.cache.code_versions shows that first() and second() produce different code_version hashes.
- dr.cache.data_versions shows that they produce the same data_version hash (the source of the bug).
- hamilton.caching.fingerprinting hash_pandas_obj() expects a pd.Series or pd.DataFrame and doesn't handle pd.Timestamp (it would, however, handle a series of pd.Timestamp values without collisions).
- The data_versions match the result of the default implementation hash_value(), which falls back to hash_value(pd.Timestamp(...).__dict__).
- pd.Timestamp(...).__dict__ == {}, which is odd behavior by pandas (it should be undefined rather than defined but empty); see the snippet below.
- hash_value() handles objects without a __dict__ (i.e., ones that use slots), but doesn't account for an empty .__dict__ attribute.
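The collision can be seen outside Hamilton. The snippet below is a minimal sketch (not Hamilton's actual fingerprinting code) showing that, given the empty __dict__ reported above, any hash derived from __dict__ is identical for different timestamps:

```python
import hashlib
import pandas as pd

first = pd.Timestamp("2024-01-01")
second = pd.Timestamp("2024-06-15")

# pd.Timestamp defines __dict__ but leaves it empty (the odd pandas behavior noted above).
print(first.__dict__, second.__dict__)  # {} {}


def naive_dict_hash(obj) -> str:
    """Naive fallback that hashes __dict__, standing in for the hash_value() base case."""
    return hashlib.sha256(repr(obj.__dict__).encode()).hexdigest()


# Both timestamps yield the same digest -> same data_version -> one shared cache entry.
print(naive_dict_hash(first) == naive_dict_hash(second))  # True
```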
The fix: change the check for the base case in hash_value() to also handle a .__dict__ that is empty. A one-line condition check now properly displays the intended warning messages. Instead of hashing the value, a random UUID is provided. Caching can still work because the cache key is NODE_NAME-CODE_VERSION-DATA_VERSION, where the data version is now a random UUID.
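A minimal sketch of the kind of guard described above; this is illustrative only, not the actual hamilton.caching.fingerprinting code, and the function body is simplified:

```python
import hashlib
import uuid


def hash_value(obj) -> str:
    """Simplified stand-in for a default hash_value() fallback."""
    obj_dict = getattr(obj, "__dict__", None)
    # Treat an empty __dict__ (e.g., pd.Timestamp) like a missing one:
    # emit the intended warning and return a random UUID instead of a hash,
    # so distinct values can no longer collide on data_version.
    if not obj_dict:
        # logger.warning("Cannot hash %r reliably; using a random UUID as data_version", obj)
        return uuid.uuid4().hex
    # Normal case: hash the attribute dict (placeholder for the real recursion).
    return hashlib.sha256(repr(sorted(obj_dict.items())).encode()).hexdigest()
```

The warning text and hashing of the attribute dict above are placeholders; the point is only the empty-__dict__ guard, which keeps the NODE_NAME-CODE_VERSION-DATA_VERSION key unique for values that cannot be fingerprinted.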
Current behavior
It appears that when caching is activated for two functions with the same signature in the same file, and those functions are similar enough, they are mapped onto the same cache key. This can be seen in the cache directory, where only one cache file is created, and on a rerun of the DAG, where both nodes receive the same cached value.
Stack Traces
There is no crash.
Steps to replicate behavior
Create and run a Jupyter notebook with the following cells (the issue is also present in actual modules, outside Jupyter).
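The notebook cells from the original report are not reproduced here. Below is a minimal sketch of how such a replication could look, assuming two functions that return pd.Timestamp values and Hamilton's Builder().with_cache() entry point; the module name and function bodies are illustrative, not the issue's exact code.

```python
# timestamps_module.py -- hypothetical module mirroring the issue's first()/second()
import pandas as pd


def first() -> pd.Timestamp:
    return pd.Timestamp("2024-01-01")


def second() -> pd.Timestamp:
    return pd.Timestamp("2024-06-15")
```

```python
# Driver cell (assumes the caching Builder API; dr.cache is referenced in the comment above)
from hamilton import driver

import timestamps_module

dr = driver.Builder().with_modules(timestamps_module).with_cache().build()

print(dr.execute(["first", "second"]))  # first run: two distinct timestamps
print(dr.execute(["first", "second"]))  # rerun: `second` comes back with `first`'s cached value
```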
As one can see, on the rerun the second timestamp gets the cached value of the first variable, as if the function name were not part of the cache key.

Library & System Information
python=3.11.8, sf-hamilton=1.83.2
Expected behavior
I would expect the results of first and second to be cached under distinct keys, so that each node receives its own value on a rerun.