DAGWorks-Inc / hamilton

Hamilton helps data scientists and engineers define testable, modular, self-documenting dataflows, that encode lineage/tracing and metadata. Runs and scales everywhere python does.
https://hamilton.dagworks.io/en/latest/
BSD 3-Clause Clear License
1.88k stars 126 forks source link

Cache key collision for simple functions in the same module #1242

Closed poldpold closed 1 hour ago

poldpold commented 17 hours ago

Current behavior

It appears that when activating caching for two functions with the same signature, in the same file, when those functions are similar enough, they are mapped onto the same cache key. This can be seen in the cache directory, where only one cache file is created, and on the rerun of the DAG where both nodes receive the same cached value.

Stack Traces

There is no crash.

Steps to replicate behavior

Create and run a jupyter notebook with the following cells. (the issue is also present in actual modules, outside jupyter)

%load_ext hamilton.plugins.jupyter_magic
%%cell_to_module -m test_module --display --rebuild-drivers

from hamilton.function_modifiers import cache
import pandas as pd

def first() -> pd.Timestamp:
    return pd.Timestamp("2021-01-01")

def second() -> pd.Timestamp:
    return pd.Timestamp("2021-01-02")
from hamilton import driver
import test_module

dr = (
    driver
    .Builder()
    .with_config({})
    .with_modules(test_module)
    .with_cache(path=".")
    .build()
)
dr.execute(final_vars=["first", "second"])
>> {'first': Timestamp('2021-01-01 00:00:00'),
>> 'second': Timestamp('2021-01-02 00:00:00')}
dr.execute(final_vars=["first", "second"])
>> {'first': Timestamp('2021-01-01 00:00:00'),
>>  'second': Timestamp('2021-01-01 00:00:00')}

As one can see, one the rerun the second timestamp gets the cache value of the first variable, as if function name was not part of the cache key.

Library & System Information

python=3.11.8, sf-hamilton=1.83.2

Expected behavior

I would expect the result

>> {'first': Timestamp('2021-01-01 00:00:00'),
>>  'second': Timestamp('2021-01-02 00:00:00')}
zilto commented 11 hours ago

Will investigate today! The issue is more likely related to the handling of pd.Timestamp objects than the functions first() and second() per se.

zilto commented 10 hours ago

Investigation

Solution

Change the check for the base case in hash_value() to handle .__dict__ that are empty. A one line condition check now properly display the intended warning messages. Instead of hashing the value, a random UUID is provided. Caching can still work because the cache key is NODE_NAME-CODE_VERSION-DATA_VERSION where data version is now a random UUID

image

side notes