DAGWorks-Inc / hamilton

Hamilton helps data scientists and engineers define testable, modular, self-documenting dataflows, that encode lineage/tracing and metadata. Runs and scales everywhere python does.
https://hamilton.dagworks.io/en/latest/
BSD 3-Clause Clear License
1.88k stars 126 forks source link

fingerprinting base case handles empty __dict__ #1243

Closed zilto closed 3 days ago

zilto commented 3 days ago

Fixes #1242; copying reply from the issues thread

Problem

Using the cache with these two functions would create a collision and both would return pd.Timestamp("2021-01-01") on the 2nd execution (i.e., when retrieving values from cache)

import pandas as pd

def first() -> pd.Timestamp:
  return pd.Timestamp("2021-01-01")

def second() -> pd.Timestamp:
  return pd.Timestamp("2021-01-02")

Solution

The source of the bug is an oddity of pandas. Some objects have an empty __dict__ attached. A one line condition check now properly displays the intended warning messages.

When encountering this case, Hamilton gives a random UUID instead of hashing the value. Caching can still work because the cache key is NODE_NAME-CODE_VERSION-DATA_VERSION where data version is now a random UUID

image

Investigation

side notes