DAGWorks-Inc / hamilton

Hamilton helps data scientists and engineers define testable, modular, self-documenting dataflows, that encode lineage/tracing and metadata. Runs and scales everywhere python does.
https://hamilton.dagworks.io/en/latest/
BSD 3-Clause Clear License
1.71k stars 107 forks source link

Expose feature name to data savers #312

Open elijahbenizzy opened 1 year ago

elijahbenizzy commented 1 year ago

Courtesy of @janhurst: https://hamilton-opensource.slack.com/archives/C03MANME6G5/p1693385993967759.

Is your feature request related to a problem? Please describe. I've been paying around with the data savers and have implemented my own @save_to functionality. I have something that looks a little like this:

@save_to.feature(feature_name="my_feature_1")
def my_feature_1(foo: pd.Series, bar: pd.Series) -> pd.Series:
    return foo + bar

my saving logic is some existing infrastructure but essentially is just using the name to figure out the right place to save to.... but really i want the function name to be accessible in my DataSaver.... i couldn't see that this available at present... any ideas on how I could achieve this?

Describe the solution you'd like Let's add a __node_name field to the dataclass that gets auto-populated in data savers. We could theoretically add other metadata later on that starts with __, but I think this is a good start.

Describe alternatives you've considered The approach @janhurst is taking now is the right way, but this seems like a nice feature to have.

Additional context Add any other context or screenshots about the feature request here.

janhurst commented 1 year ago

I also originally got a little bit confused about the name attribute in the DataSaver, as that was what I wanted to use to give my DataSaver (instance!?) the "name" I was saving to... but having __node_name sounds great (assuming it is the name of the wrapped node not the saver node?)

It also too me a bit to get my head around the default output name that the @save_to wrapper ended up with... at first I was expecting my DataSaver to be implicitly called whenever I called a node.

The materializer semantics probably make a bit more sense here... what would __node_name see when the saver is called from a materializer?

elijahbenizzy commented 1 year ago

Ok, I think we can clarify this quite a bit -- currently it requires some details of how the system works.

The name in the materializer is the name of the node that does the materialization. The materialize call effectively creates the node that does it (as well as a node that joins the result, which won't exist if it's only materializing a single node, say a dataframe).

In the materialize case, __node_name will refer to the node that it immediately follows/whose result it saves. So, some nuance around which node that is in the case of the combiner getting applied, it that's an edge case.

The exciting news is that I've implemented it on my branch and should have a pr and a release soon!

skrawcz commented 1 month ago

@elijahbenizzy did you complete this?