DAGWorks-Inc / hamilton

Hamilton helps data scientists and engineers define testable, modular, self-documenting dataflows, that encode lineage/tracing and metadata. Runs and scales everywhere python does.
https://hamilton.dagworks.io/en/latest/
BSD 3-Clause Clear License

parameterize decorator should add values to default_parameter_values #1124

Open vograno opened 2 months ago

vograno commented 2 months ago

Is your feature request related to a problem? Please describe.

When a vanilla node is created, the default values of its parameters, if any, are available via the node.default_parameter_values attribute. However, when a node is created by @parameterize, the values passed to the decorator are not available via that attribute; in fact, they are not available from the node object at all.

The point I am making is that those values are essentially default values for the expanded node and should be surfaced in the same fashion. This feature would make expanded nodes equivalent to vanilla nodes in this regard.
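For reference, this is how a vanilla node surfaces its defaults today (a rough sketch; driver.graph is an internal API and I'm going off the attribute name described above):

from hamilton import driver
import __main__ as main

def fun(x: dict = {"a": 1}) -> dict:
    return x

if __name__ == "__main__":
    dr = driver.Builder().with_modules(main).build()
    node = dr.graph.nodes["fun"]
    # the Python-level default is surfaced on the internal node object
    print(node.default_parameter_values)  # expected: {'x': {'a': 1}}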

Describe the solution you'd like

Define a node:

from hamilton.function_modifiers import parameterize, value

x_value = {}

@parameterize(output_node=dict(x=value(x_value)))
def fun(x: dict) -> dict:
    return x

The following should hold

node = driver.graph.nodes['output_node']
# the passed in x_value and the stored value are the same objects
assert 'x' in node.default_parameter_values
assert node.default_parameter_values['x'] is x_value

# the passed in x_value and the object returned by the node are the same objects
result = driver.execute(final_vars=['output_node'])
assert result['output_node'] is x_value
skrawcz commented 2 months ago

To clarify, you wouldn't expect the following to work normally though right?

# this should not work
result = driver.execute(final_vars=['output_node'], inputs={"x": {"some": "value"}})

But I think I see where you're coming from now. The tricky part is how you would handle this:

x_value = {}
y_value = {}
@parameterize(
  output_node=dict(x=value(x_value)),
  output_node2=dict(x=value(y_value)),
)
def fun(x: dict) -> dict:
    return x

Or is it the case that all of them have the same default value?


In the interim you could do something functionally equivalent like this:

@tag(marker="default") # could identify this as such
def x() -> dict:
    return {}  # or reference some variable like you have above in the module

@tag(marker="parameterized")
@parameterize(output_node=dict(x=source("x")))
def fun(x: dict) -> dict:
    return x

then:

  1. ask the graph for all parameterized nodes via dr.list_available_variables({"TAG_NAME": "TAG_VALUE"})
  2. ask the graph for all default nodes via dr.list_available_variables({"TAG_NAME": "TAG_VALUE"})
  3. figure out which default nodes feed which parameterized nodes
  4. default nodes provide the originating function, so you can exercise it to get the value.
  5. if you wanted to change that value, you could just use overrides= via .execute()... (rough sketch below)
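Roughly, those steps could look like this (a sketch; my_module is a stand-in for the module holding the tagged functions above, and it filters on the .tags attribute of the returned variables rather than passing a tag filter):

from hamilton import driver
import my_module  # stand-in for the module containing the tagged functions above

dr = driver.Builder().with_modules(my_module).build()
all_vars = dr.list_available_variables()

# 1. & 2. split nodes by the marker tag attached above
parameterized_vars = [v for v in all_vars if v.tags.get("marker") == "parameterized"]
default_vars = [v for v in all_vars if v.tags.get("marker") == "default"]

# 3. & 4. the default function above takes no arguments, so calling its
# originating function recovers the default value
default_values = {v.name: v.originating_functions[0]() for v in default_vars}

# 5. swap in a different value at execution time via overrides=
result = dr.execute(["output_node"], overrides={"x": {"some": "other value"}})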
vograno commented 2 months ago

To clarify, you wouldn't expect the following to work normally though right?

# this should not work
result = driver.execute(final_vars=['output_node'], inputs={"x": {"some": "value"}})

Indeed, this shouldn't work.

But I think I see where you're coming from now. The tricky part is how you would handle this:

x_value = {}
y_value = {}
@parameterize(
  output_node=dict(x=value(x_value)),
  output_node2=dict(x=value(y_value)),
)
def fun(x: dict) -> dict:
    return x

Or is it the case that all of them have the same default value?

Do you mean how I would handle it, or how it should be implemented? In terms of handling, this is precisely what I need: I have two nodes, output_node and output_node2; they return distinct values, and I can access those values via the respective node objects.

Anyway, the default values are always different, i.e. x_value and y_value are distinct objects. This is in fact the whole point of it.

If it helps, one can add a node.bound_parameter_values property and store, or compute on the fly, the parameterize-bound values there instead of on default_parameter_values.

By 'compute on the fly' I mean, for example, the case where the callable, which is already stored on the node object, is a functools.partial object and you can pull the info from there.

In fact, if this is indeed the case and the callable is surfaced, I should be able to pull this info myself, at least in the interim.

skrawcz commented 2 months ago

So if you use driver.graph.nodes you're using internal APIs... but yes this works:


from hamilton.function_modifiers import parameterize, value

x_value = {"b": 2}
y_value = {"a": 1}

@parameterize(
  output_node=dict(x=value(x_value)),
  output_node2=dict(x=value(y_value))
)
def fun(x: dict, i: int) -> dict:
    return x

if __name__ == "__main__":

    from hamilton import driver
    import __main__ as main

    dr = driver.Builder().with_modules(main).build()

    # this is using internal APIs -- but yes you have access to the bound variables
    print(list(dr.graph.nodes.values())[0].callable.keywords)
    print(list(dr.graph.nodes.values())[1].callable.keywords)

However, the more "blessed path" would be to go via:

    vars = dr.list_available_variables()

However, it seems that there's a 🐛 because the partial function isn't exposed in originating_functions -- looking at the code, there are some slight inconsistencies with how expander decorators set the originating_functions field. So, four paths to get what you need:

  1. use internal Node and inspect the callable.
  2. we add a func field to HamiltonNode, or figure out the right way to add the partial function to the originating_functions field -- then you'd inspect that callable. CC @elijahbenizzy.
  3. we wire through a "bound variable" construct in Node through to HamiltonNode.
  4. you model the default value via the output of a function and wire it with source.
skrawcz commented 2 months ago

With https://github.com/DAGWorks-Inc/hamilton/pull/1125 then this would work:

vars[1].originating_functions[1].keywords

should have the partial function and what it is bound to.

elijahbenizzy commented 2 months ago

OK, so it seems like the requirement is to figure out the right way to expose this value. I think this clarifies that how we're doing it is not perfect IMO -- the concept of a "bound" variable is nice, and should be wired through.

We often do a ton of function-algebra (composing callables with other callables), but really in the case of parameterize/inject (and maybe even default values), having a bound kwarg concept would actually simplify the code.

I'm uncomfortable with layering the keywords onto the function and inspecting that -- it's too much internals. Unless we make everything a partial. But the partial application could be done within the node's callable with the originating function (in most cases). That said, if each node had:

(1) a function (2) a concept of bound variables

We could pretty easily expose this all with a bit of refactoring -- I can take a look soon. My take from Slack is that this is not an immediate blocker, more of an ergonomics issue, but let us know how it fits into your workflow!

vograno commented 2 months ago

Thank you guys. I appreciate your receptiveness and responsiveness. Take your time.

A couple of notes just in case

  1. For the function, consider leveraging the inspect module approach, the function signature in particular.
  2. For the bound variables, I assume both sources and values bindings will be provided.
elijahbenizzy commented 2 months ago

Thank you guys. I appreciate your receptiveness and responsiveness. Take your time.

A couple of notes just in case

  1. For the function, consider leveraging the inspect module approach, the function signature in particular.
  2. For the bound variables, I assume both sources and values bindings will be provided.

Yep! We use inspect heavily in implementing Hamilton :) Quite a useful module (and Hamilton would be impossible without it). Re: bound variables -- I think it's fundamentally different. The source bindings will actually be edges in the graph -- thus we wouldn't need to expose them (they're available now if you just ask for the dependencies...).

vograno commented 2 months ago

But I wanted to know the name of the function parameter which the source binds to. This doesn't affect DAG execution, but it does affect the function-algebra stuff you mentioned earlier.

vograno commented 2 months ago

Here is a specific example

@parameterize(
  o1=dict(x=source('fun_x_input')),
)
def fun(x: dict, i: int) -> dict:
    return x

Assume that 'fun_x_input' is an input node and it feeds into fun only. However, the value of fun_x_input is not free; rather, it depends on the exact function and the exact argument of the function it feeds into. I want the ability to examine the node, figure out the function and the argument, and compute the implied input value for fun_x_input. Then I'll execute the DAG with fun_x_input set to the computed value. In general, the implied value for fun_x_input depends on the DAG topology, including the node function names and signatures, and some user-provided input.

elijahbenizzy commented 2 months ago

Here is a specific example

@parameterize(
  o1=dict(x=source('fun_x_input')),
)
def fun(x: dict, i: int) -> dict:
    return x

Assume that 'fun_x_input' is an input node and it feeds into fun only. However, the value of fun_x_input is not free; rather, it depends on the exact function and the exact argument of the function it feeds into. I want the ability to examine the node, figure out the function and the argument, and compute the implied input value for fun_x_input. Then I'll execute the DAG with fun_x_input set to the computed value. In general, the implied value for fun_x_input depends on the DAG topology, including the node function names and signatures, and some user-provided input.

Hmm, interesting... So the algebra of what happens under the hood is: (1) fun is created with input fun_x_input -- this is wired through to the original parameter x; (2) as fun_x_input does not exist in the DAG, a node named fun_x_input gets created.

So in the current framework, it's a bit hidden how exactly it wires to the underlying function. In fact, the only contract is that it will get passed in some way, not how it will get passed/that you can access it.

But to get back to your point -- if you know that fun_x_input is a dependency of `fun`, you should be able to do what you want?

all_vars = dr.list_available_variables()
fun_var, = [var for var in all_vars if var.name == "fun"]
deps = fun_var.required_dependencies | fun_var.optional_dependencies
user_inputs = [var for var in all_vars if var.is_external_input and var.name in deps]
# this should be the information you need, in the `source` case.

Note this doesn't get the value case, but it does allow you to inspect the source. The fact that it is bound with @parameterize is more an implementation detail (as it currently stands).

vograno commented 2 months ago

Are you saying there will be a var named 'fun' even though the generated node name is 'o1'?

elijahbenizzy commented 2 months ago

Are you saying there will be a var named 'fun' even though the generated node name is 'o1'?

Ahh, yes, sorry, didn't see that it was called o1, good point!

It should look like this:

all_vars = dr.list_available_variables()
o1_var, = [var for var in all_vars if var.name == "o1"]
deps = o1_var.required_dependencies | o1_var.optional_dependencies
user_inputs = [var for var in all_vars if var.is_external_input and var.name in deps]
# this should be the information you need, in the `source` case.

The fact that the function is named fun is lost in translation here -- it's not part of the contract.

Curious -- does this solve your problem? Or do you need further inspection into the way it was defined...

vograno commented 2 months ago

UPDATE: let me run your example to see what info I get.

I still need to know the function name and signature, and the exact argument it is wired to. One can argue that this info is encoded in the source name, but arguably that is a hack.

Compare it to wrappers in Python, https://docs.python.org/3/library/functools.html#functools.update_wrapper. It does provide access to the wrapped function via the __wrapped__ attribute. One can argue that a parameterized node is a wrapper around the function. If this information is too "internal" to make its way to HamiltonNode, it can be stored on the expanded node.
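For reference, this is the standard-library behavior I mean (plain Python, nothing Hamilton-specific):

import functools

def my_decorator(func):
    @functools.wraps(func)  # calls update_wrapper, which sets __wrapped__
    def wrapper(*args, **kwargs):
        return func(*args, **kwargs)
    return wrapper

@my_decorator
def fun(x: dict, i: int) -> dict:
    return x

# the original function, with its name and signature, stays reachable
assert fun.__wrapped__.__name__ == "fun"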

elijahbenizzy commented 2 months ago

Compare it to wrappers in Python, https://docs.python.org/3/library/functools.html#functools.update_wrapper. It does provide access to the wrapped function via the __wrapped__ attribute.

So I'm not 100% sure that these are analogous to standard decorators. The decorators actually all set an attribute on the function that is compiled later -- they don't modify the function at all. But the Hamilton graph does parse it.
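In other words, something like this (a quick illustration of the observable behavior; the attributes themselves are internal, so they're not shown here):

from hamilton.function_modifiers import parameterize, value

@parameterize(o1=dict(x=value({"a": 1})))
def fun(x: dict, i: int) -> dict:
    return x

# the decorator hands back the original function with metadata attached,
# so it is still a plain, directly callable function -- not a wrapper
print(fun({"b": 2}, 0))  # {'b': 2}
print(fun.__name__)      # 'fun'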

Some details here on how it is designed (for better or worse):

By the time it gets to the Hamilton graph, it has gone through multiple transformations. There's a lot of stuff we do based on the different decorators -- you can see the general process here. This is the function (or series of functions; there are cases in which we may have multiple) that gets called eventually. See this field. So I think that does give you access to the originating function. But it's not as straightforward a mapping as a simple decorator/modifying wrapper -- quite a few other transformations can occur. This also doesn't have the bindings (as we discussed) -- those occur in the creation of the callable (used by the internal Node), which is (currently) a black box. E.g. the decorators outline the contract (what values will be fixed to that function, what nodes will be attached), but they don't give you visibility one step further. This appears to be what you want?

Otherwise, I'd be really curious as to how you're using the data. Most designs are built around a core set of assumptions that help us decide what should be internal/publicly exposed, and it's always interesting to know how those might not be true for everyone! So if you have an explanation of how you might use this/what consumes this, that would be very helpful 🙏 And I'm curious whether the code I showed you works (it should get at what you're attempting, I think...).

vograno commented 2 months ago

It should look like this:

all_vars = dr.list_available_variables()
o1_var, = [var for var in all_vars if var.name == "o1"]
deps = o1_var.required_dependencies | o1_var.optional_dependencies
user_inputs = [var for var in all_vars if var.is_external_input and var.name in deps]
# this should be the information you need, in the `source` case.

The fact that the function is named fun is lost in translation here -- it's not part of the contract.

Curious -- does this solve your problem? Or do you need further inspection into the way it was defined...

I still need the function. What the code snippet gives me is the topology of the graph. The function is a property of the node. I might make different decisions depending on whether it's sin or log, even though signature-wise they are identical.

Let me suggest an alternative approach. How about allowing users to attach free-form attributes to nodes? Much like tags, but free-form. Unlike tags, Hamilton should never be tasked with interpreting attributes, just with accumulating them along node expansion paths and eventually storing them on a HamiltonNode.

Example 1 Common attributes for expanded nodes

a_value = {}
b_value = dict(abc=[])
common_attrs = dict(a=a_value, b=b_value)
@parameterize(
  o1=dict(x=source('o1_x_input')),
  o2=dict(x=source('o2_x_input')),
)
@attr(common_attrs)
def fun(x: dict, i: int) -> dict:
    return x

Here I expect HamiltonNodes o1.attributes and o2.attributes to be {**common_attrs} each.

Example 2 Provide distinct attributes for each expanded node

o1_attrs = dict(name='o1')
o2_attrs = dict(name='o2')
@parameterize(
  o1={'x': source('o1_x_input'), None: value(o1_attrs)},
  o2={'x': source('o2_x_input'), None: value(o2_attrs)},
)
@attr(common_attrs)
def fun(x: dict, i: int) -> dict:
    return x

Note the None key in the parameterize args. Here I expect HamiltonNode o1.attributes == {**common_attrs, **o1_attrs}.

Example 3

Use attributes to capture the function. I should be able to roll my own decorator, say capture_func_name, and use it like this:

@capture_func_name
def fun(x: dict, i: int) -> dict:
    return x

which should be equivalent to

def fun(x: dict, i: int) -> dict:
    return x

fun = attr(dict(func=fun))(fun)
elijahbenizzy commented 2 months ago

It should look like this:

all_vars = dr.list_available_variables()
o1_var, = [var for var in all_vars if var.name == "o1"]
deps = o1_var.required_dependencies | o1_var.optional_dependencies
user_inputs = [var for var in all_vars if var.is_external_input and var.name in deps]
# this should be the information you need, in the `source` case.

The fact that the function is named fun is lost in translation here -- it's not part of the contract. Curious -- does this solve your problem? Or do you need further inspection into the way it was defined...

I still need the function. What the code snippet gives me is the topology of the graph. The function is a property of the node. I might make different decisions depending on whether it's sin or log, even though signature-wise they are identical.

Let me suggest an alternative approach. How about allowing users to attach free-form attributes to nodes? Much like tags, but free-form. Unlike tags, Hamilton should never be tasked with interpreting attributes, just with accumulating them along node expansion paths and eventually storing them on a HamiltonNode.

Example 1 Common attributes for expanded nodes

a_value = {}
b_value = dict(abc=[])
common_attrs = dict(a=a_value, b=b_value)
@parameterize(
  o1=dict(x=source('o1_x_input')),
  o2=dict(x=source('o2_x_input')),
)
@attr(common_attrs)
def fun(x: dict, i: int) -> dict:
    return x

Here I expect HamiltonNodes o1.attributes and o2.attributes to be {**common_attrs} each.

Example 2 Provide distinct attributes for each expanded node

o1_attrs = dict(name='o1')
o2_attrs = dict(name='o2')
@parameterize(
  o1={'x': source('o1_x_input'), None: value(o1_attrs)},
  o2={'x': source('o2_x_input'), None: value(o2_attrs)},
)
@attr(common_attrs)
def fun(x: dict, i: int) -> dict:
    return x

Note the None key in the parameterize args. Here I expect HamiltonNode o1.attributes == {**common_attrs, **o1_attrs}.

Example 3

Use attributes to capture the function. I should be able to roll my own decorator, say capture_func_name, and use it like this:

@capture_func_name
def fun(x: dict, i: int) -> dict:
    return x

which should be equivalent to

def fun(x: dict, i: int) -> dict:
    return x

fun = attr(dict(func=fun))(fun)

So yeah, don't know your use-case/what you need -- my general skepticism is that the contract should be at the level of the graph, not the function.

E.G. if you have a node for sin or log, you should be able to capture that in the node name. And maybe use configuration to toggle between them. At a high level, I'm really curious about the data you want to store in the function itself, rather than the node -- to me that indicates that there is something missing in the graph/node abstraction, but I don't have the full context to quite place it...

That said, attrs seems reasonable, and not too far from what we have, actually. Setting aside, for a second, the intended use of tags (which, yes, is querying/processing): if you just JSON-dump your params, then you get around the types. You can then use tag_outputs to have different tags per node produced.

Now, the really cool thing is that you can define whatever higher-level decorator you want. E.G. you can define your own parameterization that:

  1. Captures the fn name + any other metadata
  2. Calls out to parameterize
  3. JSON-dumps the attrs
  4. Calls out to tag_outputs
  5. Returns the fn

And you'd be using Hamilton decorators as building blocks. Customizing/wrapping decorators is highly encouraged, given that we can only support so much surface area but want everyone to be able to satisfy their use-case.
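For example, a sketch of what that wrapper could look like (parameterize_with_attrs is a made-up name; parameterize and tag_outputs are the real building blocks, and stuffing everything into a single "attrs" tag is just one way to do it):

import json
from hamilton.function_modifiers import parameterize, tag_outputs

def parameterize_with_attrs(common_attrs=None, **node_params):
    """Hypothetical wrapper: parameterize, then JSON-dump attrs into a tag on each output node."""
    common_attrs = common_attrs or {}

    def decorator(fn):
        # 1. capture the fn name + any other metadata
        attrs = {**common_attrs, "func": fn.__name__}
        # 3. JSON-dump the attrs; 4. one tag set per produced node
        tags_per_node = {name: {"attrs": json.dumps(attrs)} for name in node_params}
        # 2. call out to parameterize, then tag_outputs; 5. return the fn
        fn = parameterize(**node_params)(fn)
        return tag_outputs(**tags_per_node)(fn)

    return decorator

Reading them back would then just be json.loads(var.tags["attrs"]) on the variables that dr.list_available_variables() returns.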

vograno commented 2 months ago

So yeah, don't know your use-case/what you need -- my general skepticism is that the contract should be at the level of the graph, not the function.

I agree. I got carried away with the idea of parameterized nodes being kind of wrappers. There is no need for this.

Using tags to store JSON is a good idea, thanks!

elijahbenizzy commented 2 months ago

So yeah, don't know your use-case/what you need -- my general skepticism is that the contract should be at the level of the graph, not the function.

I agree. I got carried away with the idea of parameterized nodes being kind of wrappers. There is no need for this.

Using tags to store JSON is a good idea, thanks!

To be clear, I think there might be room for more free-form tags/attrs -- that could be a valuable issue to open up!