DAGWorks-Inc / hamilton

Hamilton helps data scientists and engineers define testable, modular, self-documenting dataflows that encode lineage/tracing and metadata. Runs and scales everywhere Python does.
https://hamilton.dagworks.io/en/latest/
BSD 3-Clause Clear License
1.73k stars, 111 forks

Add resolved_kwargs to data_saver and data_loader tags #1136

Open: Riezebos opened this issue 5 days ago

Riezebos commented 5 days ago

Is your feature request related to a problem? Please describe.

When I have built a dataflow, I would like to be able to see which paths were passed to @load_from and @save_to.

Describe the solution you'd like

After executing the dataflow I can see the paths in the results, but I'd like to be able to see them without executing the dataflow.

Some metadata is already being written to tags: https://github.com/DAGWorks-Inc/hamilton/blob/main/hamilton/function_modifiers/adapters.py#L578

I tested adding the following line there:

                "hamilton.data_saver.kwargs": resolved_kwargs,

Then I tried running examples/parallelism/star_counting/run.py with the dr.execute statement replaced by:

    node = next(
        node for node in dr.list_available_variables() if node.name == "save.unique_stargazers"
    )
    print(node.as_dict()["tags"])

This gives the output I was hoping for:

{'hamilton.data_saver': True, 'hamilton.data_saver.sink': 'csv', 'hamilton.data_saver.classname': 'PandasCSVWriter', 'hamilton.data_saver.kwargs': {'path': 'unique_stargazers.csv'}}
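To list the paths for every saver at once, the same tag lookup can be wrapped in a small helper. This is only a minimal sketch: it works on plain dictionaries shaped like the output of `node.as_dict()` (assumed to expose `name` and `tags` keys), it relies on the `hamilton.data_saver.kwargs` tag added by the one-line patch above, and `saver_kwargs_by_node` is a hypothetical name, not part of Hamilton.

```python
from typing import Any, Dict, Iterable

SAVER_TAG = "hamilton.data_saver"
# Tag proposed in this issue; only present with the one-line patch applied.
KWARGS_TAG = "hamilton.data_saver.kwargs"


def saver_kwargs_by_node(node_dicts: Iterable[Dict[str, Any]]) -> Dict[str, Dict[str, Any]]:
    """Map each data-saver node name to the kwargs recorded in its tags."""
    result: Dict[str, Dict[str, Any]] = {}
    for node in node_dicts:
        tags = node.get("tags", {})
        if tags.get(SAVER_TAG):
            # Fall back to an empty dict when the kwargs tag is missing.
            result[node["name"]] = tags.get(KWARGS_TAG, {})
    return result


# Stand-in data for [n.as_dict() for n in dr.list_available_variables()].
example_nodes = [
    {"name": "unique_stargazers", "tags": {}},
    {
        "name": "save.unique_stargazers",
        "tags": {
            "hamilton.data_saver": True,
            "hamilton.data_saver.sink": "csv",
            "hamilton.data_saver.classname": "PandasCSVWriter",
            "hamilton.data_saver.kwargs": {"path": "unique_stargazers.csv"},
        },
    },
]

print(saver_kwargs_by_node(example_nodes))
```

With a real driver, the stand-in list would presumably be replaced by the dictionaries produced from `dr.list_available_variables()`.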

Describe alternatives you've considered

Maybe a custom DataLoader and DataSaver that store the arguments they were initialized with?

skrawcz commented 5 days ago

@Riezebos thanks for the issue. This sounds similar to another conversation @elijahbenizzy and @vograno were having about exposing bound values...

Question on your intended user experience. To confirm, it seems you'd be happy getting this via the node object you have above?

Riezebos commented 5 days ago

Yes, for me that would be great!

If I try to think of a potentially better UX, disregarding how the driver and tags are currently implemented, it might look something like:

    # Get a node by name without iterating over all of them (a dictionary would also work).
    node = dr.get_node("save.unique_stargazers")
    if node.data_saver and node.data_saver.name == "csv":
        print(node.data_saver.kwargs)

But adding it to the tags that are already implemented would be a great solution in my opinion :)
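Until something like `dr.get_node` exists, a similar lookup can be approximated by indexing the result of `dr.list_available_variables()` once. The sketch below uses a stand-in class instead of a real driver so that it runs on its own; `FakeVariable` and `index_by_name` are hypothetical names, and the assumption is that the real objects expose `name` and `tags` attributes in the same way.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, Iterable


@dataclass
class FakeVariable:
    """Stand-in for the objects returned by dr.list_available_variables()."""

    name: str
    tags: Dict[str, Any] = field(default_factory=dict)


def index_by_name(variables: Iterable[FakeVariable]) -> Dict[str, FakeVariable]:
    """Build a name -> node mapping so later lookups need no iteration."""
    return {variable.name: variable for variable in variables}


variables = [
    FakeVariable("unique_stargazers"),
    FakeVariable(
        "save.unique_stargazers",
        tags={
            "hamilton.data_saver": True,
            "hamilton.data_saver.sink": "csv",
            # Tag proposed in this issue, not a guaranteed part of Hamilton.
            "hamilton.data_saver.kwargs": {"path": "unique_stargazers.csv"},
        },
    ),
]

nodes = index_by_name(variables)
node = nodes["save.unique_stargazers"]
if node.tags.get("hamilton.data_saver.sink") == "csv":
    print(node.tags.get("hamilton.data_saver.kwargs"))
```

The index is built once, so each subsequent lookup is a plain dictionary access.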

elijahbenizzy commented 5 days ago

OK, chiming in: I think that this makes sense. Having non-iteration access is good; mind opening another issue on that?

For this, I think it makes sense to add these as "attributes" and mix them in with this concept: #1129.

Then we can attach the kwargs (as you did). These will be the non-resolved kwargs (e.g. with source still in them). We can probably also attach the same information at runtime via metadata, e.g. by adding a materializer_metadata field to the materialized metadata for everything, containing all the kwargs we have.
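To illustrate the difference between kwargs before and after resolution, here is a rough sketch. `Source` and `resolve_kwargs` are hypothetical stand-ins written for this example and are not Hamilton's implementation; the `output_path` input and the `sep` argument are likewise made up. The point is only that a tag attached at build time could hold placeholders, while the concrete values are known at runtime.

```python
from dataclasses import dataclass
from typing import Any, Dict


@dataclass(frozen=True)
class Source:
    """Hypothetical placeholder: 'take this value from the named input at runtime'."""

    name: str


def resolve_kwargs(kwargs: Dict[str, Any], runtime_values: Dict[str, Any]) -> Dict[str, Any]:
    """Replace Source placeholders with runtime values; keep literals unchanged."""
    return {
        key: runtime_values[value.name] if isinstance(value, Source) else value
        for key, value in kwargs.items()
    }


# What a build-time tag could contain: one placeholder, one literal.
unresolved = {"path": Source("output_path"), "sep": ","}

# What runtime metadata could contain once inputs are known.
resolved = resolve_kwargs(unresolved, {"output_path": "unique_stargazers.csv"})

print(unresolved)
print(resolved)
```

Under this reading, the tags would show that `path` depends on an input named `output_path`, and the runtime metadata would show the actual file path.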

Riezebos commented 5 days ago

Regarding the non-iteration access, I created another issue: #1138