DAGWorks-Inc / hamilton

Hamilton helps data scientists and engineers define testable, modular, self-documenting dataflows that encode lineage/tracing and metadata. Runs and scales everywhere Python does.
https://hamilton.dagworks.io/en/latest/
BSD 3-Clause Clear License

Autogenerate argparser to execute Driver from command line #448

Closed zilto closed 5 months ago

zilto commented 11 months ago

Is your feature request related to a problem? Please describe.

When deploying scripts in SageMaker or VertexAI, configuration needs to be passed via the CLI, typically with argparse. This requires writing tedious argparse code, which often ends up poorly typed, documented, and maintained.

Describe the solution you'd like

Automatically generate an argparse parser (or an OmegaConf/Hydra config) for the Hamilton nodes of the instantiated driver. Each argument could include the node's type and docstring (a docstring is only available when the node is not a top-level input). It is possible to resolve which values belong in inputs and which in overrides. final_vars could also be specified.

import argparse

class Driver:
    ...

    def with_cli(self):
        # Proposed method: expose every node in the graph as an optional flag.
        parser = argparse.ArgumentParser(prog="HamiltonCLI", description="Generated CLI")
        for n in self.graph.get_nodes():
            parser.add_argument(f"--{n.name}")

        self.args = parser.parse_args()

    def resolve_args(self):
        # resolve_node_value and resolve_action are helpers that would still need to be written.
        inputs, overrides = resolve_node_value(self.args)
        action, kwargs = resolve_action(self.args)  # visualize, execute

        if action == "execute":
            self.execute(inputs=inputs, overrides=overrides, **kwargs)

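from hamilton import driver
import transforms  # user module containing the Hamilton functions

if __name__ == "__main__":
    dr = (
        driver.Builder()
        .with_modules(transforms)
        .with_cli()  # proposed; could include arguments to limit supported operations (e.g., execute only)
        .build()
    )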
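To make the typed, documented flags more concrete, here is a minimal standard-library sketch that does not touch Hamilton itself. It only assumes the Hamilton convention that a function defines a node named after it and that its parameters name its dependencies. `build_parser`, `split_args`, and the two example functions are hypothetical names used for illustration, not part of any existing API.

```python
import argparse
import inspect

# Only types that can be written on a command line get a flag in this sketch.
PRIMITIVES = (int, float, str)


def build_parser(functions):
    """Create one optional, typed flag per primitive node."""
    parser = argparse.ArgumentParser(prog="HamiltonCLI", description="Generated CLI")
    node_names = {fn.__name__ for fn in functions}
    added = set()

    def add(name, annotation, help_text):
        if name in added or annotation not in PRIMITIVES:
            return
        # argparse treats "%" in help text as a format character, so escape it.
        parser.add_argument(
            f"--{name}", type=annotation, default=None, help=help_text.replace("%", "%%")
        )
        added.add(name)

    for fn in functions:
        sig = inspect.signature(fn)
        # Function nodes can be overridden; their docstring becomes the help text.
        add(fn.__name__, sig.return_annotation, f"override: {inspect.getdoc(fn) or ''}")
        for param in sig.parameters.values():
            # Parameters produced by no function are top-level inputs (no docstring).
            if param.name not in node_names:
                add(param.name, param.annotation, "input")
    return parser


def split_args(args, functions):
    """Split the parsed values into (inputs, overrides); unset flags are dropped."""
    node_names = {fn.__name__ for fn in functions}
    given = {k: v for k, v in vars(args).items() if v is not None}
    inputs = {k: v for k, v in given.items() if k not in node_names}
    overrides = {k: v for k, v in given.items() if k in node_names}
    return inputs, overrides


def scaled(raw_value: float, factor: int) -> float:
    """raw_value multiplied by factor."""
    return raw_value * factor


def report(scaled: float) -> str:
    """Formatted result."""
    return f"{scaled:.1f}"


functions = [scaled, report]
parser = build_parser(functions)
args = parser.parse_args(["--raw_value", "2.5", "--factor", "3", "--scaled", "7.0"])
inputs, overrides = split_args(args, functions)
print(inputs)     # {'raw_value': 2.5, 'factor': 3}
print(overrides)  # {'scaled': 7.0}
```

The two dictionaries have the shape that the `inputs=` and `overrides=` arguments of `execute` expect. A real implementation would presumably read names, types, and docstrings from the driver's graph rather than from raw functions.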

Alternative

A simpler and more explicit approach could be to pass two lists of strings naming the input and override nodes to expose. This prevents the CLI from being flooded with irrelevant arguments.

if __name__ == "__main__":
    dr = (
        driver.Builder()
        .with_modules(transforms)
        .with_cli(inputs=list(), overrides=list())
        .build()
    )

The supported nodes would be limited to primitives that can be expressed on the command line. One challenge is properly coercing the arguments, which are all strings, into the correct Hamilton types. This could be done efficiently with Pydantic.
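The coercion problem can be illustrated with the standard library alone. The sketch below is a stand-in for what Pydantic would do, not a use of Pydantic; `parse_bool`, `CONVERTERS`, and `coerce_args` are hypothetical names. Booleans need special care because `bool("false")` is `True` in Python.

```python
def parse_bool(text: str) -> bool:
    """Parse common spellings of a boolean; bool("false") would wrongly be True."""
    lowered = text.strip().lower()
    if lowered in {"1", "true", "yes", "y", "on"}:
        return True
    if lowered in {"0", "false", "no", "n", "off"}:
        return False
    raise ValueError(f"expected a boolean, got {text!r}")


# Map each supported node type to a function that parses it from a string.
CONVERTERS = {int: int, float: float, str: str, bool: parse_bool}


def coerce_args(raw, node_types):
    """Convert raw string arguments to the type declared for each node."""
    coerced = {}
    for name, text in raw.items():
        target = node_types[name]
        if target not in CONVERTERS:
            raise TypeError(f"--{name}: {target!r} cannot be expressed on the command line")
        try:
            coerced[name] = CONVERTERS[target](text)
        except ValueError as exc:
            raise ValueError(f"--{name}: cannot parse {text!r} as {target.__name__}") from exc
    return coerced


node_types = {"n_rows": int, "rate": float, "verbose": bool, "label": str}
raw = {"n_rows": "3", "rate": "0.5", "verbose": "false", "label": "run-1"}
print(coerce_args(raw, node_types))
```

Rejecting unsupported types up front keeps the generated CLI limited to primitives, as suggested above.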

skrawcz commented 11 months ago

yep -- https://typer.tiangolo.com/ could perhaps make this simpler?

zilto commented 11 months ago

Another clean pattern could be to decorate functions with @cli and have .with_cli() collect them when building the driver. However, it wouldn't be possible to annotate top-level nodes this way.
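A rough sketch of how such a decorator and the collection step might look, using only the standard library. The `cli` decorator, the `__cli__` attribute, and `collect_cli_nodes` are all hypothetical; nothing here is an existing Hamilton API.

```python
import inspect
import types


def cli(fn=None, *, help=None):
    """Mark a function as exposed on the CLI; usable as @cli or @cli(help=...)."""

    def mark(func):
        # Store the metadata on the function itself so it can be found later.
        func.__cli__ = {"help": help or inspect.getdoc(func) or ""}
        return func

    return mark(fn) if fn is not None else mark


def collect_cli_nodes(module):
    """Return {function name: metadata} for every @cli function in a module."""
    return {
        name: obj.__cli__
        for name, obj in inspect.getmembers(module, inspect.isfunction)
        if hasattr(obj, "__cli__")
    }


@cli
def threshold(base_threshold: float) -> float:
    """Threshold applied to the scores."""
    return base_threshold * 2


@cli(help="Number of rows to keep")
def n_rows(raw_rows: int) -> int:
    return raw_rows


def internal_step(n_rows: int) -> int:
    return n_rows + 1


# Stand-in for a user module such as `transforms`.
transforms = types.ModuleType("transforms")
for fn in (threshold, n_rows, internal_step):
    setattr(transforms, fn.__name__, fn)

print(collect_cli_nodes(transforms))
```

The sketch also shows the limitation: `base_threshold` and `raw_rows` are only parameters, so there is no function to attach `@cli` to.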

skrawcz commented 11 months ago

> Another clean pattern could be to decorate functions with @cli and then .with_cli() collects that when building the driver. However, it wouldn't be possible to annotate top level nodes

Yeah, without instantiating a driver and knowing the requested outputs, we wouldn't know what they are. But that doesn't mean we couldn't have something dynamic... or alternatively, we could just have a command that generates a CLI file for a given driver setup...
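As an illustration of the "generate a CLI file" idea, here is a small sketch that renders a standalone argparse script as text. `render_cli_script` and its parameters are hypothetical, and the set of inputs is passed in by hand rather than derived from a driver; the generated script assumes Hamilton's `driver.Builder().with_modules(...)` and `execute(final_vars, inputs=...)`.

```python
def render_cli_script(module_name, inputs, final_vars):
    """Return the source of a CLI script for one module and one set of outputs.

    `inputs` maps each top-level input name to int, float, or str.
    """
    lines = [
        "import argparse",
        "import importlib",
        "",
        "from hamilton import driver",
        "",
        "",
        "def main():",
        '    parser = argparse.ArgumentParser(description="Generated CLI")',
    ]
    for name, type_ in inputs.items():
        # One required, typed flag per top-level input.
        lines.append(
            f'    parser.add_argument("--{name}", type={type_.__name__}, required=True)'
        )
    lines += [
        "    args = parser.parse_args()",
        f"    module = importlib.import_module({module_name!r})",
        "    dr = driver.Builder().with_modules(module).build()",
        f"    print(dr.execute({list(final_vars)!r}, inputs=vars(args)))",
        "",
        "",
        'if __name__ == "__main__":',
        "    main()",
    ]
    return "\n".join(lines) + "\n"


source = render_cli_script("transforms", {"raw_value": float, "factor": int}, ["report"])
compile(source, "generated_cli.py", "exec")  # check that the generated file is valid Python
print(source)
```

Writing the result to a file gives a script that can be reviewed and versioned like hand-written argparse code, at the cost of regenerating it whenever the dataflow changes.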