DAGWorks-Inc / hamilton

Hamilton helps data scientists and engineers define testable, modular, self-documenting dataflows, that encode lineage/tracing and metadata. Runs and scales everywhere python does.
https://hamilton.dagworks.io/en/latest/
BSD 3-Clause Clear License
1.82k stars 123 forks

Documentation: add user guide on "configuration" + Hamilton #1047

Open skrawcz opened 3 months ago

skrawcz commented 3 months ago

**Is your feature request related to a problem? Please describe.**

At some point, configuration, i.e. data that shapes the dataflow or is used as input to the dataflow, needs to get to Hamilton somehow.

In some circles this is a YAML file, in others it's code, in others it's command line arguments, etc.

We don't have a prescribed path to do this -- users would like our thoughts here.

**Describe the solution you'd like**

We should have a user guide on this that includes an example.

It should cover the following scenario that I think covers 80% of cases:

  1. Loading configuration for a staging vs production environment. e.g. table names, DB connection info.

There are a few ways to do this, so we should show:

  1. how to do it with modules & code.
  2. how to do it by reading some YAML file.
  3. Some decorator constructs that can help.

This should then link to an example.
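As a starting point for option 1 (modules & code), here is a minimal sketch of keeping staging and production settings in plain Python and selecting one by name. Everything in it is hypothetical and for illustration only: `EnvConfig`, `load_config`, the table names, and the hosts are not part of Hamilton. The resulting dict is the kind of value that could then be handed to Hamilton as configuration or inputs.

```python
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class EnvConfig:
    """Settings that differ between environments (hypothetical fields)."""

    env: str
    events_table: str
    db_host: str


# One object per environment, defined as code so it is reviewed and versioned.
STAGING = EnvConfig(env="staging", events_table="events_staging", db_host="staging-db.internal")
PRODUCTION = EnvConfig(env="production", events_table="events", db_host="prod-db.internal")

_CONFIGS = {c.env: c for c in (STAGING, PRODUCTION)}


def load_config(env: str) -> dict:
    """Return the settings for `env` as a plain dict, failing loudly on typos."""
    try:
        return asdict(_CONFIGS[env])
    except KeyError:
        raise ValueError(
            f"unknown environment {env!r}; expected one of {sorted(_CONFIGS)}"
        ) from None


if __name__ == "__main__":
    # The dict could be passed on to the dataflow as config or inputs.
    print(load_config("staging"))
```

Keeping the settings as frozen dataclasses means a misspelled field fails at import time rather than midway through a run, which is one argument for the "configuration as code" approach.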

**Describe alternatives you've considered**

N/A

Additional context

Dev-iL commented 2 months ago

I might be able to help with point (2.). Two questions:

  1. Is it ok to add another dependency (ruamel-yaml)?
  2. What are your plans regarding moving to pyproject.toml? I can take care of that if you want.

ChatGPT overview of configuration files vs CLI vs environment variables

When configuring software for different types of runs, the choice between static configuration files, CLI (Command Line Interface) arguments, and environment variables depends on factors like flexibility, ease of use, security, and the environment in which the software operates. Here's an overview of the use cases for each:

### 1. **Static Configuration Files**

- **Use Cases:**
  - **Complex or Large Configurations**: When the software requires detailed, hierarchical, or extensive configuration, static files (e.g., YAML, JSON, INI) provide a structured way to manage settings.
  - **Consistency Across Runs**: Static configuration files are ideal when you need to ensure that the same settings are applied consistently across multiple runs or instances.
  - **Version Control**: Configuration files can be version-controlled, allowing you to track changes, roll back to previous configurations, and ensure that configurations match the codebase.
  - **Environment-Specific Settings**: Files can be environment-specific, such as `config.prod.json` for production and `config.dev.json` for development, allowing different configurations based on deployment environments.
  - **Documentation**: Configuration files can be self-documenting, with comments or structure making it easier to understand and modify.
- **Pros**:
  - Easily managed and shared within teams.
  - Supports complex configurations.
  - Good for long-term stability and consistency.
- **Cons**:
  - Less flexible for quick changes.
  - May require more setup and maintenance.

### 2. **CLI Arguments**

- **Use Cases:**
  - **Ad-Hoc or One-Off Runs**: Ideal for temporary or one-time runs where you need to quickly modify parameters without altering configuration files.
  - **Scripting and Automation**: When integrating the software into scripts or CI/CD pipelines, CLI arguments provide a convenient way to pass parameters dynamically.
  - **Quick Overrides**: CLI arguments are useful when you need to override configuration values specified in files or environment variables for a specific run.
  - **Interactive Use**: When users interact with the software manually and need to specify parameters directly, CLI arguments offer an intuitive interface.
- **Pros**:
  - High flexibility for quick, temporary changes.
  - Can easily override other configuration sources.
  - Convenient for automation and scripting.
- **Cons**:
  - Can become unwieldy for complex configurations.
  - Not ideal for long-term or consistent settings.

### 3. **Environment Variables**

- **Use Cases:**
  - **Sensitive Data**: Storing sensitive information like API keys, passwords, or tokens in environment variables is often safer than embedding them in configuration files or passing them via CLI arguments, especially in containerized environments.
  - **Portability Across Environments**: Environment variables are useful when deploying software across different environments (e.g., dev, staging, production) where the same codebase is used but different configurations are needed.
  - **Configuration Management**: In cloud-native or containerized environments (e.g., Docker, Kubernetes), environment variables are commonly used for configuration management, allowing easy changes without modifying code or files.
  - **CI/CD Pipelines**: Environment variables are often used in CI/CD pipelines to inject configuration values at runtime.
- **Pros**:
  - Secure handling of sensitive data.
  - Easily modified without changing code or files.
  - Works well with containerization and cloud deployments.
- **Cons**:
  - Less visible and harder to document compared to configuration files.
  - Can lead to confusion if too many variables are used or if naming conventions are inconsistent.

### **Summary**

- **Static Configuration Files** are best for complex, consistent, and long-term configurations that benefit from structure and version control.
- **CLI Arguments** offer flexibility for quick, one-off changes and are excellent for automation and scripting.
- **Environment Variables** are ideal for handling sensitive data and managing configurations across different environments, especially in cloud-native and containerized setups.

In many cases, a combination of these methods is used to balance flexibility, security, and maintainability, with priority typically given in the order of CLI arguments > environment variables > configuration files for overrides.

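The override order mentioned in that overview (CLI arguments > environment variables > configuration files) can be sketched with the standard library alone. This is an illustration under stated assumptions, not a recommended Hamilton pattern: the `APP_` prefix, the `env` and `table` keys, the defaults, and the `resolve_config` helper are all made up for the example, and JSON stands in for YAML to avoid an extra dependency.

```python
import argparse
import json
from typing import Mapping, Optional, Sequence

# Hypothetical settings and defaults, used only for this sketch.
DEFAULTS = {"env": "staging", "table": "events_staging"}


def resolve_config(
    argv: Sequence[str],
    environ: Mapping[str, str],
    file_text: Optional[str] = None,
) -> dict:
    """Merge settings so that CLI beats env vars, which beat the file."""
    config = dict(DEFAULTS)

    # Lowest priority: a static configuration file (JSON text here).
    if file_text:
        config.update(json.loads(file_text))

    # Middle priority: environment variables such as APP_ENV, APP_TABLE.
    for key in DEFAULTS:
        value = environ.get(f"APP_{key.upper()}")
        if value is not None:
            config[key] = value

    # Highest priority: command-line flags such as --env, --table.
    parser = argparse.ArgumentParser()
    for key in DEFAULTS:
        parser.add_argument(f"--{key}", default=None)
    for key, value in vars(parser.parse_args(list(argv))).items():
        if value is not None:
            config[key] = value

    return config


if __name__ == "__main__":
    import os
    import sys

    print(resolve_config(sys.argv[1:], os.environ))
```

Passing `argv` and `environ` in as arguments, rather than reading `sys.argv` and `os.environ` inside the function, keeps the merge logic easy to unit test.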
Riezebos commented 1 month ago

It doesn't have to be config files vs CLI vs environment variables; it can be all of the above as well.

In a previous project I used pydantic to manage configuration. Settings management used to be built into the package; now it is a separate plugin: https://docs.pydantic.dev/latest/concepts/pydantic_settings/

With pydantic-settings you can define a pydantic model for your settings, the values for the settings can automatically be read from:

You can define which settings sources you want to use and what should be their priority.

This is less simple than just reading env vars from os.environ but might be interesting to use as a more advanced case? I'm currently integrating Hamilton into a project that uses this, so I could create an example for the examples folder in this repo if you want.

skrawcz commented 1 month ago

> I might be able to help with point (2.). Two questions:
>
>   1. Is it ok to add another dependency (ruamel-yaml)?
>   2. What are your plans regarding moving to pyproject.toml? I can take care of that if you want.

Sorry @Dev-iL, I forgot to respond here. For (1), we want Hamilton to be dependency-light, so if anything, an optional target is fine. For (2), thanks for going ahead and doing that! 🙇

> This is less simple than just reading env vars from os.environ but might be interesting to use as a more advanced case? I'm currently integrating hamilton into a project that uses this, I could create an example for the examples folder in this repo if you want.

@Riezebos Sure, yes please. That would help ground the conversation here.

Note: I expect some differing opinions on approach here, reflecting individual concerns and preferences. E.g. if you treat configuration like code, you should just make it Python code to simplify things...