PecanProject / pecan

The Predictive Ecosystem Analyzer (PEcAn) is an integrated ecological bioinformatics toolbox.
www.pecanproject.org
Other
202 stars 234 forks source link

Provenance tracking when settings come from environment variables #2635

Open infotroph opened 4 years ago

infotroph commented 4 years ago

Description

There are now a couple of places in the PEcAn code where environment variables can override the values recorded in settings.xml. This is often very useful, but has high potential to cause confusion and dilutes the principle that what you see in the settings file is the behavior you'll get. I'd like us to have a unified system for ensuring that any behavior altered by environment variables is predictable to minimize runtime confusion and reliably recorded as part of the provenance chain to minimize confusion when reasoning about runs after the fact.

Proposed solution

  1. Establish a defined hierarchy of the sources a setting can come from and document which sources override which others.
  2. Any time a parameter is changed by an environment variable, record the value change somewhere permanent. I don't know exactly what this should look like, but maybe it could be as simple as a logger.info line.
  3. Adopt a coding practice of keeping the effects of env vars localized. Ideally, any time you're reading a chunk of code whose behavior might change because of an env var, the fact that an env var was consulted ought to be visible rather than hidden a few layers away.

Additional Context

The main place this matters today is in PEcAn.DB::get_postgres_envvars, which was implemented in #2541 and briefly discussed in #2632. Important points from those threads:

@infotroph in #2541:

The big ol' downside: This function flagrantly breaks PEcAn's usual rule that the values in settings.xml always win over any conflicting defaults, and it does it in a way (environment variables) that's famously nontransparent and hard to debug unless you know what you're looking for. I expect we will not want to use this function very much outside of tests, but for tests I don't see any other good options for balancing reusable settings files against machine-specific database configurations -- I'm open to alternate proposals if anyone has them.

@infotroph in #2632 (comment):

A follow-up thought that might be a separate PR: I know that overriding defaults is the entire reason I implemented get_postgres_envvars in the first place, but from a provenance-tracking perspective I'm still not thrilled that it lets us silently override values that are present in the settings. Should we consider making get_postgres_envvars log the values it assigns? If so, further considerations include how to avoid leaking secrets and whether logging on every call is too much log spam.

@robkooper in #2632 (comment):

I was thinking as well about the order of things. It be nice if there was some order that we want to use and if a variable is defined in one place that will be the value that sticks. For example docker-compose has a very explicit listing of how they find a variable:

When you set the same environment variable in multiple files, here’s the priority used by docker-compose to choose which value to use:

  1. Compose file
  2. Shell environment variables
  3. Environment file
  4. Dockerfile
  5. Variable is not defined
github-actions[bot] commented 3 years ago

This issue is stale because it has been open 365 days with no activity.