[FEATURE] Config through environment variables

ckoerber commented 3 years ago

Is your feature request related to a problem? Please describe.

It might be useful to present the information present in the db-config.yaml and settings.yaml at least partially through environment variables.

This would simplify deploying on cloud providers (e.g., now one has to have a persistent volume up and create such files before one can run, e.g., tests)
This could make it easier for new projects to run from notebooks without providing a -e flag in the pip install
This could make deployments more secure as private information are not stored in files

Describe the solution you'd like

Have a clear priority order for checking if setting files exist and/or respective environment variables exist.
Highest priority have the DB config and secure keys. The rest may also live in files. However, it would be nice if there is a secure way to parse all possible environment variables without explicitly defining them. If environment variable has the proper name, it should be a project setting.
Improve error messages when not finding settings and how to resolve this (e.g., check if the module path is the python path and so on).
The variable namespace should be adjusted to the project name to not have conflicts between different EspressoDB projects (if virtual environments are not used). For example ESPRESSODB_{MYPROJECT}_DB_USER or maybe just {MYPROJECT}_DB_USER?

Questions:

Should we first look for files or environment variables?

ckoerber commented 3 years ago

Since all of the questions may depend on the preference of users, it seems to me the cleanest solution is to make them (backwards compatible) choices in the config parser functions.

For example, currently,

def get_project_settings(root_dir: str)

will find the project settings. In addition, one could provide

def configure_settings(use=("db-config", "environement"), fail_on_duplicate: bool = True, **kwargs) -> dict:
   options = dict()
   for source in reverse(use): # to ensure priority
      if source == "db-config":
          root_dir = kwargs.get("root_dir", ...)
          # try blocks?
          temp = get_project_settings(root_dir)
      elif source == "environement":
          temp = get_project_settings_environment(**kwargs)
      else:
          raise ValueError("Did not understand config source...")
     # Maybe some duplicate entry warnings depending on fail_on_duplicate
     options.update(temp)
  return options

and then also define a get_project_settings_environment method of the form

def get_project_settings_environment(
    prefix: str = "{poject_name}",
    load_variables: Optional[List[str]] = ("SECRET_KEY", "DEBUG")
):
     """
     prefix: prefix for environment variable names to be loaded in
     load_variables: load only selected variables if list or all starting with prefix if None
     """
     keys = [key for key in os.environ if key.startswith("prefix")]
     if load_variables is not None:
          keys = [key for key in keys if key.replace(prefix, "") in load_variables]
     options = {key.replace(prefix, ""): os.environ[key] for key in keys}
     # Some type parsing because env vars only allow strings
     return options

In this context, renaming get_project_settings to get_project_settings_yaml would also make sense, however, we need to keep the old one to not break compatibility.

Last but not least, newly created projects should use the new mechanic (also inserting provided dicts into the local context (same duplicate warning system)) and this feature should be well documented.

cchang5 commented 3 years ago

Alternatively, (or in addition) is it also possible to actually deploy the db from scratch. So if i have somehow defined a schema for the tables i want in a text file or json, that i only need to check in the schema definitions into git. e.g.

app0_name:
    - table0_name: {column1: {type: IntegerField, other_args: values}}
    - table1_name: {}....
app1_name:
    - table0_name: ...

then when i deploy, it'll run all the startproject and startapp commands. if we do this instead, then maybe what we have now will work already.

ckoerber commented 3 years ago

I have the impression that this request is an additional abstraction of the project creation / update and less a deployment option (which may be different on different machines). I.e., the above snippet is a yaml representation of the models.py. Does this capture the spirit of what you requested?

If so, I believe this rather belongs to #65 and can be viewed as a preliminary step (and we should shift the discussion there).

cchang5 commented 3 years ago

So the specific use case is that we're writing this thing that in principle will be handed off to multiple different groups in order to collect data. We want them to be able to simply pip install the library (with espressodb and the database being a dependency), and just run the high level script w/o additional tinkering. However pushing the entire folder spawned by Django feels like not a very elegant solution.

This also mirrors what a unittest looks like during a pull request. You want to check if you can spawn the defined database, and insert test cases into it as part of the workflow. So it is in general for any project pretty useful I think.

cchang5 commented 3 years ago

inside settings.py i made the change

# Database
# https://docs.djangoproject.com/en/3.2/ref/settings/#databases
DB_CONFIG = get_db_config(ROOT_DIR)
DB_CONFIG["NAME"] = f"{ROOT_DIR}/{DB_CONFIG['NAME']}"
DATABASES = {"default": DB_CONFIG}

and inside db-config.yaml i used

ENGINE: django.db.backends.sqlite3
NAME: name-db.sqlite

And this dynamically linked the sqlite3 to the project root directory. This way when anyone else pulls the code, i think it should run without hacking db-config.yaml

cchang5 commented 3 years ago

Here is a slight upgrade

# Database
# https://docs.djangoproject.com/en/3.2/ref/settings/#databases
DB_CONFIG = get_db_config(ROOT_DIR)
DB_DIR = os.environ.get("DB_DIR", ROOT_DIR)
DB_CONFIG["NAME"] = f"{DB_DIR}/{DB_CONFIG['NAME']}"
DATABASES = {"default": DB_CONFIG}

It'll either take DB_DIR as an environment variable, or default to ROOT_DIR.

ckoerber commented 3 years ago

I can see merit to the prepending DB_DIR idea as a default. The thing which may be confusing though is that the YAML takes Django defaults while this one prepends to the default. So if you give it a static path, this would modify it to a probably non-existent path. In this sense, I find it more applicable to add it to the doc as an example on how to modify it and users could choose to do so if it benefits there work.

So the specific use case is that we're writing this thing that in principle will be handed off to multiple different groups in order to collect data. We want them to be able to simply pip install the library (with espressodb and the database being a dependency), and just run the high level script w/o additional tinkering.

Would it also work (similar to your second update) if you specify the entire DB name via environment variables? E.g., the entire setup on a remote would look like

pip install git+https://github.com/repo/myproject#master@version
export {MYPROJECT}_DB_NAME=/abs/path/to/db.sqlite
export ... some other essential variables which are not hard coded
myproject-binary shell

The hard coded variables could be directly inserted in the settings.py file and will not change on different machines.

ckoerber commented 3 years ago

I have started to add these features on the environment-vars branch. This is preliminary and may change a tiny bit before being incorporated in the next version.

In short, the new functionality allows to modify the settings.py (espressodb startproject will default to the new layout) according to

# myproject/config/settings.py
...
from espressodb.management.utilities import set_db_config, validate_db_config
...
DATABASES = dict()
set_db_config(
    source="environment",
    context=DATABASES,
    database="default",
    environment_prefix="MYPROJECT_DB_"
)
validate_db_config(DATABASES)

This will now cause EspressoDB to parse the DB config from environment variables for your project. I.e., if you want to run a SQLITE DB, you should export this according to

export MYPROJECT_DB_NAME="/path/to/db.sqlite3"
export MYPROJECT_DB_ENGINE="django.db.backends.sqlite3"

It is also generally possible to hard code the engine in such a scenario, i.e.,

set_db_config(
    source="environment",
    context=DATABASES,
    database="default",
    environment_prefix="MYPROJECT_DB_"
)
DATABASES["default"]["ENGINE"] = "django.db.backends.sqlite3"
validate_db_config(DATABASES)

I still need to think a bit about ordering commands (in principle this allows to set up split file/environment setups), best practice, and potential security vulnerabilities (i.e., for other settings, I only allowed explicitly defined options). Also I will add more docs and tests.

Let me know if this helps.

callat-qcd / espressodb