geosolutions-it / evo-odas

Code Repository for the EVO-ODAS
https://waffle.io/geosolutions-it/evo-odas
MIT License

Conventions for DAG names and Configuration Keys #118

Closed: randomorder closed this issue 7 years ago

randomorder commented 7 years ago

As per points 2 and 3 in the email on the EVO-ODAS mailing list ("Airflow Ingestion Review", summed up below), we should adopt conventions for DAG names and configuration keys:

DAG Naming Convention: {COLLECTIONID}_{NAME}, e.g. S2_MSI_L1C_Download, so that all DAGs belonging to a collection show up grouped together in the Airflow GUI

Configuration Key Naming Convention: configuration keys need to be prefixed, and abstracted in some cases.

Eventually, should we create a "config" directory that contains all the configuration files and secrets? Like:

airflow/config/common.py
airflow/config/secrets.py
airflow/config/sentinel1.py
airflow/config/sentinel2.py
ricardogsilva commented 7 years ago

@randomorder @simboss

This is a follow up on the discussion on #161:

Regarding the configuration, how about reusing the airflow.cfg file that already houses Airflow's own config, and storing our custom config there too?

Airflow already features an API for getting configuration from its config file and we can easily hook into it too:

# file: airflow.cfg

# all other airflow configs go here...
[my_section]
config_var1 = some_value
config_var2 = another

# file: elsewhere, wherever we need it
from airflow import configuration

var1 = configuration.get("my_section", "config_var1")

Airflow even features a nice way to access environment variables through this config API too. This can be used to store secrets in the environment, as proposed in #159:

# define secrets in the environment, directly using EXPORT
export AIRFLOW__MY_SECTION__SOME_SECRET_PASSWORD=dummy

# or by keeping an .env file that is sourced and exported
set -o allexport; source my-top-secret-file.env; set +o allexport
# now retrieve secrets with airflow
from airflow import configuration

some_secret_password = configuration.get("my_section", "some_secret_password")
torse commented 7 years ago

That's actually a good approach if we don't need/want Python expressions for setting the configuration items. As I've evaluated the approach of using Python modules for configuration, I'll try to quickly explain how this would look, for the sake of comparison and discussion.

I failed using importlib, as I couldn't make it override variables in a common module namespace. Basically I ended up having a config directory with an __init__.py that looks like this:

from settings import *
from override.settings import *
from secrets import *
from override.secrets import *
from s2_msi_l1c import *
from override.s2_msi_l1c import *
...  

Then one can define all configuration keys that are required in modules like settings.py, secrets.py and s2_msi_l1c.py. The secret items have dummy/empty values, of course. There is also an override directory that is gitignored; there one can override specific values, like paths, hostnames, etc.

The config would then be usable from the DAGs like this:

...
import config as CFG
...
# DAG definition
dag = DAG(CFG.collection_id + "_Download",
    description='DAG for searching, filtering and downloading Sentinel '+CFG.collection_id+' data from DHUS server',
    default_args=default_args
)

# Sentinel 2 DHUS Search Task
search_task = DHUSSearchOperator(
    task_id='dhus_search_task',
    dhus_url=CFG.dhus_url,
    dhus_user=CFG.dhus_username,
    dhus_pass=CFG.dhus_password,
    geojson_bbox=CFG.dhus_search_bbox,
    startdate=CFG.dhus_search_startdate,
    enddate=CFG.dhus_search_enddate,
    platformname=CFG.collection_platformname,
    filename=CFG.collection_filename_filter,
    filter_max=CFG.dhus_download_max,
    order_by=CFG.dhus_search_orderby,
    dag=dag
)

In contrast to the airflow.cfg approach, I would say there are currently a few places where Python code would have advantages:

  1. Setting relative dates, e.g. in config/settings.py:
    from datetime import datetime, timedelta
    dhus_search_startdate = datetime.today() - timedelta(days=4)
  2. Defining paths to folders, e.g. in config/s2_msi_l1c.py:
    import os
    collection_id = "S2_MSI_L1C"
    collection_repository_dir = os.path.join(config.repository_base_dir, "SENTINEL/S2/", collection_id)
  3. Using service discovery and centralized password tools like vault, e.g. in the override config/override/secrets.py:

    import os
    import hvac

    # Connect to Vault via TLS
    client = hvac.Client(url=os.environ['VAULT_ADDR'], token=os.environ['VAULT_TOKEN'])
    secrets = client.read('secret/evoodas/production').get('data')

    dhus_username = secrets.get('dhus_username')
    dhus_password = secrets.get('dhus_password')


However, I am sure we can work around this with the concept of dynamic `_cmd` keys in the `airflow.cfg`, e.g.

[core]
dhus_search_startdate_cmd = 'date -d "-5days"'
dhus_username_cmd = "vault read -field=dhus_username 'secret/evoodas/production'"



Both approaches would work for me. Which do you prefer?
ricardogsilva commented 7 years ago

Personally I prefer keeping config files as simple as possible and separate from the code, much like what is described in the 12-factor app. Usually I prefer YAML files for storing config, but I think that .cfg files also work well when the configuration is made of simple values with shallow hierarchies, which seems to be the case here.

As for your remarks on the advantages of storing config in python files, IMHO I don't see them really as advantages. The examples that you provide seem like actual code and not configuration:

  1. Setting relative dates

I'd save only the delta as a config value and would calculate the actual datetime later, in the code (see the sketch after this list).

  2. Defining paths to folders

I'd maybe define just the base directory in the config and would calculate the full path in code (also covered in the sketch after this list). This does not seem like something I'd want to allow to change very often, beyond the base directory.

  3. Using service discovery and centralized password tools

Again, to me it seems like there is no need to include the code that actually retrieves the secrets from the vault inside the config file. In this case I'd load the vault's credentials from the environment and would not store them in the app config at all. I'd fetch the values from the vault inside the actual application code.
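
To make points 1 and 2 concrete, here is a minimal sketch of keeping only the delta and the base directory in a dumb config; the [evo_odas] section and key names are placeholders I made up, not an agreed convention:

# file: airflow.cfg
[evo_odas]
search_start_delta_days = 4
base_dir = /var/data

# file: DAG/plugin code
import os
from datetime import datetime, timedelta

from airflow import configuration

# compute the actual datetime in code, from the configured delta
days = configuration.getint("evo_odas", "search_start_delta_days")
startdate = datetime.today() - timedelta(days=days)

# derive the full path in code, from the configured base directory
base_dir = configuration.get("evo_odas", "base_dir")
download_dir = os.path.join(base_dir, "download", "S2_MSI_L1C")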

I'd also prefer not to use dynamic _cmd keys in the config. I'd like to keep code out of config as much as possible. I prefer dumb config files: they are easier to read, easier to write, and less prone to bugs.

Another aspect that seems cleaner with the .cfg approach is that it removes the need to create and deal with all those additional Python config files (and directories and __init__.py files), especially when using Airflow, which already does the parsing of the cfg file and the injection of the environment variables. Plus, all the from conf import * imports are usually considered a bad pattern in Python, because you lose track of where things are being imported from and also risk name clashes.

All that said, I guess I am really supporting the 'store config in airflow.cfg and keep config as dumb as possible' option. However, I'll be OK working with other alternatives too, if that is the way the decision goes. :)

randomorder commented 7 years ago

@ricardogsilva @torse

I've not tested either of the two approaches myself, but here are my thoughts on this. Separating config from the code is nice as long as it is simple and flexible. As you both pointed out, taking the Python approach has pros and cons, but I like the flexibility it provides. I don't think name clashing or losing track of where the keys come from is an issue as long as we use conventions for variable/object names. Also, this approach allows gluing in whatever method for retrieving the secrets the user wants (see the vault example). Regarding hierarchy, I think the ability to store the keys hierarchically really makes sense in our use case, for instance at the global level, satellite level and product level.

Both approaches make sense, but my preference is toward @torse's approach. @ricardogsilva please proceed accordingly.

torse commented 7 years ago

@ricardogsilva @randomorder

I guess in the end it is just a matter of perspective: is config.py code, and is a sourced .env file configuration? Regardless, at some point the application has to deal with "configuration items" collected from different sources. With the flexibility of the "Python config" approach one can handle this close to the application, and can even switch relatively easily to the airflow.cfg approach later if needed, without touching the DAG/plugin code (just replace e.g. resource_base_dir = '/var/data/resources/' with resource_base_dir = configuration.get("core", "resource_base_dir")).

So when separating the code and config like this, the only critical issue is the definition of the config hierarchy, the item names, and the way the application code accesses these items. So here is what I've got so far; feel free to discuss, use and improve it as needed. I can also prepare a PR if required.

airflow/dags/config/settings.py:

import os

#
# Paths
#
resource_base_dir = '/var/data/resources/'
download_base_dir = '/var/data/download/'
process_base_dir = '/var/data/process/'
regions_base_dir = '/var/data/regions/'
repository_base_dir = os.getenv('REPOSITORY_BASE_DIR','/var/data/repository/') # example on how this config can be controlled through env parameter e.g. from docker/docker-compose

#
# Connections
#
dhus_url = 'https://scihub.copernicus.eu/dhus'
dhus_username = ''
dhus_password = ''

geoserver_rest_url = 'http://localhost:8080/geoserver/rest'
geoserver_username = 'admin'
geoserver_password = ''

postgresql_dbname = 'oseo'
postgresql_hostname = 'localhost'
postgresql_port = '5432'
postgresql_username = 'postgres'
postgresql_password = ''

rsync_hostname = 'geoserver.example.com'
rsync_username = 'ec2-user'
rsync_ssh_key = '/usr/local/airflow/id_rsa'
rsync_remote_dir = ''

airflow/dags/config/override/secrets.py (override dir is gitignored):

import os
import hvac

client = hvac.Client(url=os.getenv('VAULT_ADDR'), token=os.getenv('VAULT_TOKEN'))
secrets = client.read('secret/evoodas/production').get('data')

geoserver_password = secrets.get('geoserver_password')

dhus_password = 'mysupersecretpassword'

airflow/dags/config/s2_msi_l1c.py:

from datetime import datetime, timedelta
import os
import config

#
# Collection
#
collection_id = "S2_MSI_L1C"
collection_filename_filter = "S2*_MSIL1C*"
collection_platformname = 'Sentinel-2'

collection_download_dir = os.path.join(config.download_base_dir, collection_platformname, collection_id)
collection_process_dir = os.path.join(config.process_base_dir, collection_platformname, collection_id)
collection_repository_dir = os.path.join(config.repository_base_dir, "SENTINEL/S2/", collection_id)

collection_template_eop = os.path.join(config.resource_base_dir, collection_platformname, collection_id + "_eop.json")
collection_template_iso = os.path.join(config.resource_base_dir, collection_platformname, collection_id + "_metadata.xml")
collection_template_description = os.path.join(config.resource_base_dir, collection_platformname, collection_id + "_description.html")

collection_bands_res = {'10': ("B02", "B03", "B04", "B08"), '20': ("B05", "B06", "B07", "B8A", "B11", "B12"), '60': ("B01", "B09", "B10")}
collection_disk_quota = "50g"
collection_regions = ["europe", "global"]
collection_europe_disk_quota = "25g"
collection_global_disk_quota = collection_disk_quota
...

#
# DHUS specific
#
dhus_url = 'https://code-de.org/dhus'
dhus_username = config.dhus_username
dhus_password = config.dhus_password
dhus_search_bbox = os.path.join(config.regions_base_dir,'europe.geojson')
dhus_search_filename = collection_filename_filter
dhus_search_startdate = datetime.today() - timedelta(days=4)
dhus_search_startdate = dhus_search_startdate.isoformat() + 'Z'
dhus_search_enddate = datetime.now().isoformat() + 'Z'
dhus_search_orderby = '-ingestiondate,+cloudcoverpercentage'
dhus_download_max = 1
dhus_download_dir = collection_download_dir

airflow/dags/config/__init__.py:

from settings import *
try:
    from override.settings import *
    from override.secrets import *
except ImportError:
    pass

from s2_msi_l1c import *
try:
    from override.s2_msi_l1c import *
except ImportError:
    pass

from s1_grd import *
try:
    from override.s1_grd import *
except ImportError:
    pass

With the above __init__.py the config can be used in DAGs/plugins like this:

from config import settings as CORE
from config import s2_msi_l1c as S2

print(CORE.dhus_username)
print(S2.collection_id)

OR

import config as CFG

print(CFG.s2_msi_l1c.dhus_username)
print(CFG.s2_msi_l1c.collection_id)

Some Notes:

  1. I am not so happy with the setup of the __init__.py. Maybe glob the config files and iterate over them instead of the try/except blocks (see the sketch after this list)? Anything else?
  2. I've duplicated some items in the collections (e.g. dhus_username = config.dhus_username in s2_msi_l1c.py) so that the collection configuration is self-sustaining and accessible via the collection namespace, e.g. S2.dhus_username. Not sure if this is better than accessing it via CORE.dhus_username directly.
  3. It might enhance readability if we strip the "collection_" prefix from the collection configurations (S2.collection_id -> S2.id).
  4. I am not sure how to best deal with something like AOI/region-specific configuration (see collection_regions above).
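
Regarding note 1, one idea I have not tested: keep the explicit ordering, but replace the repeated try/except blocks with a loop, using importlib plus an explicit globals() update (which is the namespace-override part I originally failed to get from importlib alone):

import importlib

def _star_import(name):
    # emulate "from <name> import *" into this package's namespace
    module = importlib.import_module(name, package=__name__)
    globals().update({key: value for key, value in vars(module).items()
                      if not key.startswith('_')})

for _name in ('.settings', '.secrets', '.s2_msi_l1c', '.s1_grd'):
    _star_import(_name)
    try:
        _star_import('.override' + _name)  # gitignored overrides win
    except ImportError:
        pass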

Opinions? :)

ricardogsilva commented 7 years ago

@torse @randomorder

I'm still not convinced, but I'll happily go with the chosen approach of storing the config in python files then :)

Regarding @torse's proposal:

TL;DR: I think there are too many config keys and too many config files - we could simplify things.

In my opinion there are currently too many config keys. I'd prefer if we could cut a few of them in order to make things easier to manage:

resource_base_dir = '/var/data/resources/'
download_base_dir = '/var/data/download/'
process_base_dir = '/var/data/process/'
regions_base_dir = '/var/data/regions/'

How about a single base_dir setting? The others would be derived from this one, inside the code without needing to expose their values to the config?

dhus_url = 'https://scihub.copernicus.eu/dhus'
dhus_username = ''
dhus_password = ''

geoserver_rest_url = 'http://localhost:8080/geoserver/rest'
geoserver_username = 'admin'

How about using Relative URLs as defined by RFC 1808, which would make them easy to parse with urlparse?

dhus_url = "https://{user}:{password}@scihub.copernicus.eu/dhus"
geoserver_rest_url = "http://admin@localhost:8080/geoserver/rest"
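
For illustration, a quick sketch of how easily such URLs split apart again with urlparse (nothing repo-specific assumed here):

try:
    from urllib.parse import urlparse  # Python 3
except ImportError:
    from urlparse import urlparse  # Python 2

url = urlparse("https://user:password@scihub.copernicus.eu/dhus")
print(url.username)  # user
print(url.password)  # password
print(url.hostname)  # scihub.copernicus.eu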

geoserver_password = ''

postgresql_dbname = 'oseo'
postgresql_hostname = 'localhost'
postgresql_port = '5432'
postgresql_username = 'postgres'
postgresql_password = ''

How about using SQLAlchemy engine URLs for defining database connections? This would mean having just a single setting:

database = "postgresql://{user}:{password}@{host}:{port}/{db}"
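
When the individual values are needed, such a URL can be split back into its parts with SQLAlchemy's make_url (a sketch; SQLAlchemy is available since Airflow already depends on it):

from sqlalchemy.engine.url import make_url

url = make_url("postgresql://postgres:secret@localhost:5432/oseo")
print(url.username)  # postgres
print(url.password)  # secret
print(url.host)      # localhost
print(url.port)      # 5432
print(url.database)  # oseo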

rsync_hostname = 'geoserver.example.com'
rsync_username = 'ec2-user'
rsync_ssh_key = '/usr/local/airflow/id_rsa'
rsync_remote_dir = ''

For rsync+ssh, we could instead apply the same URL rules as above plus generate a suitable ~/.ssh/config file like

# file: ~/.ssh/config
Host geoserver.example.com
    User ec2-user
    IdentityFile ~/id_rsa

and the corresponding setting would be rsync_url = ssh://geoserver.example.com/?

This would mean going from 20 config keys to 5.

As for collections config, I'd apply the same reasoning:

collection_download_dir = os.path.join(config.download_base_dir, collection_platformname, collection_id)
collection_process_dir = os.path.join(config.process_base_dir, collection_platformname, collection_id)
collection_repository_dir = os.path.join(config.repository_base_dir, "SENTINEL/S2/", collection_id)
collection_template_eop = os.path.join(config.resource_base_dir, collection_platformname, collection_id + "_eop.json")
collection_template_iso = os.path.join(config.resource_base_dir, collection_platformname, collection_id + "_metadata.xml")
collection_template_description = os.path.join(config.resource_base_dir, collection_platformname, collection_id + "_description.html")

IMHO all of these could be omitted from the config and just derived in the code; I think they don't need to be configurable. Let's provide a default value for these keys that makes sense and enforce consistency across deployments by having the same internal path structure. It also seems more natural to me to order paths according to their collection, instead of according to their role in the system. I mean, instead of having:

# downloads dir
/var/data/download
    /{platform_1}/{collection_1}
    /{platform_1}/{collection_2}
    ...
    /{platform_n}/{collection_m}

# process dir
/var/data/process
    /{platform_1}/{collection_1}
    /{platform_1}/{collection_2}
    ...
    /{platform_n}/{collection_m}

I'd maybe use:


/var/data/{collection_1}
    /downloads
    /process

...

/var/data/{collection_n}
    /downloads
    /process

Personally (from my experience working at a data production center for EO products for more than 5 years) I find it easier to navigate paths when they are oriented towards products rather than processes. Whatever the case, I'd not expose all these paths in the settings, just the base one (/var/data in this case).

Regarding your final notes:

  1. Regarding the airflow/dags/config/__init__.py file, I agree that it is ugly. Could we make things simpler?

    • Let's use a single file for all settings instead of one file per collection - that would simplify the imports. If we cut down about 3/4 of the current settings keys (as in my proposal), this should be manageable;
    • Let's not have an override directory at all. Make a distinction between installation settings and application settings:

      Secrets and settings which are dependent on the installation are installation settings - provide a means to get them from the environment

      Settings that are kept in the config file are application settings. Keep these in the repository with their intended values and use that. Whenever you change a setting that is stored in the config file, you'll need to commit this change to the repository. This means treating config as code and making it part of the release cycle, resulting in a more controlled dev -> test -> deploy workflow

  2. This would be a non-issue if we used a single settings file.
  3. Since we are going with Python config files, why not encapsulate collection settings in a dict? That would make naming the settings easier (see the sketch after this list).
  4. I've been focusing exclusively on the Landsat8 DAGs until now so I'm missing the bigger picture in order to be able to comment on this. Landsat8 AOIs seem to need a path and row though.
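
A minimal sketch of what point 3 could look like; the collection name and keys are illustrative, not a proposal for the final naming:

# file: settings.py (hypothetical single settings file)
S2_MSI_L1C = {
    "id": "S2_MSI_L1C",
    "platformname": "Sentinel-2",
    "filename_filter": "S2*_MSIL1C*",
    "download_max": 1,
}

# usage in a DAG: no "collection_" prefix needed
from settings import S2_MSI_L1C
print(S2_MSI_L1C["id"])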
torse commented 7 years ago

@ricardogsilva: Thanks for your feedback :)

TL;DR: The basic idea behind the flattened and somewhat complex configuration is to simplify the actual application code (DAGs/plugins) and to enforce environmental constraints (e.g. paths). The drawback of simplifying the config is that the complexity is pushed somewhere else. Personally I am fine with a more complex configuration layer that validates and prepares settings for the application layer, but I understand that's not everyone's preference. We need to decide whether to simplify the config or the code. Sure, ideally we do both, so let's try.. ;)

How about a single base_dir setting? The others would be derived from this one, inside the code without needing to expose their values to the config?

Right, that would make the config simpler. The reasoning/requirement behind it was that the download, process and repository directories should have dedicated partitions/volumes in the system, therefore it wouldn't be wise to let the application code choose them. BTW, that's also why there is this /var/data/download/{collection_1} path order instead of /var/data/{collection_1}/download. For me it would be fine if we reduce it to base_dir for the resources, download and process directories.

However, I would keep the repository_dir, as this is supposed to store all the original products and extracted/processed granules, which are TB++. We should also probably define a structure for the products/granules like repository_dir_pattern={root}/{mission}/{sensor}/{collection}/{year}/{month}/{day} and think about where to store the ZIPs.
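
For illustration, such a pattern could be expanded with str.format; the pattern and values below are made up, not an agreed layout:

import datetime

repository_dir_pattern = "{root}/{mission}/{sensor}/{collection}/{year}/{month:02d}/{day:02d}"

sensing_date = datetime.date(2017, 7, 15)  # e.g. taken from the product metadata
repository_dir = repository_dir_pattern.format(
    root="/var/data/repository", mission="SENTINEL", sensor="S2",
    collection="S2_MSI_L1C", year=sensing_date.year,
    month=sensing_date.month, day=sensing_date.day)
# -> /var/data/repository/SENTINEL/S2/S2_MSI_L1C/2017/07/15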

How about using Relative URLs as defined by RFC 1808, which would make them easy to parse with urlparse?

The basic reasoning for the flattened and admittedly complex config items was to simplify the code. We can simplify the config with URL/DB connections like you propose, but as a drawback we'll get more complex code:

from

catalog = Catalog(CFG.geoserver_rest_url, CFG.geoserver_user, CFG.geoserver_password)

to

url = urlparse(CFG.geoserver_url)
catalog = Catalog('%s://%s:%s%s' % (url.scheme, url.hostname, url.port, url.path),
                  url.username, url.password)

Sure, in this example it is not a big deal. Personally, however, I prefer to have more config items if they help to make the code more compact and readable.

For rsync+ssh, we could instead apply the same URL rules as above plus generate a suitable ~/.ssh/config file like ... and the corresponding setting would be rsync_url = ssh://geoserver.example.com/? ... As for collections config, I'd apply the same reasoning:

As above, should we define these as config items or derive them in the code? I am fine with both approaches. Let's see what others think.

It also seems to me to be more natural to order paths according to their collection, instead of according to their role in the system.

+1

Regarding the other points:

  1. I am a bit skeptical about ditching the override functionality and using the env for all installation-specific settings, because I think there are quite a lot of them (even e.g. the disk quotas for specific collections are env-specific..). Also, I then cannot use e.g. vault in Python code to retrieve passwords anymore, but am forced to do this in a private .env file that gets either sourced in entrypoint.sh or declared in docker-compose.yml. Again, this moves complexity out of the application config code to somewhere else (here, the environment). Anyway, like you I am happy to create .env files for our environment if we decide to go down this road.. :)

  2. Ok, but the collection configs then need to be prefixed like this, right? (See the sketch after this list.) In settings.py:

    s2_msi_l1c.id = "S2_MSI_L1C"
    ...
  3. The dict would make it harder to override specific settings (and doesn't have IDE auto-completion support, AFAIK), but with a single settings file this would no longer be an issue. So, ok for me.

  4. We can review this later on; I just wanted to point out that this will be important for the configuration as well, since AOIs might override global collection settings.
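
A sketch of how such prefixed names could actually work in a single settings.py (hypothetical; a plain attribute-style namespace instead of a dict, which also keeps IDE auto-completion usable, see note 3):

# file: settings.py (hypothetical single-file layout)
class _Namespace(object):
    def __init__(self, **items):
        self.__dict__.update(items)

s2_msi_l1c = _Namespace(
    id="S2_MSI_L1C",
    platformname="Sentinel-2",
    filename_filter="S2*_MSIL1C*",
)

# usage elsewhere:
#   from config import settings
#   print(settings.s2_msi_l1c.id)  # S2_MSI_L1C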

randomorder commented 7 years ago

@torse

I guess in the end it is just a matter of perspective: is config.py code, and is a sourced .env file configuration? Regardless, at some point the application has to deal with "configuration items" collected from different sources. With the flexibility of the "Python config" approach one can handle this close to the application, and can even switch relatively easily to the airflow.cfg approach later if needed, without touching the DAG/plugin code (just replace e.g. resource_base_dir = '/var/data/resources/' with resource_base_dir = configuration.get("core", "resource_base_dir")).

Looks like a win-win situation: we get flexibility, and can fall back to the cfg method later on if we want to.

I am not so happy with the setup of the __init__.py. Maybe glob the config files and iterate over them instead of the try/except blocks? Anything else?

Yes, it is a bit tricky to do without knowing how Airflow loads the DAG files. I have tried other ways without success. I'll let you know if I find a better way.

It might enhance readability if we strip the "collection_" prefix from the collection configurations (S2.collection_id -> S2.id).

+1