yuce commented 7 years ago

Pilosa Configuration Proposal

Authors

Yuce Tekol yuce@pilosa.com
Cody Soyland cody@pilosa.com
Matt Jaffee jaffee@pilosa.com

Change log

2017-02-27: Updated default data directory
2017-02-09: Updated Address to include Scheme
2017-02-03: Second draft
2017-02-01: First draft

Abstract

This document contains the proposals for command line, configuration file and environment variable naming and their priorities.

Overview

Configuration is one of the most important parts of software. If a consistent naming convention is not used, it may become hard to configure software. If the software doesn't support the same configuration as in the documentation, the users of the software may get frustrated. Because of those reasons, it is beneficial to have a single reference of configuration options, which Pilosa developers use during development and which can be used to write/update the documentation.

Most common configuration comes from the command line, a configuration file and environment variables. What set of options should be supported by these configuration sources? If the same option is used in two different sources, which should have the upper hand? These questions should be clearly answered to decrease the number of surprises during operation of the software.

In summary, the aims of this document are:

To be an up-to-date reference of configuration supported by Pilosa. Ideally this document should contain information about all public Pilosa configuration at all times.
To lay the foundation for common terms used during configuration and define them.
To propose uniform names for configuration options in order to decrease developer surprises.
To clearly show the priorities between different configuration sources in order to have fewer surprises during the operation of Pilosa.

In Table 1 below, current configuration options and defaults are presented:

Description	Configuration File	Command Line Flag	Default
Configuration file	(N/A)	-config	-
Data directory	data-dir	-	~/.pilosa
Host	host	-	127.0.0.1:15000
Number of replicas in the cluster	[cluster]/replicas	-	-
Cluster hosts	[[cluster.node]]/host	-	-
Cluster polling interval	[cluster]/polling-interval	-	60 seconds
Cluster messenger type	[cluster]/messenger-type	-	-
Cluster gossip seed address	[cluster.gossip]/seed	-	-
Cluster gossip port	[cluster.gossip]/port	-	-
Plugin path	[plugins]/path	-	-
Anti entropy interval	[anti-entropy]/interval	-	600 seconds
CPU profile	-	-cpuprofile	-
CPU profile duration	-	-cputime	30 seconds

Proposal

In order to have uniform meaning and representation we use the following terms in our proposal:

Directory: The configuration requires a directory. Aliases: dir, DIR
Path: The configuration requires a filename. Alias: PATH
Scheme: The protocol of the address.
Host: The domain name, hostname or IP address and the port. Alias: HOST
Port: Numerical port of the service: port. Alias: PORT
Address: Complete address of a service. Aliases: bind, BIND
- SCHEME://HOST:PORT form: Specify scheme, host and port
- HOST:PORT form: Specify host and port, use the default scheme
- HOST form: Specify the host and use the default port and scheme
- :PORT form: Specify the port and use the default host and scheme
- SCHEME://HOST form: Specify scheme, host and use the default port
- SCHEME://:PORT form: Specify scheme and port and use the default host
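
The six address forms above can be handled by one small parser that fills in whatever is missing. The following is a minimal sketch, not the Pilosa implementation: the names `Address` and `ParseAddress` are hypothetical, the default scheme `http` is an assumption, and the default host and port are taken from the proposed HTTP address default. IPv6 literals are deliberately not handled.

```go
package main

import (
	"fmt"
	"strings"
)

// Address holds the three parts of a service address.
type Address struct {
	Scheme string
	Host   string
	Port   string
}

// Assumed defaults for illustration only.
const (
	defaultScheme = "http"
	defaultHost   = "127.0.0.1"
	defaultPort   = "10101"
)

// ParseAddress accepts any of the six forms listed above and fills in
// defaults for the parts that are missing. IPv6 literals are not supported.
func ParseAddress(s string) (Address, error) {
	addr := Address{Scheme: defaultScheme, Host: defaultHost, Port: defaultPort}
	if s == "" {
		return addr, fmt.Errorf("empty address")
	}
	// Split off the optional SCHEME:// prefix.
	if i := strings.Index(s, "://"); i >= 0 {
		if i == 0 {
			return addr, fmt.Errorf("empty scheme in %q", s)
		}
		addr.Scheme = s[:i]
		s = s[i+3:]
	}
	// Split the remainder into HOST and PORT on the last colon.
	host, port := s, ""
	if i := strings.LastIndex(s, ":"); i >= 0 {
		host, port = s[:i], s[i+1:]
		if port == "" {
			return addr, fmt.Errorf("missing port after colon in %q", s)
		}
	}
	if host != "" {
		addr.Host = host
	}
	if port != "" {
		addr.Port = port
	}
	return addr, nil
}

// String renders the address in the full SCHEME://HOST:PORT form.
func (a Address) String() string {
	return a.Scheme + "://" + a.Host + ":" + a.Port
}

func main() {
	for _, s := range []string{"https://example.com:8080", "example.com", ":9000", "https://:9000"} {
		a, err := ParseAddress(s)
		fmt.Println(a, err)
	}
}
```

Normalizing every form to the full SCHEME://HOST:PORT representation means the rest of the code only ever deals with one shape of address.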

We renamed Host, which in the current configuration actually means an address, to HTTP address. Table 2 below summarizes the proposed configuration, showing added or changed configuration in bold.

Description	Required	Type	Default	Notes
Configuration path	N	string	-	Must exist
Data directory	Y	string	`$HOME/.pilosa`
HTTP address	Y	string	127.0.0.1:10101
Other protocol address	N	string	-	Reserved
Number of replicas in the cluster	N	int	1
Cluster node addresses	N	list of addresses	empty list
Cluster polling interval	N	int	60 seconds
Cluster messenger type	N	string	-
Cluster gossip seed address	N	string	127.0.0.1:25000
Plugin directory	N	string	-	Must exist
Enabled plugins	N	list of strings	all plugin names if the plugin directory is specified, otherwise an empty list
Plugin configuration	N	sections	-	Each plugin may have a separate configuration section
Anti entropy interval	N	int	600 seconds
CPU profile	N	string	-
CPU profile duration	N	int	30 seconds
Log path	Y	string	stdout

Priority of Configuration Sources

The configuration may be specified on the command line, in an environment variable or in the configuration file; otherwise the default for that configuration is used. Because the same option can be set in more than one source, the priority between these sources should be defined. Below is our proposed priority of sources, where higher level sources override the lower ones:

Command line
Environment variable
Configuration file
Default
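
The priority order above amounts to a first-match lookup over an ordered list of sources. Below is a minimal sketch of that idea using only the standard library; the `Source` type, the `Resolve` function and the sample values (such as `/var/lib/pilosa`) are hypothetical and only serve the illustration.

```go
package main

import "fmt"

// Source is one place a configuration value can come from. The map holds
// only the options that were explicitly set in that source.
type Source struct {
	Name   string
	Values map[string]string
}

// Resolve returns the value for key from the first source that sets it,
// together with that source's name. Sources must be ordered from highest
// to lowest priority.
func Resolve(key string, sources []Source) (value, from string, ok bool) {
	for _, src := range sources {
		if v, found := src.Values[key]; found {
			return v, src.Name, true
		}
	}
	return "", "", false
}

func main() {
	// Ordered as proposed: command line, environment, config file, default.
	sources := []Source{
		{Name: "command line", Values: map[string]string{}},
		{Name: "environment", Values: map[string]string{"bind": "0.0.0.0:10101"}},
		{Name: "config file", Values: map[string]string{"bind": "10.0.0.5:10101", "data-dir": "/var/lib/pilosa"}},
		{Name: "default", Values: map[string]string{"bind": "127.0.0.1:10101", "data-dir": "$HOME/.pilosa"}},
	}
	for _, key := range []string{"bind", "data-dir"} {
		v, from, _ := Resolve(key, sources)
		fmt.Printf("%s = %s (from %s)\n", key, v, from)
	}
}
```

In this example `bind` is taken from the environment because the command line does not set it, and `data-dir` falls through to the configuration file.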

Command Line

One of the debates in the computing world is the number of dashes before a flag. BSD style utilities do not use any dashes and allow only single letter flags. The GNU convention is to use double dashes (--) before the standard form of a flag and a single dash (-) before the alternative (short) form. Some software uses minus (-) to denote removal of a feature and plus (+) to denote addition. Java, Go and Erlang use a single dash before flags in their command line tools. In this proposal we opted for GNU style flags, based solely on the observation that far more modern software uses that convention than the single dash style, so developers using modern UNIX and UNIX-like OSs are likely to be used to it. It should be noted that Go’s standard flag parser doesn’t differentiate between single and double dashes.

In light of the discussion above, always use lowercase flags with double dashes (--) to denote the long form and a single dash (-) to denote the alternative form. Only the most used flags should have an alternative form. Use the dash (-) character as the word delimiter. Command line options should preferably be as short as possible without losing their meaning. Table 3 is below:

Description	Standard Form	Alternative Form	Notes
Configuration path	`--config`	`-c`
Data directory	`--data-dir`	`-d`
HTTP address	`--bind`	`-b`
Other protocol address	`--bind-PROTOCOL`	-	Reserved
Number of replicas in the cluster	`--cluster.replicas`	-
Cluster node addresses	`--cluster.hosts`	-	Use space as a delimiter in the form: `HOST1:PORT1 HOST2:PORT2`
Cluster polling interval	`--poll-interval`	-
Cluster messenger type	`--messenger`	-
Cluster gossip seed address	`--gossip`	-
Plugin directory	`--plugin-dir`	-
Enabled plugins	`--plugins`	-	Use space as a delimiter in the form: `PLUGIN1 PLUGIN2`
Plugin configuration	`--plugin-PLUGIN_NAME`	-	Reserved
Anti entropy interval	`--anti-entropy.interval`	-
CPU profile	`--profile.cpu`	-
CPU profile duration	`--profile.cpu-time`	-
Log path	`--log`	-
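
The earlier remark that Go's standard flag parser does not differentiate between single and double dashes can be checked with a few lines. This is only a sketch using the standard library `flag` package; the helper name `parseDataDir` is hypothetical, and the paths are placeholders.

```go
package main

import (
	"flag"
	"fmt"
)

// parseDataDir parses args with the standard library flag package and
// returns the value of the data-dir flag.
func parseDataDir(args []string) (string, error) {
	fs := flag.NewFlagSet("pilosa", flag.ContinueOnError)
	dataDir := fs.String("data-dir", "$HOME/.pilosa", "directory to store Pilosa data")
	if err := fs.Parse(args); err != nil {
		return "", err
	}
	return *dataDir, nil
}

func main() {
	for _, args := range [][]string{
		{"--data-dir", "/tmp/a"}, // double dash, value as a separate argument
		{"-data-dir=/tmp/b"},     // single dash, value after "="
		{},                       // flag not given: the default is used
	} {
		dir, err := parseDataDir(args)
		fmt.Println(dir, err)
	}
}
```

Note that the standard package has no notion of a separate short form such as `-d`; supporting the alternative forms in Table 3 would need either a second flag registration or a GNU style flag library.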

Configuration File

The configuration file is in the TOML format. Use lowercase section and key names. Use dash (-) character as the word delimiter. Table 4 is below:

Description	Section/Key	Notes
Configuration path	(N/A)
Data directory	data-dir
HTTP address	bind
Other protocol address	bind-PROTOCOL
Number of replicas in the cluster	[cluster]/replicas
Cluster node addresses	[cluster]/hosts	Array of addresses
Cluster polling interval	[cluster]/poll-interval
Cluster messenger type	[cluster]/messenger-type
Cluster gossip seed address	[cluster]/gossip-seed
Plugin directory	plugin-dir
Enabled plugins	enabled-plugins	Array of plugin names
Plugin configuration	[plugin.PLUGIN_NAME]	Section
Anti entropy interval	[anti-entropy]/interval
CPU profile	[profile]/cpu
CPU profile duration	[profile]/cpu-time
Log path	log-path

Environment Variables

Most prominent deployment and orchestration tools, such as Puppet, Chef, Ansible and Docker, support environment variables as a way to pass configuration to a program. Moreover, environment variables are the preferred way of passing configuration under some application structuring conventions, such as the Twelve-Factor App.

All environment variables are uppercase, with underscore (_) used as the word delimiter. Some deployment tools (such as Puppet) seem to be unable to set environment variables per process (only for the whole system). In order to avoid inadvertent configuration, the PILOSA_ prefix must be used. Table 5 is below:

Description	Variable Name	Notes
Configuration path	PILOSA_CONFIG_PATH
Data directory	PILOSA_DATA_DIR
HTTP address	PILOSA_BIND
Other protocol address	PILOSA_BIND_protocol	Reserved
Number of replicas in the cluster	PILOSA_CLUSTER.REPLICAS
Cluster node addresses	PILOSA_CLUSTER.HOSTS
Cluster polling interval	PILOSA_CLUSTER.POLL_INTERVAL
Cluster messenger type	PILOSA_CLUSTER_MESSENGER_TYPE
Cluster gossip seed address	PILOSA_GOSSIP
Plugin directory	PILOSA_PLUGIN_DIR
Enabled plugins	PILOSA_PLUGINS
Plugin configuration	PILOSA_PLUGIN_plugin_name	Reserved
Anti entropy interval	PILOSA_ANTI_ENTROPY.INTERVAL
CPU profile	PILOSA_PROFILE.CPU
CPU profile duration	PILOSA_PROFILE.CPU_TIME
Log path	PILOSA_LOG_PATH
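
For the rows in Table 5 that are derived from a dotted configuration key, the variable name follows mechanically from the key: uppercase it, replace dashes with underscores and add the PILOSA_ prefix. A minimal sketch of that mapping follows; the function name `EnvName` is hypothetical.

```go
package main

import (
	"fmt"
	"strings"
)

// envPrefix guards against picking up unrelated environment variables.
const envPrefix = "PILOSA_"

// EnvName converts a configuration key such as "cluster.poll-interval" into
// the matching environment variable name. Dots are kept as they are.
func EnvName(key string) string {
	name := strings.ToUpper(strings.ReplaceAll(key, "-", "_"))
	return envPrefix + name
}

func main() {
	for _, key := range []string{"data-dir", "cluster.poll-interval", "profile.cpu-time"} {
		fmt.Printf("%-24s %s\n", key, EnvName(key))
	}
}
```

One caveat worth keeping in mind: POSIX shells do not accept a dot in a variable name, so names such as PILOSA_CLUSTER.HOSTS cannot be set with a plain `export` and would have to be passed some other way, for example through the `env` utility.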

Implementation

The Config structure in config.go should be modified to match Table 2. The (m *Main) ParseFlags(args []string) method in cmd/pilosa/main.go should be moved to config.go and become a method of Config. New command line flags should be added to that method. A new method which reads configuration from environment variables should be added. Ideally, Config should have a method which reads from all configuration sources and applies the priorities mentioned in this document to the fields of Config.

yuce commented 7 years ago

Broke out the Survey part of the proposal into this comment, since it doesn't really fit into the proposal itself:

Survey

In this section we present a selection of configuration options supported by other databases.

CockroachDB

Some of the command line flags CockroachDB supports are:

--host: default localhost
--port: default 26257
--store: changes data store location, default: cockroach-data
--background: runs the server as a daemon
--join: joins node to a cluster, argument is in the form host:port

Environment variables are supported. A few examples:

CDB_DATA_DIR
CDB_CERTS_DIR

We found no information about how to configure CockroachDB with a configuration file.

InfluxDB

InfluxDB supports the following non-exhaustive list of flags:

-database
-host: default: localhost
-port: default: 8086

InfluxDB supports TOML formatted configuration files. Example:

[data]
  dir = "/var/lib/influxdb/data"
  query-log-enabled = true

Environment variables are supported in the form: INFLUXDB_config-section-name_option-name.

OrientDB

Following are some of the command line flags supported by OrientDB:

-h, --host: default localhost
-P, --ports: single port or port range, defaults to: 2424-2430
-u, --user: default: root
-p, --password: mandatory user password (default: root)

Configuration file is in XML format. Sample configuration:

<properties>
    <entry name="cache.size" value="10000" />
    <entry name="storage.keepOpen" value="true" />
</properties>

Environment variables are supported. Below are some examples:

CONFIG_FILE
ORIENTDB_LOG_CONF
ORIENTDB_PID

Redis

Redis supports the following command line flags and more:

--port: default 6379
--bind: the interface to listen (default: 0.0.0.0)
--daemonize
--pidfile

Redis has a simple configuration file format, which lists keys and values separated by whitespace. The same keys and values may be specified on the command line. Sample configuration:

daemonize no
pidfile /var/run/redis.pid
port 6379
bind 127.0.0.1

Redis does not support configuration using environment variables.

RethinkDB

Below are some of the flags supported by RethinkDB:

-d, --directory: The directory to store the data
--daemon: Run the server as a daemon
--log-file: Specify the log file
--config-file: Specify the configuration file
--bind: Specify the address to listen to, default localhost

RethinkDB uses a simple configuration file with configuration specified as KEY=VALUE lines. Here’s a sample:

pid-file=/var/run/rethinkdb/rethinkdb.pid
bind=127.0.0.1
cluster-port=29015

We found no information about environment variable support of RethinkDB.

jaffee commented 7 years ago

What's the plan for actually implementing the cascading configuration stuff? I know there are libraries which do a lot of this, but they have their own pros and cons. Any thoughts on using a library vs rolling our own?

yuce commented 7 years ago

@jaffee re: cascading, do you mean configuration priority ?

travisturner commented 7 years ago

Overall I really like the direction of this. There are a few things I might suggest changing to add clarity (for example, instead of -bind, use -bindaddr or bind-address or something like that).

Also, as for "One of the debates in the computing world is the number of dashes before a flag...", I fall on the "single dash" side of that debate.

jaffee commented 7 years ago

@yuce yes, by cascading I mean configuration priority

yuce commented 7 years ago

Single-dash/double-dash debate is mostly about tastes, so I guess we can't go wrong by picking any of them as long as we are consistent. Should we vote on that or any other way to resolve that?

The most important thing for me is consistently using a single flag anywhere an address is required (instead of specifying host and port). I've proposed bind since host is a bit overloaded, and when I hear that I immediately look for a port option (but host is very prevalent). IMO bindaddr is a bit long, how about addr for HTTP and PROTOCOL-addr for other protocols?

yuce commented 7 years ago

@jaffee It never occurred to me there would already be libraries doing that. If there's something we can use, why not? Is there any you can recommend?

jaffee commented 7 years ago

@yuce I only have experience with viper - unfortunately it pulls in a lot of dependencies we don't need (they may have fixed this so you can opt out of them.) Might be a good starting point though.

yuce commented 7 years ago

Would using http instead of bind or host make sense? We could use PROTOCOL for other protocols, like protobuf:

$ pilosa -http localhost:5000 -protobuf localhost:6000

travisturner commented 7 years ago

@yuce can you investigate viper and let us know the pros/cons.

jaffee commented 7 years ago

Based on discussion on https://github.com/pilosa/pilosa/issues/273 around moving pilosactl commands under pilosa, we might also consider using viper's counterpart "cobra" which is a library for creating CLIs (which uses viper for config).

yuce commented 7 years ago

Viper looks good. It has about 10 dependencies, but I guess we can give it a try. Cobra for adding subcommands looks good too. Both libraries depend on pflag which is from the same developer. That library implements a Go flag compatible library supporting GNU style flags. I think we can make use of that too.

yuce commented 7 years ago

Updated the proposal with the following:

Directory: The configuration requires a directory. Aliases: dir, DIR
Path: The configuration requires a filename. Alias: PATH
Scheme: The protocol of the address.
Host: The domain name, hostname or IP address and the port. Alias: HOST
Port: Numerical port of the service: port. Alias: PORT
Address: Complete address of a service. Aliases: bind, BIND
- SCHEME://HOST:PORT form: Specify scheme, host and port
- HOST:PORT form: Specify host and port, use the default scheme
- HOST form: Specify the host and use the default port and scheme
- :PORT form: Specify the port and use the default host and scheme
- SCHEME://HOST form: Specify scheme, host and use the default port
- SCHEME://:PORT form: Specify scheme and port and use the default host

codysoyland commented 7 years ago

I reviewed all the discussion and overall I'm happy with Yuce's proposal as it exists currently. I'm not strongly opinionated either way on "-" vs. "--", but it seems that cobra chose the "--" route, so I'm ok with that. I think we should go ahead and plan on implementing this.

My only comment on the proposal is this. In table 2, the following configuration options are required: data directory, http address, and log path. Could we not choose sensible defaults ($HOME/.pilosa, 127.0.0.1:15000, and /dev/stdout) and not require them to be specified? That's how things currently work. It seems like we don't really have to have any required options.

jaffee commented 7 years ago

I agree @codysoyland

Running pilosa (or pilosa server assuming we go the subcommand route with cobra) should always start pilosa. Don't require a new user to fumble with several flags just to get running for the first time.

jaffee commented 7 years ago

Ah, I had forgotten the decision from https://github.com/pilosa/pilosa/issues/273 was to use subcommands for sure.

yuce commented 7 years ago

Thanks for your comments. In table 2 http address and log path are required, but they have defaults (http://localhost:15000 and stdout respectively) so the user should only specify the data directory. The reason is that it may be hard to determine a location which is standard/expected on all platforms (e.g., it may be a bit strange to have the .pilosa directory on Windows, since there's no similar convention for naming hidden files there). Also, I am not sure why the default directory should be hidden on UNIX platforms.

Do you guys have any suggestion for the default data dir? Should we just keep $HOME/.pilosa ?

Keeping pilosa a single executable and making use of subcommands makes a lot of sense. @jaffee Was there a decision on whether we would use pilosa server or pilosa run ? I will update the proposal accordingly.

I guess all the cards are on the table about using single or double dashes so I'll update the proposal according to @travisturner 's decision.

travisturner commented 7 years ago

double dashes is fine.

I don't think the user should have to specify the data directory. keeping the default to $HOME/.pilosa makes sense to me.

yuce commented 7 years ago

Updated the proposal with the default data directory set to $HOME/.pilosa.

jaffee commented 7 years ago

Removed needs-decision, as I think this is pretty ready-to-go. @yuce I'm happy to implement this, but if you want to do it, I think you have the right of first refusal given all your work on this proposal.

yuce commented 7 years ago

@jaffee That's perfectly OK; you've already worked with viper/cobra, so you have more experience with it anyway. One thing that would be great to have is some kind of testing for configuration, command line args, etc. (I have a few ideas about this, will try to write them down/implement a prototype later)

jaffee commented 7 years ago

@alanbernstein suggested changing the default pilosa port from 15000 to 10101. I'm going to do that unless there are objections.

yuce commented 7 years ago

I think changing the port to 10101 is both fascinating and not very useful at the same time.

travisturner commented 7 years ago

@jaffee can you expand on the thinking behind that port change suggestion

jaffee commented 7 years ago

It's kind of funny? Since it's "binary".

That's really about it. @codysoyland mentioned that he wasn't a fan of 15000 and then alan said 10101 and we all thought that sounded perfect.

jaffee commented 7 years ago

Changed default cpu profile duration to 30 seconds (from 30 nanoseconds). 30 ns isn't really a useful amount of time to collect a profile.

Changed --data (cmd line flag) to --data-dir so that it matches configuration file.

The way I'm implementing this, the env variables, config file and cmd line are all going to have to match (except that the env variables will be all caps, prefixed with PILOSA_, and any dashes will be underscores).

jaffee commented 7 years ago

Due to the way viper works, any command line flags which are represented in something nested in the config file, will have to be similarly nested with dots on the command line and in the environment, so

[cluster]
    hosts = ["localhost:15000","localhost:15001"]

will look like --cluster.hosts="localhost:15000,localhost:15001" on the command line and PILOSA_CLUSTER.HOSTS as an environment variable.

I will update the original ticket and catalog any edits in comments in case anyone takes issue with the changes.

jaffee commented 7 years ago

changed:

replicas to cluster.replicas
nodes to cluster.hosts
--poll to --poll-interval
antientropy to anti-entropy.interval
profile to profile.cpu
profile-duration to profile.cpu-time

(I will edit this comment with further changes)

yuce commented 7 years ago

Updated authors and changed the default port to 10101

yuce commented 7 years ago

@jaffee I thought https://github.com/pilosa/pilosa/pull/394 implemented some of this proposal, and remaining parts would be implemented in subsequent PRs, e.g., log, plugin and gossip related ones. Does it make sense to keep the ticket open until they are implemented?

jaffee commented 7 years ago

That's a good point @yuce, we should capture that.

I feel like that functionality is separate from the notion of "how do we do config" which is what this covers, and that we should break it out into separate tickets. Especially since some of it (like plugins) may not get done for a long time, and most of the work for those things will be outside of the config code.

Of course the relevant portions of this ticket will live on in the documentation, but I'd prefer we not leave it open getting stale for potentially many months.

I'll create a ticket for the log path stuff - I think the other two will be up to the implementers of that functionality to figure out what the best set of flags are that are needed to support it.
