Open kishoreg opened 4 years ago
Proposal
Quickstart
Standalone
clustered mode
Ingest data
Irrespective of quickstart, standalone or clustered mode, the steps to ingest data will be the same. Create Table
bin/pinot-admin.sh add-table \
-table-config conf/table.json \
-table-schema conf/schema.json
Batch Ingestion
bin/pinot-ingestion-job.sh \
-job-spec conf/batch_ingestion_spec.json
This will help us restrict the documentation to
Thoughts?
I like the idea of providing a flexible, simplified and unified startup experience regardless of the phase type.
I have a simple additional requests which might be worth a dedicated topic on its own. Yet I'll mention it here because it might influence the design.
A <service>.config
file will indeed simplify the configuration management for documentation and code maintenance. Right now, args needs to be declared in the CLI options class as well as in a Commons Configuration properties class. I would like to suggest that any properties loaded from <service>.config
could be also passed as environment variable and CLI argument.
Config files are great for packaging config profiles that you ship with your deployment but are not friendly to Cloud Native deployments. For Docker and Kubernetes, they would require declaring and mounting volumes when a simple ENV VAR or CLI arg would have done the job. Pinot uses Apache Commons Configuration which supports Composite configuration properties classes that allows to define a hierarchy of sources for configuration. Done right, I think there is an importunity to support pluggable configurations provider (could even support config providers such as Vault or cloud KSM for instance).
In short, providing support and quick start examples based on config file is a good choice, but power users and clusters operators would certainly benefits from env var and cli args support since they would enable cleaner and simplified deployment manifest without requiring volumes mounts.
The Spring project implements an Environment
abstraction with prioritized config sources decoupled from the application code which should be a good inspiration to follow. Even Quarkus has that concept.
@daniellavoie Great points and it makes sense given we do have to consider Kubernetes/Docker
Let's try to summarize the different things we want to support
we can obviously enhance start-pinot-service to support other options in future.
If you are designing CLI here are the few suggestions,
Command should be #pinot in the terminal,
Usage:
pinot [command] [options]
Available Commands:
start
stop
Available Options:
batch (quickstart-batch)
stream (quickstart-stream)
hybrid (quickstart-hybrid)
zookeeper(standalone-zookeeper or clustered-zookeeper)
services(standalone-services)
controller(clustered-service)
broker(clustered-service)
server(clustered-service)
Flags:
-h, --help
-v, --verbose etc.
export PINOT_CONF=/some/location/<all-in-one.config> or <zoo.cfg> or <controller.config> or <broker.config> or <server.config>
export PINOT_MODE=quickstart / standalone / clustered
Use "pinot [command] [options] --help" for more information about a command.
Instead of passing the configurations to the commands we can export the configuration folder so based on the options choice, it can select expected configurations, this will be more flexible in the future for expanding to other options.
@satishbellapu, thanks. This is definitely better than what I suggested. @daniellavoie what do you think?
Definitely an upgrade! This is a lot more consistent to the experience you get in the go / node cli world.
I feel the CLI should start the service in foreground by default unless specified otherwise. As such, only the stop
command should only be accepted when a command like start-background
was used. Starting in foreground sounds more natural for Docker and Kubernetes deployments. Also, if you are starting the CLI as a developper, you get immediate output in your terminal without looking for log files anywhere.
As per my previous comment, I'm working on supporting an abstraction that decouples the property sources from the code base. This is a heavy refactoring that introduces a wrapper around the org.apache.commons.configuration.Configuration
used all over Pinot. This wrapper introduces a concept of CompositeConfiguration
that can accept variables from CLI args, environment variables and property files specified by config.paths
. This new wrapper also introduces a concept of relaxed binding. What it means is that it knows how to translate environment variables such as PINOT_CONTROLER_PORT
to properties controller.port
. When properties are inserted and fetched from the environment, they go through relaxing routine that standardize their key so that lookup such
as config.getPropert("controller.port")
return the value even if it was defined by the PINOT_CONTROLER_PORT
env variable . As I exposed in my previous comment, this is intended to offer a transparent mechanism to drive Pinot configuration without requiring config files that are hard to reference in a deployment manifest without volumes.
Given the new CLI. I would think it would be nice if any property loaded in org.apache.commons.configuration.Configuration
could be passed as argument in the startup command, e.g.:
$ pinot start controller --help
Usage:
pinot start controller [options]
Available Options:
--help, -h
--debug
--controler.port=<port>
--controler.host=<hostname>
--cluster-name=<cluster-name>
--[etc]
...
Refer to Pinot Documentation at https://docs.pinot.apache.org/basics/components/controller#starting-a-controller for full reference of all available properties for the Controller.
Now, if we want to allow any Pinot config properties to be passed through a pinot
cli, I feel we might have to drop args4j
since it requires any CLI argument to be explicitly annotated in a SubCommand
object. I think overall, Pinot has more than 100 properties used all over the place and does not feel like it would be maintainable over time to require manual binding between a SubCommand
class and the org.apache.commons.configuration.Configuration
class. Rewriting a custom CLI with his own main startup class would grant access to the main args
and let us parse all properties and inject them in the new CompositeConfiguration
to merge them with the env variables and explicit config files.
Last but not least, it would also remove the property validation made by the args4j
annotation properties. With this mechanism, we have inconsistent validation between what is passed by the CLI and what is passed as config file properties. This is an opportunity to centralize validation regardless of the property source.
@daniellavoie few comments,
Foreground or background :
I feel Foreground option will not be much useful unless for the quick start or someone trying the system for the first timers, because you have to close the services and relaunch in background for long running the system, instead of that we can support only one mode which is background and add the following command for debugging or exploration purpose,
pinot start [options] //always launch the services background
pinot log [options] //ex., pinot log zookeeper
Should tail the zookeeper logs for debug purpose and also not required to close the session and relaunch the for continues running, CLI will not be restricted for development only can be used for production as well.
After rechecking @kishoreg requirements for ingest, we can add load command to the CLI,
Ex.. pinot load [load.config]
load-file.config
//content
table.name : [table-name]
data-to-load : [pathc/batch_ingestion_spec.json]
This way the tool can be used for multipurpose and also can launch commands one after the other without killing the CLI and that will be so helpful for having auto launcher scripts by support teams/devOps etc.
Available Commands:
start
stop
log
load
Passing Pinot properties to the CLI.
I feel adding options like --controler.port=<port> or --cluster-name=<cluster-name>
to the CLI will be bad design and can’t be scalable, we should restrict the options at service level like,
Options:
zookeeper
broker
controller etc
And the options like brokers ulr, hosts or ports etc should be part of the properties/config file, that way you can still parse by args4j annotation properties and validate before passing to the systems, and at CLI level you only need to validate on predefine commands and options.
Foreground or background :
I feel Foreground option will not be much useful unless for the quick start or someone trying the system for the first timers, because you have to close the services and relaunch in background for long running the system, instead of that we can support only one mode which is background and add the following command for debugging or exploration purpose,
I have to disagree here. We should pause this conversation and take a step back and set ourselves goals. One of the big theme that I would like to see improvement is to make Pinot more friendly to Cloud Native deployments. If you want to run something in a container based environnement, it needs to run in foreground to fully embraces all the container lifecycle management offered by the container orchestrator. Logging management is also simplified because your container orchestrator only needs to retreive the stdout and stderr to forward to a logging platform. As for configuration management; environment variables and cli arguments composes the best practices to operate Cloud Native services.
I do agree that there is use cases to run Pinot a a traditional Service Daemon. Once you have a well designed foreground service launcher, it is really easy to wrap it with a service package (such as Systemd) that turns it in a background process. Let me be clear, we have to offer an option for operators to run Pinot in foreground. If our base operating model is background, then we have to retrofit foreground semantics and it will an awkward exercice. It a lot easier to turn a foreground process to background.
I feel adding options like --controler.port=
or --cluster-name= to the CLI will be bad design and can’t be scalable, we should restrict the options at service level like. And the options like brokers ulr, hosts or ports etc should be part of the properties/config file, that way you can still parse by args4j annotation properties and validate before passing to the systems, and at CLI level you only need to validate on predefine commands and options.
As I explained earlier, property files are not what best practices guide to operate Cloud Native software. Works great for local dev environment but is not what is recommended for operating software on Cloud platforms. This is not bad design. Modern cloud native software all adopt this pattern. I already have a prototype of Pinot that uses this configuration pattern in a scalable way. Maintaining specific binding through args4j
and manually wiring them the system configuration is, in my opinion, not scalable. Pinot should offer the best "ready to go" default properties and let you overwrite them the way that pleases you.
@satishbellapu, my understanding is that you wish to offer a better "Pinot in a box" experience for developers who wish to iterate quickly on their environment. I think there is a lot of value there. My conclusion is that maybe we should have separate story to align when these two different goals. I honestly think they should be materialized in two separate tools. The operator focused pinot
cli that aims to offer maximum flexibility for operators to run Pinot Services with freedom of choice regarding their target platform. And a pinot-dev
cli that downloads, install dependencies and runs Pinot Services in an opinionated way to get productive. Under the hood, it would make a lot of sense for it to run pinot with the "lower level" pinot
CLI. What do you think of that suggestion?
We should pause this conversation and take a step back and set ourselves goals.
Yes I agree, I was recommending the CLI, for non-container based deployments (not for docker/kube/gcp/az etc.), I was trying to improve the initial proposals from @kishoreg they are more specific requirements towards the local cluster than cloud.
One of the big theme that I would like to see improvement is to make Pinot more friendly to Cloud Native deployments.
I am not sure you really want to have a wrappers on top of kubctl/docker/aws/gcp etc I see good documentation with cloud native deployments from pinot, and more specifically users will be more familiar with the kubctl and eksctl cmds. (sorry for my ignorance if this is totally not making sense, understandable because I am new to the pinot and not sure of the usability issues.)
my understanding is that you wish to offer a better "Pinot in a box" experience for developers who wish to iterate quickly on their environment. I think there is a lot of value there. My conclusion is that maybe we should have separate story to align when these two different goals
Yes totally agree, container based deployment vs non container base deployments should have different CLI’s.
There are a lot of good ideas in this thread. @daniellavoie, can you please summarize and edit the description. We can tag the requirements in as phase1, phase 2 etc
Users typically go through 3 phases.
Today, each phase has a different set of scripts which makes it hard to write the user documentation.
We should refactor some of the commands to make this process simpler.
The current set of scripts are not articulated around personas. You currently need to pick a script based on how you want to run Pinot. How you intended to run Pinot (quickstart, standalone and cluster) does not change how you expect the service behaves.
A good example of an opinionated behavior based on your persona is if Pinot should run in background or foreground. Developers would prefer a "background service oriented" mode where best practice for operations would be to run as foreground. When running in foreground, you can wrap the process with any service manager or container orchestrator that is standard to the operator's organization.
This new pinot
command is basically a rename of the existing pinot-admin.sh
only that it stripped for the notion of start mode (quickstart, standalone). It is focused on running cluster node in foreground with maximum configuration options. This is the command that would be used by the Pinot Docker image.
pinot --help
Usage:
pinot SERVICE [options]
Starts a Pinot service
Available Service:
zookeeper
broker
server
controller
Available Options:
--help, -h
--config.paths=<Pinot-config-file>
All Pinot properties may be passed as CLI arguments,environment variables or property files.
Refer to the Pinot reference documentation for a
complete list of all available properties.
This Start/Stop command refactoring is also an opportunity to support all Pinot properties to be as CLI arguments and Environment Variables rather than only config files. See PR #5608 which introduce the ability for Pinot to load property from independent configuration sources. Additional phase will be required to plug cli arguments and environments variable to Pinot configuration management.
This new pinot-local
is a developer focused wrapper around the pinot
CLI. It main difference is that it acts as a service manager that can run Pinot components in background. It is intended to be cross platform, and not tied to any OS specific service manager. By default, state would be persistent in a .pinot
folder in the home directory of the executing user.
$ pinot-local --help
Usage:
pinot-local [generic-args] COMMAND
Generic arguments:
--state-dir State directory of the Pinot instance
Defaults to `~/.pinot`
Switch between Pinot instances by changing directory
Available commands:
start
stop
logs
add-table
add-ingest-job
$ pinot-local start --help
Starts local Zookeeper and Pinot instances. State of these services will be tracked in the provided state-dir
Usage:
pinot-local [generic-args] start START_MODE
Generic arguments:
--state-dir State directory of the Pinot instance
Defaults to `~/.pinot`
Switch between Pinot instances by changing directory
Required arguments:
START_MODE 'quickstart' or 'standalone'
$ Pinot quickstart batch
$ pinot-local start quickstart --help
Starts local Zookeeper and Pinot instances with preloaded sample data. State of these services will be tracked in the provided state-dir
Usage:
pinot-local [generic-args] start quickstart MODE
Generic arguments:
--state-dir State directory of the Pinot instance
Defaults to `~/.pinot`
Switch between Pinot instances by changing directory
Available Modes:
batch
stream
hybrid
$ pinot-local start quickstart batch
$ pinot-local start standalone --help
Usage:
pinot-local [generic-args] start standalone [options]
Starts local Zookeeper and Pinot instances. State of these services will be tracked in the provided state-dir
Generic arguments:
--state-dir State directory of the Pinot instance
Defaults to `~/.pinot`
Switch between Pinot instances by changing directory
Available options:
--zookeeper-config-file Path to a zookeeper configuration file
--config-file Path to a Pinot configuration file for all services.
$ pinot-local standalone --zookeeper-config-file=zoo.cfg --config-file=pinot.properties
$ pinot-local stop --help
Usage:
pinot-local [generic-args] stop [options]
Stops the local Zookeeper and Pinot instances found in the provided state-dir.
Generic arguments:
--state-dir State directory of the Pinot instance
Defaults to `~/.pinot`
Switch between Pinot instances by changing directory
Available options:
--zookeeper-config-file Path to a zookeeper configuration file
--config-file Path to a Pinot configuration file for all services.
$ pinot-local standalone --zookeeper-config-file=zoo.cfg --config-file=pinot.properties
$ pinot-local logs --help
Usage:
pinot-local [generic-args] logs SERVICE [options]
Prints logs of the specified service
Generic arguments:
--state-dir State directory of the Pinot instance
Defaults to `~/.pinot`
Switch between Pinot instances by changing directory
Available Service:
zookeeper
broker
server
controller
Available options:
--follow, -f Follow log output
$ pinot-local add-table --help
Usage:
pinot-local [generic-args] add-table TABLE_CONFIG TABLE_SCHEMA
Creates a new Pinot table out of a table config and a table schema
Generic arguments:
--state-dir State directory of the Pinot instance
Defaults to `~/.pinot`
Switch between Pinot instances by changing directory
Required Arguments:
TABLE_CONFIG: Path to the table config json
TABLE_SCHEMA: Path to the table schema json
$ pinot-local add-ingest-job --help
Usage:
pinot-local [generic-args] add-ingest-job JOB_SPEC
Creates a new batch ingestion job
Generic arguments:
--state-dir State directory of the Pinot instance
Defaults to `~/.pinot`
Switch between Pinot instances by changing directory
Required Arguments:
JOB_SPEC: Path to the ingestion job json spec json
Users typically go through 3 phases
Today, each phase has a different set of scripts which makes it hard to write the user documentation.
We should refactor some of the commands to make this process simpler.