apache / pinot

Apache Pinot - A realtime distributed OLAP datastore
https://pinot.apache.org/
Apache License 2.0
5.53k stars 1.3k forks source link

Refactor start/stop commands #5580

Open kishoreg opened 4 years ago

kishoreg commented 4 years ago

Users typically go through 3 phases

Today, each phase has a different set of scripts which makes it hard to write the user documentation.

We should refactor some of the commands to make this process simpler.

kishoreg commented 4 years ago

Proposal

Quickstart

Standalone

clustered mode

Ingest data

Irrespective of quickstart, standalone or clustered mode, the steps to ingest data will be the same. Create Table

bin/pinot-admin.sh add-table \
        -table-config conf/table.json \
        -table-schema conf/schema.json

Batch Ingestion

bin/pinot-ingestion-job.sh \
   -job-spec conf/batch_ingestion_spec.json 

This will help us restrict the documentation to

Thoughts?

daniellavoie commented 4 years ago

I like the idea of providing a flexible, simplified and unified startup experience regardless of the phase type.

I have a simple additional requests which might be worth a dedicated topic on its own. Yet I'll mention it here because it might influence the design.

A <service>.config file will indeed simplify the configuration management for documentation and code maintenance. Right now, args needs to be declared in the CLI options class as well as in a Commons Configuration properties class. I would like to suggest that any properties loaded from <service>.config could be also passed as environment variable and CLI argument.

Config files are great for packaging config profiles that you ship with your deployment but are not friendly to Cloud Native deployments. For Docker and Kubernetes, they would require declaring and mounting volumes when a simple ENV VAR or CLI arg would have done the job. Pinot uses Apache Commons Configuration which supports Composite configuration properties classes that allows to define a hierarchy of sources for configuration. Done right, I think there is an importunity to support pluggable configurations provider (could even support config providers such as Vault or cloud KSM for instance).

In short, providing support and quick start examples based on config file is a good choice, but power users and clusters operators would certainly benefits from env var and cli args support since they would enable cleaner and simplified deployment manifest without requiring volumes mounts.

The Spring project implements an Environment abstraction with prioritized config sources decoupled from the application code which should be a good inspiration to follow. Even Quarkus has that concept.

kishoreg commented 4 years ago

@daniellavoie Great points and it makes sense given we do have to consider Kubernetes/Docker

Let's try to summarize the different things we want to support

we can obviously enhance start-pinot-service to support other options in future.

satishbellapu commented 4 years ago

If you are designing CLI here are the few suggestions,

Command should be #pinot in the terminal,

Usage:
    pinot [command] [options]

Available Commands:
    start
    stop

Available Options:
    batch (quickstart-batch)
    stream (quickstart-stream)
    hybrid (quickstart-hybrid)
    zookeeper(standalone-zookeeper or clustered-zookeeper)
    services(standalone-services)
    controller(clustered-service)
    broker(clustered-service)
    server(clustered-service)

Flags:
    -h, --help
    -v, --verbose etc.
export PINOT_CONF=/some/location/<all-in-one.config> or <zoo.cfg> or <controller.config> or <broker.config> or <server.config>
export PINOT_MODE=quickstart / standalone / clustered

Use "pinot [command] [options] --help" for more information about a command.

Instead of passing the configurations to the commands we can export the configuration folder so based on the options choice, it can select expected configurations, this will be more flexible in the future for expanding to other options.

kishoreg commented 4 years ago

@satishbellapu, thanks. This is definitely better than what I suggested. @daniellavoie what do you think?

daniellavoie commented 4 years ago

Definitely an upgrade! This is a lot more consistent to the experience you get in the go / node cli world.

Foreground or background

I feel the CLI should start the service in foreground by default unless specified otherwise. As such, only the stop command should only be accepted when a command like start-background was used. Starting in foreground sounds more natural for Docker and Kubernetes deployments. Also, if you are starting the CLI as a developper, you get immediate output in your terminal without looking for log files anywhere.

Feeding properties from multiple sources.

As per my previous comment, I'm working on supporting an abstraction that decouples the property sources from the code base. This is a heavy refactoring that introduces a wrapper around the org.apache.commons.configuration.Configuration used all over Pinot. This wrapper introduces a concept of CompositeConfiguration that can accept variables from CLI args, environment variables and property files specified by config.paths. This new wrapper also introduces a concept of relaxed binding. What it means is that it knows how to translate environment variables such as PINOT_CONTROLER_PORT to properties controller.port. When properties are inserted and fetched from the environment, they go through relaxing routine that standardize their key so that lookup such as config.getPropert("controller.port") return the value even if it was defined by the PINOT_CONTROLER_PORT env variable . As I exposed in my previous comment, this is intended to offer a transparent mechanism to drive Pinot configuration without requiring config files that are hard to reference in a deployment manifest without volumes.

Passing Pinot properties to the CLI.

Given the new CLI. I would think it would be nice if any property loaded in org.apache.commons.configuration.Configuration could be passed as argument in the startup command, e.g.:

$ pinot start controller --help

Usage:
    pinot start controller [options]

Available Options:
    --help, -h
    --debug
    --controler.port=<port>
    --controler.host=<hostname>
    --cluster-name=<cluster-name>
    --[etc]
    ...

   Refer to Pinot Documentation at https://docs.pinot.apache.org/basics/components/controller#starting-a-controller for full reference of all available properties for the Controller.

Now, if we want to allow any Pinot config properties to be passed through a pinot cli, I feel we might have to drop args4j since it requires any CLI argument to be explicitly annotated in a SubCommand object. I think overall, Pinot has more than 100 properties used all over the place and does not feel like it would be maintainable over time to require manual binding between a SubCommand class and the org.apache.commons.configuration.Configuration class. Rewriting a custom CLI with his own main startup class would grant access to the main args and let us parse all properties and inject them in the new CompositeConfiguration to merge them with the env variables and explicit config files.

Last but not least, it would also remove the property validation made by the args4j annotation properties. With this mechanism, we have inconsistent validation between what is passed by the CLI and what is passed as config file properties. This is an opportunity to centralize validation regardless of the property source.

satishbellapu commented 4 years ago

@daniellavoie few comments,

Foreground or background :

I feel Foreground option will not be much useful unless for the quick start or someone trying the system for the first timers, because you have to close the services and relaunch in background for long running the system, instead of that we can support only one mode which is background and add the following command for debugging or exploration purpose,

pinot start [options] //always launch the services background
pinot log [options] //ex., pinot log zookeeper 

Should tail the zookeeper logs for debug purpose and also not required to close the session and relaunch the for continues running, CLI will not be restricted for development only can be used for production as well.

After rechecking @kishoreg requirements for ingest, we can add load command to the CLI,

Ex.. pinot load [load.config]

 load-file.config
 //content
 table.name : [table-name] 
 data-to-load : [pathc/batch_ingestion_spec.json]

This way the tool can be used for multipurpose and also can launch commands one after the other without killing the CLI and that will be so helpful for having auto launcher scripts by support teams/devOps etc.

Available Commands:
    start
    stop
    log
    load

Passing Pinot properties to the CLI.

I feel adding options like --controler.port=<port> or --cluster-name=<cluster-name> to the CLI will be bad design and can’t be scalable, we should restrict the options at service level like,

Options:
     zookeeper 
     broker 
     controller etc

And the options like brokers ulr, hosts or ports etc should be part of the properties/config file, that way you can still parse by args4j annotation properties and validate before passing to the systems, and at CLI level you only need to validate on predefine commands and options.

daniellavoie commented 4 years ago

Foreground or background :

I feel Foreground option will not be much useful unless for the quick start or someone trying the system for the first timers, because you have to close the services and relaunch in background for long running the system, instead of that we can support only one mode which is background and add the following command for debugging or exploration purpose,

I have to disagree here. We should pause this conversation and take a step back and set ourselves goals. One of the big theme that I would like to see improvement is to make Pinot more friendly to Cloud Native deployments. If you want to run something in a container based environnement, it needs to run in foreground to fully embraces all the container lifecycle management offered by the container orchestrator. Logging management is also simplified because your container orchestrator only needs to retreive the stdout and stderr to forward to a logging platform. As for configuration management; environment variables and cli arguments composes the best practices to operate Cloud Native services.

I do agree that there is use cases to run Pinot a a traditional Service Daemon. Once you have a well designed foreground service launcher, it is really easy to wrap it with a service package (such as Systemd) that turns it in a background process. Let me be clear, we have to offer an option for operators to run Pinot in foreground. If our base operating model is background, then we have to retrofit foreground semantics and it will an awkward exercice. It a lot easier to turn a foreground process to background.

I feel adding options like --controler.port= or --cluster-name= to the CLI will be bad design and can’t be scalable, we should restrict the options at service level like. And the options like brokers ulr, hosts or ports etc should be part of the properties/config file, that way you can still parse by args4j annotation properties and validate before passing to the systems, and at CLI level you only need to validate on predefine commands and options.

As I explained earlier, property files are not what best practices guide to operate Cloud Native software. Works great for local dev environment but is not what is recommended for operating software on Cloud platforms. This is not bad design. Modern cloud native software all adopt this pattern. I already have a prototype of Pinot that uses this configuration pattern in a scalable way. Maintaining specific binding through args4j and manually wiring them the system configuration is, in my opinion, not scalable. Pinot should offer the best "ready to go" default properties and let you overwrite them the way that pleases you.

@satishbellapu, my understanding is that you wish to offer a better "Pinot in a box" experience for developers who wish to iterate quickly on their environment. I think there is a lot of value there. My conclusion is that maybe we should have separate story to align when these two different goals. I honestly think they should be materialized in two separate tools. The operator focused pinot cli that aims to offer maximum flexibility for operators to run Pinot Services with freedom of choice regarding their target platform. And a pinot-dev cli that downloads, install dependencies and runs Pinot Services in an opinionated way to get productive. Under the hood, it would make a lot of sense for it to run pinot with the "lower level" pinot CLI. What do you think of that suggestion?

satishbellapu commented 4 years ago

We should pause this conversation and take a step back and set ourselves goals.

Yes I agree, I was recommending the CLI, for non-container based deployments (not for docker/kube/gcp/az etc.), I was trying to improve the initial proposals from @kishoreg they are more specific requirements towards the local cluster than cloud.

One of the big theme that I would like to see improvement is to make Pinot more friendly to Cloud Native deployments.

I am not sure you really want to have a wrappers on top of kubctl/docker/aws/gcp etc I see good documentation with cloud native deployments from pinot, and more specifically users will be more familiar with the kubctl and eksctl cmds. (sorry for my ignorance if this is totally not making sense, understandable because I am new to the pinot and not sure of the usability issues.)

my understanding is that you wish to offer a better "Pinot in a box" experience for developers who wish to iterate quickly on their environment. I think there is a lot of value there. My conclusion is that maybe we should have separate story to align when these two different goals

Yes totally agree, container based deployment vs non container base deployments should have different CLI’s.

kishoreg commented 4 years ago

There are a lot of good ideas in this thread. @daniellavoie, can you please summarize and edit the description. We can tag the requirements in as phase1, phase 2 etc

daniellavoie commented 4 years ago

Refactor start/stop commands

Table of Content

Motivation

Users typically go through 3 phases.

Today, each phase has a different set of scripts which makes it hard to write the user documentation.

We should refactor some of the commands to make this process simpler.

Goals

Anti-Goals

Current state

The current set of scripts are not articulated around personas. You currently need to pick a script based on how you want to run Pinot. How you intended to run Pinot (quickstart, standalone and cluster) does not change how you expect the service behaves.

A good example of an opinionated behavior based on your persona is if Pinot should run in background or foreground. Developers would prefer a "background service oriented" mode where best practice for operations would be to run as foreground. When running in foreground, you can wrap the process with any service manager or container orchestrator that is standard to the operator's organization.

Proposal

Step 1 - A unified startup command for operators

This new pinot command is basically a rename of the existing pinot-admin.sh only that it stripped for the notion of start mode (quickstart, standalone). It is focused on running cluster node in foreground with maximum configuration options. This is the command that would be used by the Pinot Docker image.

Command

pinot --help

Usage:
    pinot SERVICE [options]

Starts a Pinot service

Available Service:
  zookeeper
  broker
  server
  controller

Available Options:
    --help, -h
    --config.paths=<Pinot-config-file>

    All Pinot properties may be passed as CLI arguments,environment variables or property files.
    Refer to the Pinot reference documentation for a 
    complete list of all available properties.

Parallel initiative - Configuration management refactoring

This Start/Stop command refactoring is also an opportunity to support all Pinot properties to be as CLI arguments and Environment Variables rather than only config files. See PR #5608 which introduce the ability for Pinot to load property from independent configuration sources. Additional phase will be required to plug cli arguments and environments variable to Pinot configuration management.

Step 2 - Implement a developer focused CLI

This new pinot-local is a developer focused wrapper around the pinot CLI. It main difference is that it acts as a service manager that can run Pinot components in background. It is intended to be cross platform, and not tied to any OS specific service manager. By default, state would be persistent in a .pinot folder in the home directory of the executing user.

Features

Optional Features

Commands

Help

$ pinot-local --help

Usage:
    pinot-local [generic-args] COMMAND

Generic arguments:
  --state-dir   State directory of the Pinot instance
                Defaults to `~/.pinot`
                Switch between Pinot instances by changing directory

Available commands:
  start
  stop
  logs
  add-table
  add-ingest-job

Start

$ pinot-local start --help

Starts local Zookeeper and Pinot instances. State of these services will be tracked in the provided state-dir

Usage:
    pinot-local [generic-args] start START_MODE

Generic arguments:
  --state-dir   State directory of the Pinot instance
                Defaults to `~/.pinot`
                Switch between Pinot instances by changing directory

Required arguments:
    START_MODE  'quickstart' or 'standalone'

$ Pinot quickstart batch
Quickstart
$ pinot-local start quickstart --help

Starts local Zookeeper and Pinot instances with preloaded sample data. State of these services will be tracked in the provided state-dir

Usage:
    pinot-local [generic-args] start quickstart MODE

Generic arguments:
  --state-dir   State directory of the Pinot instance
                Defaults to `~/.pinot`
                Switch between Pinot instances by changing directory

Available Modes:
    batch
    stream
    hybrid

$ pinot-local start quickstart batch

Standalone

$ pinot-local start standalone --help

Usage:
    pinot-local [generic-args] start standalone [options]

Starts local Zookeeper and Pinot instances. State of these services will be tracked in the provided state-dir

Generic arguments:
  --state-dir                 State directory of the Pinot instance
                              Defaults to `~/.pinot`
                              Switch between Pinot instances by changing directory

Available options:
    --zookeeper-config-file   Path to a zookeeper configuration file
    --config-file             Path to a Pinot configuration file for all services.

$ pinot-local standalone --zookeeper-config-file=zoo.cfg --config-file=pinot.properties

Stop

$ pinot-local stop --help

Usage:
    pinot-local [generic-args] stop [options]

Stops the local Zookeeper and Pinot instances found in the  provided state-dir.

Generic arguments:
  --state-dir                 State directory of the Pinot instance
                              Defaults to `~/.pinot`
                              Switch between Pinot instances by changing directory

Available options:
    --zookeeper-config-file   Path to a zookeeper configuration file
    --config-file             Path to a Pinot configuration file for all services.

$ pinot-local standalone --zookeeper-config-file=zoo.cfg --config-file=pinot.properties

Logs

$ pinot-local logs --help

Usage:
    pinot-local [generic-args] logs SERVICE [options]

Prints logs of the specified service

Generic arguments:
  --state-dir                 State directory of the Pinot instance
                              Defaults to `~/.pinot`
                              Switch between Pinot instances by changing directory

Available Service:
  zookeeper
  broker
  server
  controller

Available options:
    --follow, -f              Follow log output

Add table

$ pinot-local add-table --help

Usage:
    pinot-local [generic-args] add-table TABLE_CONFIG TABLE_SCHEMA

Creates a new Pinot table out of a table config and a table schema

Generic arguments:
  --state-dir       State directory of the Pinot instance
                    Defaults to `~/.pinot`
                    Switch between Pinot instances by changing directory

Required Arguments:
  TABLE_CONFIG:     Path to the table config json
  TABLE_SCHEMA:     Path to the table schema json

Add an ingestion Job

$ pinot-local add-ingest-job --help

Usage:
    pinot-local [generic-args] add-ingest-job JOB_SPEC

Creates a new batch ingestion job

Generic arguments:
  --state-dir       State directory of the Pinot instance
                    Defaults to `~/.pinot`
                    Switch between Pinot instances by changing directory

Required Arguments:
  JOB_SPEC:         Path to the ingestion job json spec json