apache / pinot

Apache Pinot - A realtime distributed OLAP datastore
https://pinot.apache.org/
Apache License 2.0

Documentation/tool improvements #5588

Open mcvsubbu opened 4 years ago

mcvsubbu commented 4 years ago
jgutmann commented 4 years ago

+1 on what @mcvsubbu stated above. After going through and using the tool myself, here are some additional thoughts:

  1. It might be interesting if we could point the tool at a stream and have it consume a segment directly from that stream.

If the table config contains the stream configs, the tool should be able to use them to start up a pinot-server instance and consume. (Perhaps this is harder than I estimate, though.)

One caveat here is that traffic on the stream might vary, so consuming starting from the highest offset (i.e., consuming only as data becomes available) might not give reliable results. If we ran the tool with this "live consumption" feature during a period when the event stream has abnormally low traffic, the estimate would not be representative.

If we could consume from the smallest offset (i.e., consume historical data), we could observe behavior over a longer sample period and gather more data. Additionally, this would let the tool be run immediately against historical data, rather than consuming for a few hours to accumulate enough "new" events (as it would when starting from the largest offset).
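For Kafka-backed tables, the starting offset is already controlled by the table's streamConfigs, so a fragment like the following (illustrative topic name and values, not from a real table) is roughly what the tool could read to decide between historical ("smallest") and live ("largest") consumption:

```json
{
  "streamConfigs": {
    "streamType": "kafka",
    "stream.kafka.topic.name": "myTopic",
    "stream.kafka.consumer.prop.auto.offset.reset": "smallest"
  }
}
```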

  2. What if we could build some kind of recommendation engine into pinot-server itself?

We could create the table in pinot-server using "off-the-shelf" default options. Every so often (every few hours, at segment close, etc.), pinot-server could analyze itself and output a matrix similar to what this tool produces, either in the existing logs or in a dedicated log. Operationally, this would allow us to create a table, then come back after a day or two, check the logs, and have a recommendation waiting for us.

This could also open the door to expanding the auto-tuning functionality. Pinot knows how many instances are present and could auto-tune for that instance count. If it emitted a metric indicating we are outside the "optimal zone" for that number of segments, we could act on that metric by auto-scaling the number of instances up or down; after scaling, segment sizing could re-tune itself for the new instance count (perhaps gated behind a cluster-level config).
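A purely hypothetical sketch of the decision step proposed above; none of these names or thresholds exist in Pinot today. It only illustrates the idea of periodically comparing observed segment sizing against an assumed "optimal zone" and emitting an action that an autoscaler could act on:

```python
# Hypothetical "optimal zone" bounds for average segment size (illustrative).
OPTIMAL_MIN_MB = 150
OPTIMAL_MAX_MB = 500

def sizing_recommendation(avg_segment_size_mb: float, num_instances: int) -> dict:
    """Return a recommendation record for the current instance count.

    Segments smaller than the zone suggest fewer instances (larger segments);
    segments larger than the zone suggest more instances (smaller segments).
    """
    if avg_segment_size_mb < OPTIMAL_MIN_MB:
        action = "scale-down"
    elif avg_segment_size_mb > OPTIMAL_MAX_MB:
        action = "scale-up"
    else:
        action = "none"
    return {
        "instances": num_instances,
        "avgSegmentMB": avg_segment_size_mb,
        "action": action,
    }

print(sizing_recommendation(80.0, 4)["action"])  # scale-down
```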

  3. Give better command line options to make the tool more intuitive.

When using an offline segment for estimation, the workflow was clunky: specify the segment, figure out the number of rows in the segment, divide by the event rate, then pass the result of that division as the time period for the segment.
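The manual arithmetic described above can be sketched as follows (the function name and the numbers are illustrative, not taken from the tool or a real segment):

```python
def segment_time_period_hours(num_rows: int, events_per_second: float) -> float:
    """Hours of traffic that an offline segment of `num_rows` rows represents.

    This is the division the user currently has to do by hand before
    passing the result to the tool as the segment's time period.
    """
    seconds = num_rows / events_per_second
    return seconds / 3600.0

# Example: a 36M-row segment at an average rate of 5000 events/sec
print(segment_time_period_hours(36_000_000, 5000.0))  # 2.0
```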

We could let the user specify that they are passing an offline segment, and also accept the "Average Event Rate" as an argument; that would simplify the manual steps for the user (reducing both the level of understanding needed and the arithmetic required to figure out the arguments).

Ideally I shouldn't need to know much about how the tool works; I should just pass some trivial arguments (e.g., here's a segment and X) and it can derive the data it needs from there.

mcvsubbu commented 4 years ago

More comments after using recent version:

  1. Change the documentation so it does not say RealtimeProvisioningHelperCommand
  2. Add a link to doc in the help message
  3. Clarify whether the number of segments searched is per partition or per host.