+1 on what @mcvsubbu stated above. After going through and using the tool myself, here are some additional thoughts.
If the table config contains the stream configs, then the tool should be able to use them to start up a pinot-server instance and consume. (Perhaps this is harder than I estimate, though.)
One caveat here is that traffic on the stream might vary, so starting consumption from the highest offset (i.e. consuming events only as they become available) might not give reliable results. If we run the tool with this "live consumption" feature during a period when the event stream has abnormally low traffic, the estimation won't be representative.
If we could consume from the smallest offset (i.e. consume historical data), we could see how the table would perform over a longer sample period and gather more data. Additionally, this would allow the tool to be run right away and process historical data, rather than having it consume for a few hours to accumulate enough "new" events (as it would when consuming from the largest offset).
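For reference, the offset behaviour in question is already expressible in the `streamConfigs` section of a realtime table config, which is also what the tool could reuse to start its own consumer. A minimal sketch (topic, broker, and decoder values are placeholders, not from this issue):

```json
{
  "tableIndexConfig": {
    "streamConfigs": {
      "streamType": "kafka",
      "stream.kafka.topic.name": "myTopic",
      "stream.kafka.broker.list": "localhost:9092",
      "stream.kafka.consumer.type": "lowlevel",
      "stream.kafka.consumer.factory.class.name": "org.apache.pinot.plugin.stream.kafka20.KafkaConsumerFactory",
      "stream.kafka.decoder.class.name": "org.apache.pinot.plugin.stream.kafka.KafkaJSONMessageDecoder",
      "stream.kafka.consumer.prop.auto.offset.reset": "smallest"
    }
  }
}
```

Flipping `auto.offset.reset` between `smallest` and `largest` is what distinguishes the historical-backfill style of consumption from the live-only one described above.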
We could create the table in pinot-server using "off-the-shelf" default options. Every so often (every few hours, at segment close, etc.) pinot-server could analyze itself and output a matrix similar to what this tool outputs, either in the current logs or in a new dedicated log. Operationally, this would allow us to create a table, then come back after a day or two, check the logs, and have a recommendation waiting for us.
This could also open the door to expanding the auto-tuning functionality. Pinot knows how many instances are present and could auto-tune for that instance count. If it could additionally output a metric indicating that we are not in an "optimal zone" for that number of segments, we could act on that metric by auto-scaling the number of instances up or down; after scaling, the segment sizing could auto-tune again to adapt to the new instance count (perhaps gated behind a cluster-level config).
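Very roughly, the periodic self-analysis could look something like the sketch below. Every class and method name here is hypothetical (none of this exists in Pinot today); it is only meant to illustrate the shape of the idea.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Hypothetical periodic task: every few hours the server re-runs the same kind
// of sizing estimation the provisioning tool does, but against the segments it
// is actually hosting, and writes the resulting recommendation to a log.
public class ProvisioningSelfAnalysisTask {
  private final ScheduledExecutorService scheduler =
      Executors.newSingleThreadScheduledExecutor();

  public void start() {
    // Run shortly after table creation, then every 4 hours (or at segment close).
    scheduler.scheduleAtFixedRate(this::analyze, 1, 4, TimeUnit.HOURS);
  }

  private void analyze() {
    // Placeholder: gather per-segment row counts, consuming/completed memory
    // usage, and the ingestion rate observed since the last run, then emit a
    // matrix similar to the tool's current output (and/or a metric that an
    // auto-scaler could act on).
    System.out.println("provisioning recommendation matrix would be written here");
  }

  public static void main(String[] args) {
    new ProvisioningSelfAnalysisTask().start();
  }
}
```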
Using an offline segment for estimation was kind of clunky: specify the segment, then figure out the number of rows in the segment, divide by the event rate, and pass the result of that division as the time period for the segment (see the sketch below).
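For illustration, this is the back-of-the-envelope math the user currently has to do by hand (numbers and variable names are made up for the example, not taken from the tool):

```java
// Derive the time period an offline segment covers from its row count and the
// expected realtime event rate, so it can be passed to the tool manually.
public class SegmentPeriodEstimate {
  public static void main(String[] args) {
    long rowsInSegment = 36_000_000L;  // e.g. read from the segment metadata
    long eventsPerHour = 1_500_000L;   // average event rate of the stream
    long hoursCovered = Math.round((double) rowsInSegment / eventsPerHour);
    // This is the value that then has to be passed as the segment's time
    // period; letting the tool accept the event rate directly removes this step.
    System.out.println("segment covers roughly " + hoursCovered + " hours");  // 24
  }
}
```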
We could let the user explicitly indicate that an offline segment is being passed, and also allow the "Average Event Rate" to be passed as an argument. That would simplify the manual steps for the user (reducing the level of understanding, and the miscellaneous math, needed to figure out what the arguments should be).
Ideally I shouldn't need to know much about how the tool works; I'd just pass a few trivial arguments (e.g. here's a segment and X) and it would interpolate the data it needs from there.
More comments after using the recent version:
RealtimeProvisioningHelperCommand