apache / pinot

Apache Pinot - A realtime distributed OLAP datastore
https://pinot.apache.org/
Apache License 2.0
5.4k stars 1.26k forks source link

Reduce ingestion latency during segment sealing #6121

Open yupeng9 opened 3 years ago

yupeng9 commented 3 years ago

We have some use cases that demand very fresh data. For example, display events aggregates of the last minute for decision making.

However, as shown in the diagram below, the ingestion latency spikes during segment sealing, which breaches the SLA image

The segment size is 5MB, so in this particular case, it seals every a couple of hours. We could try smaller segment sizes, but that may result in different issues like too many zookeeper nodes.

The solution to this could be more optimistic on new segment creation: the new segments start ingestion without waiting for the sealing. However, this can complicate the state machine, as the segment cleanup is more complicated when the segment commit fails.

Any other thoughts to mitigate this issue?

xiangfu0 commented 3 years ago

@npawar @mcvsubbu

kishoreg commented 3 years ago

This is something we saw in other users as well and we need to figure out a way to support this. However, you seem to experience this at a low segment size. Feels like the system is overloaded with something else.

yupeng9 commented 3 years ago

Correct. The problem exaggerates given the data volume is high.

image

kishoreg commented 3 years ago

something is not adding up, its 2k msgs/per sec but segments of size 5mb are getting flushed every 2 hours?

yupeng9 commented 3 years ago

There is some approximation. Each data point is the max value of the 1 min window. And each line maps to a partition.

mcvsubbu commented 3 years ago

Did you run the provisioning helper tool? Can you paste the output here? (See https://docs.pinot.apache.org/operators/operating-pinot/tuning/realtime)

That being said, there is always a risk of excessive GC load during segment completion. What we have seen is that all the memory that is accumulated during consumption is released when the segment is completed. Also, segment build takes up cpu, causing queries to get slower.

There is a config (on the server) to limit the number of segment builds per server. Default is no limit, but you can try to set it to some limit if you like.

Also, 5MB segment sizes seem too small to me, but of course, it depends on what kind of hosts you are provisioning. Running the prov helper and pasting the output here will guide us some.

yupeng9 commented 3 years ago

@mcvsubbu No, we have not done that yet. Can look into it But these are the JVM metrics of one server of this table. Note that this server also ingests many other tables. image

Do these charts look sane to you?

yupeng9 commented 3 years ago

Some more metrics image

image

This is for the past 24 hours

yupeng9 commented 3 years ago

hmm seems this tool is only v0.4 and later

mcvsubbu commented 3 years ago

The tool was there as of 0.1

You can download the latest version and run the tool (it is by itself as a command line). You don't need to be running the cluster at that version.

yupeng9 commented 3 years ago

I see. Thanks @mcvsubbu

This is the output of run on one sample segment

Memory used per host

numHosts --> 2        |4        |6        |8        |10       |12       |14       |16       |
numHours
 2 --------> 8.5G     |4.25G    |2.83G    |2.83G    |2.83G    |1.42G    |1.42G    |1.42G    |
 3 --------> 8.73G    |4.36G    |2.91G    |2.91G    |2.91G    |1.45G    |1.45G    |1.45G    |
 4 --------> 8.96G    |4.48G    |2.99G    |2.99G    |2.99G    |1.49G    |1.49G    |1.49G    |
 5 --------> 9.52G    |4.76G    |3.17G    |3.17G    |3.17G    |1.59G    |1.59G    |1.59G    |
 6 --------> 9.42G    |4.71G    |3.14G    |3.14G    |3.14G    |1.57G    |1.57G    |1.57G    |
 7 --------> 10.96G   |5.48G    |3.65G    |3.65G    |3.65G    |1.83G    |1.83G    |1.83G    |
 8 --------> 9.88G    |4.94G    |3.29G    |3.29G    |3.29G    |1.65G    |1.65G    |1.65G    |
 9 --------> 11.09G   |5.55G    |3.7G     |3.7G     |3.7G     |1.85G    |1.85G    |1.85G    |
10 --------> 12.3G    |6.15G    |4.1G     |4.1G     |4.1G     |2.05G    |2.05G    |2.05G    |
11 --------> 13.51G   |6.76G    |4.5G     |4.5G     |4.5G     |2.25G    |2.25G    |2.25G    |
12 --------> 10.8G    |5.4G     |3.6G     |3.6G     |3.6G     |1.8G     |1.8G     |1.8G     |

Optimal segment size

numHosts --> 2        |4        |6        |8        |10       |12       |14       |16       |
numHours
 2 --------> 111.67M  |111.67M  |111.67M  |111.67M  |111.67M  |111.67M  |111.67M  |111.67M  |
 3 --------> 167.5M   |167.5M   |167.5M   |167.5M   |167.5M   |167.5M   |167.5M   |167.5M   |
 4 --------> 223.34M  |223.34M  |223.34M  |223.34M  |223.34M  |223.34M  |223.34M  |223.34M  |
 5 --------> 279.17M  |279.17M  |279.17M  |279.17M  |279.17M  |279.17M  |279.17M  |279.17M  |
 6 --------> 335M     |335M     |335M     |335M     |335M     |335M     |335M     |335M     |
 7 --------> 390.84M  |390.84M  |390.84M  |390.84M  |390.84M  |390.84M  |390.84M  |390.84M  |
 8 --------> 446.67M  |446.67M  |446.67M  |446.67M  |446.67M  |446.67M  |446.67M  |446.67M  |
 9 --------> 502.51M  |502.51M  |502.51M  |502.51M  |502.51M  |502.51M  |502.51M  |502.51M  |
10 --------> 558.34M  |558.34M  |558.34M  |558.34M  |558.34M  |558.34M  |558.34M  |558.34M  |
11 --------> 614.18M  |614.18M  |614.18M  |614.18M  |614.18M  |614.18M  |614.18M  |614.18M  |
12 --------> 670.01M  |670.01M  |670.01M  |670.01M  |670.01M  |670.01M  |670.01M  |670.01M  |

Consuming memory

numHosts --> 2        |4        |6        |8        |10       |12       |14       |16       |
numHours
 2 --------> 1.3G     |666.02M  |444.01M  |444.01M  |444.01M  |222.01M  |222.01M  |222.01M  |
 3 --------> 1.86G    |951.41M  |634.27M  |634.27M  |634.27M  |317.14M  |317.14M  |317.14M  |
 4 --------> 2.42G    |1.21G    |824.53M  |824.53M  |824.53M  |412.27M  |412.27M  |412.27M  |
 5 --------> 2.97G    |1.49G    |1014.79M |1014.79M |1014.79M |507.39M  |507.39M  |507.39M  |
 6 --------> 3.53G    |1.77G    |1.18G    |1.18G    |1.18G    |602.52M  |602.52M  |602.52M  |
 7 --------> 4.09G    |2.04G    |1.36G    |1.36G    |1.36G    |697.65M  |697.65M  |697.65M  |
 8 --------> 4.65G    |2.32G    |1.55G    |1.55G    |1.55G    |792.78M  |792.78M  |792.78M  |
 9 --------> 5.2G     |2.6G     |1.73G    |1.73G    |1.73G    |887.91M  |887.91M  |887.91M  |
10 --------> 5.76G    |2.88G    |1.92G    |1.92G    |1.92G    |983.04M  |983.04M  |983.04M  |
11 --------> 6.32G    |3.16G    |2.11G    |2.11G    |2.11G    |1.05G    |1.05G    |1.05G    |
12 --------> 6.87G    |3.44G    |2.29G    |2.29G    |2.29G    |1.15G    |1.15G    |1.15G    |
yupeng9 commented 3 years ago

I used -periodSampleSegmentConsumed 2h, but not sure the best value for it

mcvsubbu commented 3 years ago

If you run the later version that will be good. It has bugs ironed out

yupeng9 commented 3 years ago

I tried 0.5, but it seems I did not run it correctly

Trying to close RealtimeSegmentImpl : rta_eats_canonical_rt_order_states__1__4457__20201008T2039Z

============================================================
RealtimeProvisioningHelper -tableConfigFile table -numPartitions 4 -pushFrequency null -numHosts 2,4,6,8,10,12,14,16 -numHours 2,3,4,5,6,7,8,9,10,11,12 -sampleCompletedSegmentDir rta_eats_canonical_rt_order_states__1__4457__20201008T2039Z -ingestionRate 1000 -maxUsableHostMemory 48G -retentionHours 336

Note:

* Retention hours taken from tableConfig
* See https://docs.pinot.apache.org/operators/operating-pinot/tuning/realtime

Memory used per host (Active/Mapped)

numHosts --> 2               |4               |6               |8               |10              |12              |14              |16              |
numHours
 2 --------> NA              |NA              |NA              |NA              |NA              |NA              |NA              |NA              |
 3 --------> NA              |NA              |NA              |NA              |NA              |NA              |NA              |NA              |
 4 --------> NA              |NA              |NA              |NA              |NA              |NA              |NA              |NA              |
 5 --------> NA              |NA              |NA              |NA              |NA              |NA              |NA              |NA              |
 6 --------> NA              |NA              |NA              |NA              |NA              |NA              |NA              |NA              |
 7 --------> NA              |NA              |NA              |NA              |NA              |NA              |NA              |NA              |
 8 --------> NA              |NA              |NA              |NA              |NA              |NA              |NA              |NA              |
 9 --------> NA              |NA              |NA              |NA              |NA              |NA              |NA              |NA              |
10 --------> NA              |NA              |NA              |NA              |NA              |NA              |NA              |NA              |
11 --------> NA              |NA              |NA              |NA              |NA              |NA              |NA              |NA              |
12 --------> NA              |NA              |NA              |NA              |NA              |NA              |NA              |NA              |

Optimal segment size

numHosts --> 2               |4               |6               |8               |10              |12              |14              |16              |
numHours
 2 --------> NA              |NA              |NA              |NA              |NA              |NA              |NA              |NA              |
 3 --------> NA              |NA              |NA              |NA              |NA              |NA              |NA              |NA              |
 4 --------> NA              |NA              |NA              |NA              |NA              |NA              |NA              |NA              |
 5 --------> NA              |NA              |NA              |NA              |NA              |NA              |NA              |NA              |
 6 --------> NA              |NA              |NA              |NA              |NA              |NA              |NA              |NA              |
 7 --------> NA              |NA              |NA              |NA              |NA              |NA              |NA              |NA              |
 8 --------> NA              |NA              |NA              |NA              |NA              |NA              |NA              |NA              |
 9 --------> NA              |NA              |NA              |NA              |NA              |NA              |NA              |NA              |
10 --------> NA              |NA              |NA              |NA              |NA              |NA              |NA              |NA              |
11 --------> NA              |NA              |NA              |NA              |NA              |NA              |NA              |NA              |
12 --------> NA              |NA              |NA              |NA              |NA              |NA              |NA              |NA              |

Consuming memory

numHosts --> 2               |4               |6               |8               |10              |12              |14              |16              |
numHours
 2 --------> NA              |NA              |NA              |NA              |NA              |NA              |NA              |NA              |
 3 --------> NA              |NA              |NA              |NA              |NA              |NA              |NA              |NA              |
 4 --------> NA              |NA              |NA              |NA              |NA              |NA              |NA              |NA              |
 5 --------> NA              |NA              |NA              |NA              |NA              |NA              |NA              |NA              |
 6 --------> NA              |NA              |NA              |NA              |NA              |NA              |NA              |NA              |
 7 --------> NA              |NA              |NA              |NA              |NA              |NA              |NA              |NA              |
 8 --------> NA              |NA              |NA              |NA              |NA              |NA              |NA              |NA              |
 9 --------> NA              |NA              |NA              |NA              |NA              |NA              |NA              |NA              |
10 --------> NA              |NA              |NA              |NA              |NA              |NA              |NA              |NA              |
11 --------> NA              |NA              |NA              |NA              |NA              |NA              |NA              |NA              |
12 --------> NA              |NA              |NA              |NA              |NA              |NA              |NA              |NA              |

Total number of segments queried per host (for all partitions)

numHosts --> 2               |4               |6               |8               |10              |12              |14              |16              |
numHours
 2 --------> NA              |NA              |NA              |NA              |NA              |NA              |NA              |NA              |
 3 --------> NA              |NA              |NA              |NA              |NA              |NA              |NA              |NA              |
 4 --------> NA              |NA              |NA              |NA              |NA              |NA              |NA              |NA              |
 5 --------> NA              |NA              |NA              |NA              |NA              |NA              |NA              |NA              |
 6 --------> NA              |NA              |NA              |NA              |NA              |NA              |NA              |NA              |
 7 --------> NA              |NA              |NA              |NA              |NA              |NA              |NA              |NA              |
 8 --------> NA              |NA              |NA              |NA              |NA              |NA              |NA              |NA              |
 9 --------> NA              |NA              |NA              |NA              |NA              |NA              |NA              |NA              |
10 --------> NA              |NA              |NA              |NA              |NA              |NA              |NA              |NA              |
11 --------> NA              |NA              |NA              |NA              |NA              |NA              |NA              |NA              |
12 --------> NA              |NA              |NA              |NA              |NA              |NA              |NA              |NA              |
mcvsubbu commented 3 years ago

This is the correct output. This means that your segment does not fit into any of these buckets with the kind of memory (48GB usable)and ingestion rate (1000 events per second per partition) that you have with your retention of 336 hours. i would suggest to use higher number of hosts and see where things are.

yupeng9 commented 3 years ago

I see. Why shall we add more hosts? Is it because we have too long retention, and therefore need to distribute the real-time segments to more hosts?

cc @chenboat

mcvsubbu commented 3 years ago

You can start by just running the command with either more hosts or more memory and let us see what it outputs

yupeng9 commented 3 years ago
RealtimeProvisioningHelper -tableConfigFile table -numPartitions 4 -pushFrequency null -numHosts 16,32,64 -numHours 2,3,4,5,6,7,8,9,10,11,12 -sampleCompletedSegmentDir rta_eats_canonical_rt_order_states__1__4457__20201008T2039Z -ingestionRate 1000 -maxUsableHostMemory 192G -retentionHours 336

Note:

* Table retention and push frequency ignored for determining retentionHours since it is specified in command
* See https://docs.pinot.apache.org/operators/operating-pinot/tuning/realtime

Memory used per host (Active/Mapped)

numHosts --> 16              |32              |64              |
numHours
 2 --------> 106G/106G       |106G/106G       |106G/106G       |
 3 --------> 106.22G/106.22G |106.22G/106.22G |106.22G/106.22G |
 4 --------> 106.45G/106.45G |106.45G/106.45G |106.45G/106.45G |
 5 --------> 107.92G/107.92G |107.92G/107.92G |107.92G/107.92G |
 6 --------> 106.89G/106.89G |106.89G/106.89G |106.89G/106.89G |
 7 --------> 107.11G/107.11G |107.11G/107.11G |107.11G/107.11G |
 8 --------> 107.33G/107.33G |107.33G/107.33G |107.33G/107.33G |
 9 --------> 109.43G/109.43G |109.43G/109.43G |109.43G/109.43G |
10 --------> 109.03G/109.03G |109.03G/109.03G |109.03G/109.03G |
11 --------> 109.56G/109.56G |109.56G/109.56G |109.56G/109.56G |
12 --------> 108.21G/108.21G |108.21G/108.21G |108.21G/108.21G |

Optimal segment size

numHosts --> 16              |32              |64              |
numHours
 2 --------> 643.23M         |643.23M         |643.23M         |
 3 --------> 964.85M         |964.85M         |964.85M         |
 4 --------> 1.26G           |1.26G           |1.26G           |
 5 --------> 1.57G           |1.57G           |1.57G           |
 6 --------> 1.88G           |1.88G           |1.88G           |
 7 --------> 2.2G            |2.2G            |2.2G            |
 8 --------> 2.51G           |2.51G           |2.51G           |
 9 --------> 2.83G           |2.83G           |2.83G           |
10 --------> 3.14G           |3.14G           |3.14G           |
11 --------> 3.45G           |3.45G           |3.45G           |
12 --------> 3.77G           |3.77G           |3.77G           |

Consuming memory

numHosts --> 16              |32              |64              |
numHours
 2 --------> 1.1G            |1.1G            |1.1G            |
 3 --------> 1.64G           |1.64G           |1.64G           |
 4 --------> 2.17G           |2.17G           |2.17G           |
 5 --------> 2.71G           |2.71G           |2.71G           |
 6 --------> 3.24G           |3.24G           |3.24G           |
 7 --------> 3.78G           |3.78G           |3.78G           |
 8 --------> 4.31G           |4.31G           |4.31G           |
 9 --------> 4.85G           |4.85G           |4.85G           |
10 --------> 5.38G           |5.38G           |5.38G           |
11 --------> 5.92G           |5.92G           |5.92G           |
12 --------> 6.45G           |6.45G           |6.45G           |

Total number of segments queried per host (for all partitions)

numHosts --> 16              |32              |64              |
numHours
 2 --------> 168             |168             |168             |
 3 --------> 112             |112             |112             |
 4 --------> 84              |84              |84              |
 5 --------> 68              |68              |68              |
 6 --------> 56              |56              |56              |
 7 --------> 48              |48              |48              |
 8 --------> 42              |42              |42              |
 9 --------> 38              |38              |38              |
10 --------> 34              |34              |34              |
11 --------> 31              |31              |31              |
12 --------> 28              |28              |28              |

Curious, how much memory do you allocate for Pinot server (i.e. -xmx) at Linkedin? And whats the memory size of your machine.

mcvsubbu commented 3 years ago

Pinot uses heap memory only for query execution (and then some one off operations like building indexes during load, etc.) In hosts that have consuming segments, heap memory is also use fo inverted indices, and segment build upon completion. Inverted indices for the consuming segment may expand as rows are ingested, and released in one go when the consuming segment completes. We have seen gc overload at that time.

Everything else uses offheap memory -- either mmapped or direct.

At linkedin, we typically have xmx of 16G, and use mmap for segments, and realime consumers.

Coming back to your use case, how many replicas do you have? (we need to output that as well, on the command line. I will fix that).

What was the value of xmx you provided?

yupeng9 commented 3 years ago

Thanks @mcvsubbu for the info.

We use 48G for xmx in production. I recall we increased it from 16G a while ago to alleviate some GC pressure.

This table has 3 replicas, and it's in a tenant of 8 servers.

mcvsubbu commented 3 years ago

Follow up discussion in slack