apache / pinot

Apache Pinot - A realtime distributed OLAP datastore
https://pinot.apache.org/
Apache License 2.0

Support Lazy loading of Offline Segments #6187

Open · noahprince22 opened this issue 4 years ago

noahprince22 commented 4 years ago

Some discussion has already happened here: https://apache-pinot.slack.com/archives/CDRCA57FC/p1603720037246100

This would involve adding a lazy mode to the Pinot server that lazily pulls segments as they are requested, backed by an LRU cache. It should only take some modification to the SegmentDataManager and maybe the table manager.
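Roughly the shape I have in mind (illustrative only; `LazySegmentCache` and the loader callback are hypothetical, not existing Pinot classes):

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Function;

// Hypothetical sketch, not actual Pinot code: an access-ordered LinkedHashMap
// used as an LRU over segments materialized from deep store (e.g. S3).
public class LazySegmentCache<S> {
  private final Function<String, S> _loader; // e.g. download from S3, then mmap
  private final Map<String, S> _cache;

  public LazySegmentCache(int maxLoadedSegments, Function<String, S> loader) {
    _loader = loader;
    // accessOrder=true means iteration order is least-recently-used first
    _cache = new LinkedHashMap<String, S>(16, 0.75f, true) {
      @Override
      protected boolean removeEldestEntry(Map.Entry<String, S> eldest) {
        // Evict past the budget; a real implementation would also release
        // the mmap and delete the local copy of eldest.getValue() here.
        return size() > maxLoadedSegments;
      }
    };
  }

  // Query path: a hit returns the already-loaded segment, a miss blocks on
  // the deep-store download first.
  public synchronized S getOrLoad(String segmentName) {
    return _cache.computeIfAbsent(segmentName, _loader);
  }
}
```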

This would allow using S3 as the primary storage, with Pinot as the query/caching layer for long-term historical tiers of data. Similar to the tiering example, you’d have a third set of lazy servers for reading data older than, say, 2 weeks. This is explicitly to avoid large EBS volume costs for very large datasets.

My main concern is this: a moderately sized dataset for us is 130GB a day, and we have some that can be in the terabyte range per day. Using 500MB segments, you’re looking at ~260 segments a day, or roughly 95k segments a year. In this case, broker pruning is very important, because any segment query sent to the lazy server means materializing data from S3. This data is mainly time series, which means segments would be in time-bound chunks. Does the Pinot broker prune segments by time? How is the broker managing segments? Does it just keep an in-memory list of all segments for all tables? If so, metadata pruning will become a bottleneck for us on most queries. I’d like to see query time scale logarithmically with the size of the data.

kishoreg commented 4 years ago

Thanks @noahprince22 for doing the analysis and creating this issue. This would be an amazing feature to have.

Adding broker pruning by time should be easy. Yes, the routing table should be enhanced to efficiently return the list of segments for a given time range.
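Something like keeping the segments in a start-time-sorted map would give the logarithmic scaling asked for above: candidate lookup becomes an O(log n) jump plus the matches. A rough illustrative sketch (not the actual routing code; all names here are made up):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

// Illustrative only, not Pinot's routing implementation: segments indexed by
// start time in a TreeMap, pruned against a query's [start, end] range.
public class TimePrunedRouting {
  public static final class SegmentInterval {
    final String _segmentName;
    final long _startMillis;
    final long _endMillis;

    SegmentInterval(String segmentName, long startMillis, long endMillis) {
      _segmentName = segmentName;
      _startMillis = startMillis;
      _endMillis = endMillis;
    }
  }

  private final TreeMap<Long, List<SegmentInterval>> _byStartTime = new TreeMap<>();
  private long _maxSpanMillis = 0; // longest segment time span seen so far

  public void addSegment(SegmentInterval segment) {
    _byStartTime.computeIfAbsent(segment._startMillis, k -> new ArrayList<>()).add(segment);
    _maxSpanMillis = Math.max(_maxSpanMillis, segment._endMillis - segment._startMillis);
  }

  // Only segments that can overlap [queryStart, queryEnd] survive pruning.
  public List<SegmentInterval> prune(long queryStart, long queryEnd) {
    List<SegmentInterval> candidates = new ArrayList<>();
    // Any overlapping segment starts in [queryStart - maxSpan, queryEnd];
    // subMap finds that window in O(log n).
    for (List<SegmentInterval> bucket
        : _byStartTime.subMap(queryStart - _maxSpanMillis, true, queryEnd, true).values()) {
      for (SegmentInterval segment : bucket) {
        if (segment._endMillis >= queryStart) { // exact overlap check
          candidates.add(segment);
        }
      }
    }
    return candidates;
  }
}
```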

The biggest piece of work will be around the table manager and segment manager. Let's create separate issues for time-based pruning and data types, and link them to this issue.

noahprince22 commented 4 years ago

Created an issue for the segment pruning: https://github.com/apache/incubator-pinot/issues/6189

noahprince22 commented 4 years ago

As a potential future optimization, after we lazy load entire segments, we should look into caching the metadata separately from the column data (columns.psf). creation.meta, index.map, and metadata.properties are all small and could either be eagerly loaded or have much looser LRU requirements.
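Under that split, metadata-only work never touches S3. Reusing the hypothetical `LazySegmentCache` sketch from above, the idea would look roughly like this (again illustrative, not actual Pinot classes):

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

// Hypothetical two-tier layout: creation.meta, index.map and
// metadata.properties are tiny, so they stay resident; the large columns.psf
// data goes through the LRU from the earlier sketch.
public class TieredSegmentCache<M, D> {
  private final Map<String, M> _residentMetadata = new ConcurrentHashMap<>();
  private final LazySegmentCache<D> _columnData;

  public TieredSegmentCache(int maxLoadedSegments, Function<String, D> columnLoader) {
    _columnData = new LazySegmentCache<>(maxLoadedSegments, columnLoader);
  }

  // Eagerly load the small metadata when the segment is assigned.
  public void registerSegment(String segmentName, M metadata) {
    _residentMetadata.put(segmentName, metadata);
  }

  // Pruning and metadata queries are answered without touching deep store.
  public M getMetadata(String segmentName) {
    return _residentMetadata.get(segmentName);
  }

  // Only queries that survive pruning pay for the columns.psf download.
  public D getColumnData(String segmentName) {
    return _columnData.getOrLoad(segmentName);
  }
}
```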

noahprince22 commented 4 years ago

Reading this blog: https://eng.uber.com/operating-apache-pinot/

> As the scale of data grew, we also experienced several issues caused by too many segments. Pinot leverages Apache Helix over Apache ZooKeeper for cluster management. For example, when a server transitions from offline to online, Pinot will propagate state transition messages via Helix to notify other instances. The number of such state transition messages is proportional to the number of segments on the server. When a server hosts too many segments, there can be a spike of state transition messages on Helix, resulting in lots of ZooKeeper nodes. If the number of ZooKeeper nodes goes beyond the buffer threshold, the Pinot server and controller will crash. To solve this issue, we added message throttling to Pinot controllers to flatten the state transition surge.
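The throttle they describe flattens the surge rather than shrinking it; the shape is roughly this (an illustrative sketch using Guava's `RateLimiter`, not the actual controller change):

```java
import com.google.common.util.concurrent.RateLimiter;

// Illustrative only, not the actual Pinot controller code. The idea is to
// meter how fast state transition messages are dispatched so a burst of
// OFFLINE->ONLINE transitions for thousands of segments doesn't flood
// ZooKeeper with message znodes all at once.
public class StateTransitionThrottle {
  private final RateLimiter _rateLimiter;

  public StateTransitionThrottle(double messagesPerSecond) {
    _rateLimiter = RateLimiter.create(messagesPerSecond);
  }

  // Blocks until a permit is available before dispatching one message.
  public void send(Runnable sendStateTransitionMessage) {
    _rateLimiter.acquire();
    sendStateTransitionMessage.run();
  }
}
```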

At the scale of data that requires this kind of lazy loading, you're going to have a lot of segments. Do we see this causing an issue with Helix state management messages?

mcvsubbu commented 4 years ago

> At the scale of data that requires this kind of lazy loading, you're going to have a lot of segments. Do we see this causing an issue with Helix state management messages?

If you have millions of segments per table, this can be a problem. We already faced this issue at LinkedIn, and took a few actions:

  1. Increased the network speed in ZooKeeper.
  2. Moved to allocating a limited number of instances for any table (i.e. cluster the segments of a table within as few hosts as possible).
  3. We tried enabling Helix batching, with mixed results (see the sketch after this list). It seemed to help, but other Helix bugs took over and prevented us from moving further on that. Some of those bugs may have been fixed, so we may be able to enable batching again.
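For reference, batch message mode can be toggled per resource through the Helix admin API, roughly like this (cluster, table, and ZooKeeper addresses are placeholders):

```java
import org.apache.helix.HelixAdmin;
import org.apache.helix.manager.zk.ZKHelixAdmin;
import org.apache.helix.model.IdealState;

// Sketch of enabling Helix batch message mode for one table's resource so
// per-segment state transitions get grouped into fewer messages.
// "PinotCluster" and "myTable_OFFLINE" are placeholder names.
public class EnableBatchMessageMode {
  public static void main(String[] args) {
    HelixAdmin admin = new ZKHelixAdmin("localhost:2181");
    IdealState idealState = admin.getResourceIdealState("PinotCluster", "myTable_OFFLINE");
    idealState.setBatchMessageMode(true);
    admin.setResourceIdealState("PinotCluster", "myTable_OFFLINE", idealState);
  }
}
```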

I believe we also need to explore segments in cold storage that DO NOT make it into the IdealState. I don't have any specific ideas along these lines, though :)