Open noahprince22 opened 4 years ago
Thanks @noahprince22 for doing the analysis and creating this issue. This is an amazing feature to have.
Adding broker pruning by time should be easy. Yes, the routing table should be enhanced to efficiently return the list of segments for a given time range.
The biggest piece of work will be around the table manager and segment manager. Let's create separate issues for time-based pruning and data types and link them to this issue.
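As a rough sketch of what that time-based pruning could look like (all class and method names below, such as `SegmentTimeInfo` and `TimeRangePruner`, are illustrative rather than existing Pinot APIs; the only assumption is that each segment's min/max time is available from its metadata):

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Illustrative per-segment time info, e.g. derived from segment ZK metadata.
final class SegmentTimeInfo {
  final String segmentName;
  final long minTimeMs;
  final long maxTimeMs;

  SegmentTimeInfo(String segmentName, long minTimeMs, long maxTimeMs) {
    this.segmentName = segmentName;
    this.minTimeMs = minTimeMs;
    this.maxTimeMs = maxTimeMs;
  }
}

// Broker-side pruning sketch: segments are kept sorted by start time, so a query
// window only touches the prefix of segments that can possibly overlap it.
final class TimeRangePruner {
  private final List<SegmentTimeInfo> segmentsByStartTime;

  TimeRangePruner(List<SegmentTimeInfo> segments) {
    segmentsByStartTime = new ArrayList<>(segments);
    segmentsByStartTime.sort(Comparator.comparingLong(s -> s.minTimeMs));
  }

  // Returns only the segments whose [minTime, maxTime] overlaps [queryStartMs, queryEndMs].
  List<String> selectSegments(long queryStartMs, long queryEndMs) {
    List<String> selected = new ArrayList<>();
    for (SegmentTimeInfo segment : segmentsByStartTime) {
      if (segment.minTimeMs > queryEndMs) {
        break; // sorted by start time, nothing later can overlap
      }
      if (segment.maxTimeMs >= queryStartMs) {
        selected.add(segment.segmentName);
      }
    }
    return selected;
  }
}
```

With segments kept sorted by start time, a query over a recent window only touches the segments that can actually overlap it rather than every segment in the table.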
https://github.com/apache/incubator-pinot/issues/6189
Created an issue for the segment pruning.
As a potential future optimization, after we lazy-load entire segments, we should look into separately caching the metadata and the columns.psf file. creation.meta, index.map, and metadata.properties are all small and could either be eagerly loaded or have much looser LRU requirements.
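A minimal sketch of that split, with all names invented for illustration (it does not reflect the actual SegmentDataManager internals): the small per-segment files stay resident, and only the columns.psf contents go through a bounded cache.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Illustrative split: creation.meta, index.map and metadata.properties are tiny and
// stay resident for every segment, while the large columns.psf data goes through a
// bounded cache and is re-fetched (or re-mmapped) on demand.
final class SplitSegmentCache {
  static final class SegmentMetadata { /* parsed creation.meta / index.map / metadata.properties */ }
  static final class ColumnData { /* contents of columns.psf */ }

  // Eagerly loaded and never evicted: small enough to keep for every segment.
  private final Map<String, SegmentMetadata> residentMetadata = new ConcurrentHashMap<>();

  // Any bounded LRU works here (access-ordered LinkedHashMap, Caffeine, ...);
  // it is injected so the eviction policy stays independent of the split itself.
  private final Map<String, ColumnData> columnCache;

  SplitSegmentCache(Map<String, ColumnData> boundedColumnCache) {
    this.columnCache = boundedColumnCache;
  }

  void registerSegment(String segmentName, SegmentMetadata metadata) {
    residentMetadata.put(segmentName, metadata);
  }

  SegmentMetadata getMetadata(String segmentName) {
    return residentMetadata.get(segmentName); // never triggers a deep-store read
  }

  synchronized ColumnData getColumnData(String segmentName) {
    return columnCache.computeIfAbsent(segmentName, this::loadColumns);
  }

  private ColumnData loadColumns(String segmentName) {
    // Hypothetical: pull columns.psf back from S3 or page it in from local disk.
    return new ColumnData();
  }
}
```

Keeping the small files resident means pruning and metadata lookups never have to touch the deep store, even for segments whose column data has been evicted.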
Reading this blog: https://eng.uber.com/operating-apache-pinot/

> As the scale of data grew, we also experienced several issues caused by too many segments. Pinot leverages Apache Helix over Apache Zookeeper for cluster management. For example, when a server transitioned from offline to online, Pinot will propagate state transition messages via Helix to notify other instances. The number of such state transition messages are proportional to the number of the segments on the server. When a server hosts too many segments, there could be a spike of state transition messages on Helix, resulting in lots of zookeeper nodes. If the number of zookeeper nodes is beyond the buffer threshold, the Pinot server and controller will crash. To solve this issue, we added message throttling to Pinot controllers to flatten the state transition surge.
At the scale of data that requires this kind of lazy loading, you're going to have a lot of segments. Do we see this causing an issue with Helix state management messages?
If you have millions of segments per table, this can be a problem. We already faced this issue at LinkedIn, and took a few actions:
I believe we need to also explore segments in cold storage that DO NOT make it into the idealstate. I don't have any specific ideas along these lines, though :)
Some discussion already here: https://apache-pinot.slack.com/archives/CDRCA57FC/p1603720037246100
This would involve modifying the Pinot server to include a lazy mode in which it lazily pulls segments as they are requested, keeping them in an LRU cache. It should just take some modification to the SegmentDataManager and maybe the table manager.
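A minimal sketch of that lazy mode, under the assumption that the server can fetch a segment from the deep store (e.g. S3) the first time a query touches it; `LazySegmentManager`, `acquire`, and `fetchFromDeepStore` are hypothetical names, not existing Pinot classes:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Sketch of a lazy server mode: segments are known by name only until a query needs
// them, at which point they are pulled from the deep store (e.g. S3) and kept in a
// bounded LRU so local disk and memory stay small.
final class LazySegmentManager {
  static final class LoadedSegment { /* stands in for a fully loaded immutable segment */ }

  private final Map<String, LoadedSegment> residentSegments;

  LazySegmentManager(int maxResidentSegments) {
    // Access-ordered LinkedHashMap gives a simple LRU; evict once over capacity.
    residentSegments = new LinkedHashMap<String, LoadedSegment>(16, 0.75f, true) {
      @Override
      protected boolean removeEldestEntry(Map.Entry<String, LoadedSegment> eldest) {
        if (size() > maxResidentSegments) {
          unload(eldest.getValue()); // free the local copy; the source of truth stays in S3
          return true;
        }
        return false;
      }
    };
  }

  // Query path: return the segment, downloading and loading it on a cache miss.
  synchronized LoadedSegment acquire(String segmentName) {
    return residentSegments.computeIfAbsent(segmentName, this::fetchFromDeepStore);
  }

  private LoadedSegment fetchFromDeepStore(String segmentName) {
    // Hypothetical: download the segment tarball from S3, untar it, and load it the
    // same way a freshly assigned segment would be loaded today.
    return new LoadedSegment();
  }

  private void unload(LoadedSegment segment) {
    // Hypothetical: destroy in-memory structures and delete the local files.
  }
}
```

Segments that are never queried are never downloaded, so the local footprint tracks the working set rather than the total retained data.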
This would allow using S3 as the primary storage, with Pinot as the query/caching layer for long-term historical tiers of data. Similar to the tiering example, you'd have a third set of lazy servers for reading data older than 2 weeks. This is explicitly to avoid large EBS volume costs for very large data sets.
My main concern is this: a moderately sized dataset for us is 130GB a day. We have some that can be in the terabyte range per day. Using 500MB segments, you're looking at ~260 segments a day, or roughly 95k segments a year. In this case, broker pruning is very important because any segment query sent to the lazy server means materializing data from S3. This data is mainly time series, which means segments would be in time-bound chunks. Does the Pinot broker prune segments by time? How does the broker manage segments? Does it just have an in-memory list of all segments for all tables? If so, metadata pruning will become a bottleneck for us on most queries. I'd like to see query time scale logarithmically with the size of the data.
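For concreteness, the back-of-envelope segment counts implied above, assuming 500MB segments throughout:

```java
// Back-of-envelope segment counts for the data sizes mentioned above, at 500 MB per segment.
public final class SegmentCountEstimate {
  public static void main(String[] args) {
    long segmentSizeMb = 500;

    long moderatePerDay = 130_000 / segmentSizeMb;   // 130 GB/day -> 260 segments/day
    long largePerDay = 1_000_000 / segmentSizeMb;    // 1 TB/day   -> 2,000 segments/day

    System.out.println("130 GB/day: " + moderatePerDay + " segments/day, "
        + moderatePerDay * 365 + " segments/year");  // ~95k segments/year
    System.out.println("1 TB/day:   " + largePerDay + " segments/day, "
        + largePerDay * 365 + " segments/year");     // ~730k segments/year
  }
}
```

At the terabyte-per-day end that is roughly 730k segments a year, so both the cost of scanning a flat in-memory segment list per query at the broker and the Helix state-transition volume raised earlier in the thread become pressing well before long retention windows are reached.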