apache / pinot

Apache Pinot - A realtime distributed OLAP datastore
https://pinot.apache.org/
Apache License 2.0

Support Serving Latest Offline Segments Immediately #5751

Open ianvkoeppe opened 4 years ago

ianvkoeppe commented 4 years ago

Overview

For hybrid setups, the Pinot broker splits each query between the offline and realtime tables and filters each side by a time value derived from the latest available offline segment. Pinot assumes the consumer is still uploading partial or intermediate segments throughout the current day, so it serves only data up to (time value - 1) from the offline table and the rest from realtime.

As a concrete example, if I upload a segment with the date 1/2/2000, the time value for the table will be 1/1/2000. A query over all time is then rewritten to query (, 1/1/2000] on the offline table and (1/1/2000, ) on the realtime table. In this scenario, the 1/2/2000 segment serves no queries. Only once another segment is uploaded for 1/3/2000 will the data from 1/2/2000 be served.
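
The split in the example above can be sketched as follows. This is only an illustration of the behavior described here, not Pinot's broker code; `time_boundary`, `split_query`, and the `time` column name are hypothetical, and days are assumed as the time unit.

```python
from datetime import date, timedelta
from typing import Dict, List, Optional


def time_boundary(offline_segment_end_dates: List[date]) -> Optional[date]:
    """Latest offline segment end date minus one day (hypothetical helper)."""
    if not offline_segment_end_dates:
        return None
    return max(offline_segment_end_dates) - timedelta(days=1)


def split_query(boundary: date) -> Dict[str, str]:
    """Time filter each side of the hybrid table would receive."""
    iso = boundary.isoformat()
    return {
        "offline": f"time <= '{iso}'",   # (, boundary]
        "realtime": f"time > '{iso}'",   # (boundary, )
    }


# Segments for 1/1/2000 and 1/2/2000 are uploaded.
boundary = time_boundary([date(2000, 1, 1), date(2000, 1, 2)])
filters = split_query(boundary)
```

With these inputs the boundary is 1/1/2000, so the 1/2/2000 offline segment falls outside the offline filter until a later segment is pushed.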

Problem Statement

In hybrid table setups where offline segments are uploaded only once per day, those offline segments should start serving immediately, taking precedence over the realtime segments. Today, this has to be achieved via a "hack" where an empty segment is pushed for a future date to trick Pinot into serving the latest offline segment with actual data.

Requirements

  1. This should be configurable per table.

Design Doc

https://docs.google.com/document/d/1KykbUIstFI7F4IOOlUVnXWzpgKazwRAJdok0RT1ae64

mcvsubbu commented 4 years ago

Here is my suggested approach.

  1. Create a table config indicating that the time boundary for the table should be the next unit up (i.e., the next day for a daily push, the next hour for an hourly push, etc.).
  2. Change the time boundary code to move up the time boundary when new segments are added to the table.

In this case, there is no new API, but for use cases that have multiple segments, the time boundary will be advanced when the first segment is pushed. This will lead to inconsistency.

Pinot currently has the inconsistency problem only if RT is not there, but we will be introducing a new one now.

We can, however, introduce an API as well, which can send a message to the brokers to advance the time boundary. This API will be invoked by the push job when all segments are pushed. This will reduce the inconsistency window to some narrow cases such as: (1) Three segments are being pushed. After the first one, a broker restarts and, looking at the table config and the most recent timestamp, advances the boundary. Since the other two segments are not pushed yet, the broker has no way of knowing that a message is going to come by later. Corner case, but maybe we can leave that for the day when we support "atomic" push of data. (2) Three segments are being pushed. The push completes, but the external view is not yet updated (slow server, slow helix, whatever). The broker meanwhile gets the message to move the time boundary.

The second one is more likely to happen, but either one is a bad experience for the application.

One mitigation may be to include the segment names in the controller API and have the controller wait for the EV of these segments to be ONLINE before issuing the message, but that is also not complete. The broker may still not see the new EV.
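
A rough sketch of that mitigation, assuming the controller can poll the external view. Everything here is hypothetical (`fetch_external_view`, `notify_brokers`, the dictionary shape of the EV); it is not an existing Pinot or Helix API. It also does not close the remaining gap: a broker may still be holding an older EV when the message arrives.

```python
import time
from typing import Callable, Dict, Iterable

# Assumed shape: segment name -> {server instance: state}
ExternalView = Dict[str, Dict[str, str]]


def all_online(ev: ExternalView, segments: Iterable[str]) -> bool:
    """True only if every listed segment is ONLINE on every replica in the EV."""
    for segment in segments:
        states = ev.get(segment)
        if not states or any(state != "ONLINE" for state in states.values()):
            return False
    return True


def wait_then_notify(
    segments: Iterable[str],
    fetch_external_view: Callable[[], ExternalView],
    notify_brokers: Callable[[], None],
    timeout_s: float = 60.0,
    poll_s: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Poll the EV until the pushed segments are ONLINE, then message brokers.

    Returns False without notifying if the timeout expires first.
    """
    pushed = list(segments)
    deadline = clock() + timeout_s
    while True:
        if all_online(fetch_external_view(), pushed):
            notify_brokers()
            return True
        if clock() >= deadline:
            return False
        sleep(poll_s)
```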

Ideas?

kishoreg commented 4 years ago

@mcvsubbu the time boundary should be saved somewhere in ZK so that all brokers use it and it handles all the failure and restarts scenarios.

The user still has to wait for the servers to load the segment before invoking the time boundary.

mcvsubbu commented 4 years ago

@kishoreg I didn't follow. We don't store the time boundary now. Why do we need to keep it in ZK? Perhaps you are trying to support an arbitrary time boundary that a user can set. I don't see a use case for that now. My thought is that we can support one of two time boundaries -- either include the latest <day/hour> or exclude it. I think this will cover most of the use cases.

What is the use case to cover up to some arbitrary user-specified time?

kishoreg commented 4 years ago

The latest day will clearly not work unless there is exactly one segment for that day, which is too limiting.

The use case I am thinking of is the same one listed here. If we allow this to be part of ZK, then the time boundary is max(max(segments) - 1, value in ZK).
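
A minimal sketch of that formula, with days assumed as the time unit. `effective_time_boundary` is a hypothetical name, and the handling of a missing ZK value or an empty segment list is an assumption not stated in this thread.

```python
from datetime import date, timedelta
from typing import List, Optional


def effective_time_boundary(
    segment_end_dates: List[date],
    zk_value: Optional[date],
) -> Optional[date]:
    """max(max(segments) - 1, value in ZK), assuming daily granularity."""
    if not segment_end_dates:
        # Assumption: with no offline segments, fall back to the ZK value.
        return zk_value
    derived = max(segment_end_dates) - timedelta(days=1)
    if zk_value is None:
        # Assumption: no ZK value means today's derived boundary applies.
        return derived
    return max(derived, zk_value)
```

For example, with a latest segment of 1/2/2000 and a ZK value of 1/2/2000, the boundary becomes 1/2/2000 instead of 1/1/2000, so the latest segment is served without pushing an empty future segment.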

Eventually, we want to avoid users having to manage offline jobs. Users are already asking for features where we provide a minion task that moves realtime segments to offline nodes automatically after a day. Having this API will allow us to control the time boundary.

mayankshriv commented 4 years ago

Summarizing the offline discussion between @mcvsubbu @snleee @sajjad-moradi and @mayankshriv:

  1. Since there are use cases that may push partial data for the day (UTC vs PST), a table-level config is warranted to control where this feature is enabled.
  2. We will have a primitive to update the time boundary after segments are pushed. This will create a znode to persist the time boundary value.
  3. We will need to replace the existing EV based time-boundary computation with one that is based on the Znode listener. We might also need a periodic auto-correct task (in case updating the Znode fails).

mcvsubbu commented 4 years ago

Let us avoid combining this with #5712. That was about replacing segments, and this is about adjusting the time boundary after segments are pushed. A single primitive to adjust the time boundary, invoked by the segment pusher, should be enough.

mcvsubbu commented 4 years ago

I also thought some more about znode and watches. Introducing a new node may be fine, but having the broker set a watch on that needs some thought. I remember in the HLC case, the controller was setting too many watches, and that caused ZooKeeper re-connects to fail. A re-connect requires that the zk client send the entire watch list in one message to the zk server. We had to increase the message size setting in order to overcome this...

mayankshriv commented 4 years ago

It seems discussing here may not be as effective as desired. My proposal is to have a design doc with the initial proposal detailed out. We can use that as a starting point to discuss the pros/cons. @sajjad-moradi @snleee is one of you still planning to start that design doc?

sajjad-moradi commented 4 years ago

@mayankshriv sure, I can start the initial design doc.

sajjad-moradi commented 4 years ago

Here is the design doc proposing three different solutions for this problem: https://docs.google.com/document/d/1KykbUIstFI7F4IOOlUVnXWzpgKazwRAJdok0RT1ae64 Please review and leave your comments.

ianvkoeppe commented 4 years ago

Thanks, added to ticket.

sajjad-moradi commented 4 years ago

@ianvkoeppe The agreement on the design doc is to go with approach 1 as the short-term solution. Since this approach is very similar to what you provided in PR #5745, could you please reopen the PR and apply the new changes? Basically, the boolean flag that you have defined in TableConfig needs to be changed to a number indicating how many days (or hours) need to be deducted from the time boundary. The default value should be one for backward compatibility. Follow PR #4156 to make the changes to the time boundary calculation.
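
To make the requested change concrete, here is a hedged sketch of a boundary computed with a numeric offset instead of a boolean. The function name, the `unit` strings, and the `offset` parameter are hypothetical stand-ins; the actual config name and its location in TableConfig are whatever the design doc settles on.

```python
from datetime import datetime, timedelta


def time_boundary_with_offset(
    latest_offline_end: datetime,
    unit: str = "DAYS",
    offset: int = 1,
) -> datetime:
    """Deduct `offset` units from the latest offline segment end time.

    offset=1 reproduces the current behavior (backward compatible);
    offset=0 serves the latest offline segment immediately.
    """
    if offset < 0:
        raise ValueError("offset must be non-negative")
    if unit == "DAYS":
        delta = timedelta(days=offset)
    elif unit == "HOURS":
        delta = timedelta(hours=offset)
    else:
        raise ValueError(f"unsupported time unit: {unit}")
    return latest_offline_end - delta
```

A table that pushes once per day and wants the new segment served right away would use an offset of 0, while tables that push partial days keep the default of 1.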

mcvsubbu commented 4 years ago

@sajjad-moradi we would prefer to discuss table config items in a design doc. If you can update the design doc with the proposed config, then @ianvkoeppe can go for it. thanks.

sajjad-moradi commented 4 years ago

The doc is updated with the config suggestion. Please review and leave your comments.

ianvkoeppe commented 4 years ago

Thanks all! I appreciate the rapid iteration on the design. @sajjad-moradi I'm happy to iterate on that PR and make necessary changes.

mayankshriv commented 4 years ago

@ianvkoeppe Just checking if you are still planning on iterating your PR as per the new design?

ianvkoeppe commented 4 years ago

@mayankshriv, I would like to still get this done, but it's not something I can prioritize right now.