apache / pinot

Apache Pinot - A realtime distributed OLAP datastore
https://pinot.apache.org/
Apache License 2.0

Add support for Logical Tables #10712

Open egalpin opened 1 year ago

egalpin commented 1 year ago

It can come in handy to have the ability to fully replace a table without any disruption or changes for those issuing queries. If there were support to add/remove (especially to atomically add and remove) aliases from tables, it would be easy to reconstruct a table with a new set of configurations, or bootstrap it in a different way than the original table, and then "swap" the table alias to the new table.

Ex. making an incompatible schema or table spec change

Design Doc: https://docs.google.com/document/d/1WX5-3XJ4Nrsod0_rO_JEtvgN9XVKDi_pUySistXAOkY/edit?usp=sharing

Jackie-Jiang commented 1 year ago

+1 on this. We can introduce the logical table concept which is used in the query, and by default use the physical table name as the logical table name. With the logical table concept, we can also associate multiple physical tables with the same logical table (one query accessing multiple tables).

egalpin commented 1 year ago

@Jackie-Jiang is the way that this is done for hybrid tables (replacing myTable with myTable_OFFLINE and myTable_REALTIME) generic and flexible enough to be recycled to support this feature of table alias/logical table name?

Jackie-Jiang commented 1 year ago

> @Jackie-Jiang is the way that this is done for hybrid tables (replacing myTable with myTable_OFFLINE and myTable_REALTIME) generic and flexible enough to be recycled to support this feature of table alias/logical table name?

Good point! We can think of myTable as the logical table, and myTable_OFFLINE and myTable_REALTIME as the physical tables. Currently the mapping is fixed, where the logical table always maps to physical tables of the same name with a type suffix. We want to introduce a more flexible mapping, where a logical table can map to other physical tables.

mcvsubbu commented 1 year ago

> +1 on this. We can introduce the logical table concept which is used in the query, and by default use the physical table name as the logical table name. With the logical table concept, we can also associate multiple physical tables with the same logical table (one query accessing multiple tables).

Are you suggesting that we somehow map one logical table to multiple physical tables at the same time (perhaps with the physical table name somehow chosen at query time)? Or are we saying that there can be several physical tables underneath, and we just nail one of them at any given time?

mcvsubbu commented 1 year ago

I think the mytable_REALTIME and mytable_OFFLINE analogy is not quite the same. In the case of REALTIME and OFFLINE, the expectation is that the two tables contain the exact same data. I think the original intent of this feature is that the underlying physical tables could contain different data. @egalpin to correct me if I am wrong.

egalpin commented 1 year ago

My original motivation was to add the ability to replace a table in a way that was transparent to anything issuing queries. Ex. myTable is a realtime upsert table with 8 kafka partitions. Later, data volume has grown such that I really ought to be using 16 partitions to handle the volume. Using this feature, I could go about adding myTable_v2 which is populated from a new kafka topic with 16 partitions this time. Then I could "flip the switch" so that queries would start using the new table instead of the original table. There are similar parallels for OFFLINE tables as well, where I might want to rebuild the table in some backward incompatible or potentially dangerous way, and maintain the ability to swap to the new table immediately, and also swap back immediately if things don't go according to plan.

If this can be extended or re-used to be able to support what Jackie mentioned, i.e. "fanout" queries to all tables with the same alias, that would be great too. I can think of many use cases where it would be helpful to have multiple tables sharing an alias.

If fanning out to multiple tables can be done via alias though, it feels like UNION might be even better to support?

egalpin commented 1 year ago

> Are you suggesting that we somehow map one logical table to multiple physical tables at the same time (perhaps with the physical table name somehow chosen at query time)? Or are we saying that there can be several physical tables underneath, and we just nail one of them at any given time?

For me, the former of being able to hit multiple/all tables with the same logical name would be an awesome feature. Is initial design discussion ok here? I have a few questions that would help inform design considerations:

  1. Where would the mapping between physical/logical table names exist? Zookeeper? I would advocate strongly for a singleton record containing all of the mappings; this way, bulk changes can be applied atomically (ex. removing an alias from one table and adding it to another in one atomic step is great to have)
  2. Presumably this mapping would be cached/stored locally within Brokers for performance reasons? Thoughts on managing that aspect?

There are quite a few cases I can imagine making use of this so I'd like to get started in the near term once we reach an agreeable design 😊 Thanks in advance!

mcvsubbu commented 1 year ago

> > Are you suggesting that we somehow map one logical table to multiple physical tables at the same time (perhaps with the physical table name somehow chosen at query time)? Or are we saying that there can be several physical tables underneath, and we just nail one of them at any given time?
>
> For me, the former of being able to hit multiple/all tables with the same logical name would be an awesome feature. Is initial design discussion ok here? I have a few questions that would help inform design considerations:
>
> 1. Where would the mapping between physical/logical table names exist? Zookeeper? I would advocate strongly for a singleton record containing all of the mappings; this way, bulk changes can be applied atomically (ex. removing an alias from one table and adding it to another in one atomic step is great to have)
>
> 2. Presumably this mapping would be cached/stored locally within Brokers for performance reasons? Thoughts on managing that aspect?
>
> There are quite a few cases I can imagine making use of this so I'd like to get started in the near term once we reach an agreeable design 😊 Thanks in advance!

We are discussing requirements here, not design.

What is the use case that drives mapping multiple physical tables to the same logical table? Can you elaborate a bit? Do the physical tables have the same schema? How does a given query (that may only have the logical table name) choose between the physical tables to run the query in?

The answers to these questions may then drive what the mapping should look like, and eventually to where that mapping should live.

egalpin commented 1 year ago

> What is the use case that drives mapping multiple physical tables to the same logical table? Can you elaborate a bit?

Here are some example use cases that piqued my interest:

  1. I have a use case where I can liken the data to user sessions, where a session can be either active or closed. I would like to have 3 tables which represent the total data: an upsert-enabled realtime table representing active sessions, plus a hybrid table to account for realtime ingestion of newly closed sessions as well as historical closed sessions. It isn't currently possible to query all of these tables at once, but it would be very nice to do so.
  2. "Whale" or VIP tables, also "priority queue" tables. Sometimes a certain customer or set of customers represents an outsized portion of the data, which might not be well handled by Pinot's existing partitioning. It would help to be able to isolate that customer data in a separate table that is still queryable via a single table name, so that those issuing queries do not need awareness of DB organization details to conditionally target the correct table.
  3. User-managed time partitioning. Imagine a time series dataset. Being able to have a collection of tables which each holds a given time period of data would be helpful operationally.

> Do the physical tables have the same schema?

Yes, I would guess so (like in the case of a hybrid table today). Or at the very least, mutually shared columns would have the same types. It might be ideal to provide support for tables having a subset/superset of columns, but that's not a "must" feature for a v1 IMO.

> How does a given query (that may only have the logical table name) choose between the physical tables to run the query in?

I believe that, at least initially, the query would strictly choose all physical tables with the same logical name. There might be ways to optimize that in the future, ex. in the above example of VIP tables, where we might be able to select only 1 out of many physical tables based on some fact we know about the table architecture and query inputs. But I don't think that would be a requirement of an initial version.

My main priority would be the use case of being able to replace a table easily and seamlessly. That wouldn't require the ability to support multiple physical tables with the same logical name. That said, I can foresee making use of the ability to have multiple physical tables with the same logical name, so it would be nice to do all in one go if feasible.

egalpin commented 1 year ago

Friendly ping 😊 Any thoughts on how these use cases would impact recommendations for what the mapping should look like or where it should live? In case it's helpful as a starting point, I had thought the aliases could be stored as a single JSON-serialized string of key-value pairs, where the keys are the alias names and the values are arrays of physical table names associated with each alias, like so:

```json
{
    "myAlias": ["physicalTable1", "physicalTable2"],
    "myOtherAlias": ["otherPhysicalTable1", "otherPhysicalTable2", "otherPhysicalTable3"]
}
```

This would allow for easily modifying the associations of many aliases atomically, as well as easy translation to a Java Map with constant-time lookup of the physical tables that should be targeted for a query. Thoughts?
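For illustration, a minimal sketch of that translation, assuming Jackson for JSON handling; `AliasStore` and its method names are hypothetical, not existing Pinot classes:

```java
import com.fasterxml.jackson.core.type.TypeReference;
import com.fasterxml.jackson.databind.ObjectMapper;
import java.util.List;
import java.util.Map;

// Hypothetical helper: deserialize the singleton alias record into an in-memory
// map so a broker can resolve a logical name to its physical tables in O(1).
public class AliasStore {
  private static final ObjectMapper MAPPER = new ObjectMapper();

  public static Map<String, List<String>> parse(String json) throws Exception {
    return MAPPER.readValue(json, new TypeReference<Map<String, List<String>>>() {});
  }

  public static void main(String[] args) throws Exception {
    String json = "{\"myAlias\": [\"physicalTable1\", \"physicalTable2\"]}";
    Map<String, List<String>> aliases = parse(json);
    System.out.println(aliases.get("myAlias")); // prints [physicalTable1, physicalTable2]
  }
}
```

Because the whole record is a single document, swapping an alias between tables would be one atomic write of that record.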

Jackie-Jiang commented 1 year ago

If we just want to add alias support, it should be similar to the case-insensitive feature, where we keep a mapping from lower-case name to actual name. Similarly, we can keep the alias-to-actual-name mapping within the TableCache. With this approach, we can achieve the alias functionality, but not multiple physical tables associated with the same logical table.
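A small sketch of what such a lookup could look like, by analogy with the case-insensitive map; the class and method names here are illustrative, not the actual TableCache API:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Illustrative one-to-one alias resolution, analogous to the existing
// lower-case-name-to-actual-name map used for the case-insensitive feature.
public class AliasAwareTableCache {
  private final Map<String, String> _aliasToActualName = new ConcurrentHashMap<>();

  public void addAlias(String alias, String actualTableName) {
    _aliasToActualName.put(alias, actualTableName);
  }

  // Resolve the table name in a query: return the actual name if it is an
  // alias, otherwise fall back to the name as written.
  public String resolveTableName(String queryTableName) {
    return _aliasToActualName.getOrDefault(queryTableName, queryTableName);
  }
}
```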

egalpin commented 1 year ago

@Jackie-Jiang do you think that it would be possible to use aliases to support a use case like:

Jackie-Jiang commented 1 year ago

@egalpin To achieve that, we need to introduce the logical table concept, where one logical table can point to multiple physical tables. It is no longer a simple alias (an alias is a one-to-one mapping). When querying the logical table, Pinot needs to smartly prune the tables that don't contain the columns queried.
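For illustration, one way that pruning could work, assuming the broker knows each physical table's column set; `PhysicalTableInfo` is a hypothetical holder, not an existing class:

```java
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

// Hypothetical per-table metadata the broker would need for pruning.
record PhysicalTableInfo(String tableName, Set<String> columns) {}

final class ColumnPruner {
  // Keep only the physical tables that contain every column the query touches.
  static List<PhysicalTableInfo> prune(List<PhysicalTableInfo> tables, Set<String> queriedColumns) {
    return tables.stream()
        .filter(table -> table.columns().containsAll(queriedColumns))
        .collect(Collectors.toList());
  }
}
```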

egalpin commented 1 year ago

Right, yep 👍 Just wanted to know if there was opposition to the idea generally, or an absolute technical limitation preventing even trying the approach. Thanks!

A few other questions, assuming the logical/physical table approach:

Jackie-Jiang commented 1 year ago

We need some design around this, and here is what comes into my mind as of now:

mcvsubbu commented 1 year ago

Can you fill me in on what "to solve the hybrid table" means?

My suggestion would be to NOT change the hybrid table definition. Instead, keep it the same. The logical table binding should happen before we branch between realtime/offline.

So, the query comes into the broker; we look up whether a mapping to physical table(s) is defined, substitute the physical table name(s), and then do further query processing. Each of the underlying physical tables could be hybrid, realtime-only, or offline-only.
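A rough sketch of that ordering (all names hypothetical): resolve the logical name first, then hand each resolved table to the existing, unchanged realtime/offline routing:

```java
import java.util.List;
import java.util.Map;

// Illustrative broker-side flow: logical binding happens before the existing
// REALTIME/OFFLINE branching, which stays untouched.
public final class LogicalTableResolver {
  private final Map<String, List<String>> _logicalToPhysical;

  public LogicalTableResolver(Map<String, List<String>> logicalToPhysical) {
    _logicalToPhysical = logicalToPhysical;
  }

  public List<String> resolve(String tableNameInQuery) {
    // No mapping defined: fall back to the name as written, preserving
    // today's behavior for plain physical and hybrid tables.
    return _logicalToPhysical.getOrDefault(tableNameInQuery, List.of(tableNameInQuery));
  }
}
```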

In terms of allowed mapping, we should have something that enables mapping one logical table to one or more physical tables. If multiple physical tables are configured, then another config could say whether the code should pick any or all of them (@egalpin's requirement). Not sure if there is a need for specific additional (configurable) logic depending on which table is picked, but we can let that ride for now.

+1 on brokers recognizing immediately when the mapping is changed. To that effect, maybe the mapping should be stored in zookeeper, away from TableConfig. Maybe it can be under PROPERTYSTORE/CONFIGS/CLUSTER? It is OK if the brokers do not set a watch (perhaps preferred that way). The mapping can be updated via a controller API, and the brokers informed by the controller.

Some other random thoughts:

  • The logical table should not be a Helix resource (intuitively). Let me know if there is a problem with this, and we can discuss further.
  • As a consequence, the logical table cannot have a logicalTableName_REALTIME physical table, ever.
  • How will table metrics be emitted? Ideally, all table-level metrics should be emitted under the logical table name. Code may become a bit messy in places (emit physical-table, logical-table, and global metrics).
  • Operational tools need to be examined: if a logical table maps to a different physical table, then some of the table APIs should be modified to reflect that there is a different physical table. Not sure how this will work if there is more than one physical table.
  • At least for a start, let us assume that all physical tables have the same schema. This can throw a wrench into having multiple copies of the same schema, since we insist now that the schema name is the same as the table name. Either the restriction should be relaxed, or some way provided so that schema changes are updated for all physical tables at the same time (e.g. a schema change is allowed only on the logical table).

Instead of thinking this through piecemeal, I strongly suggest we start writing a design doc, with at least the requirements part clearly identified.

Jackie-Jiang commented 1 year ago

> My suggestion would be to NOT change the hybrid table definition. Instead, keep it the same. The logical table binding should happen before we branch between realtime/offline.

This is basically the idea. Currently the hybrid table is an implicit logical table, where it always consists of 2 physical tables - one offline and one real-time. We can keep it implicit, connecting them by the raw table name. I'm thinking we may also introduce an explicit hybrid logical table concept in addition to the implicit one, where we allow connecting multiple offline tables with multiple real-time tables, but we need to design a way to represent the time boundary.

> So, the query comes into the broker; we look up whether a mapping to physical table(s) is defined, substitute the physical table name(s), and then do further query processing. Each of the underlying physical tables could be hybrid, realtime-only, or offline-only.

The hybrid table itself is already not a physical representation, so I'd prefer modeling it as a logical table, even though it is implicit. Then the abstraction would be: a logical table can point to multiple tables, either physical or logical. This way, we can share the same abstraction for hybrid table management (currently it is hard-coded into 2 parts, one for real-time and one for offline).

> In terms of allowed mapping, we should have something that enables mapping one logical table to one or more physical tables. If multiple physical tables are configured, then another config could say whether the code should pick any or all of them (@egalpin's requirement). Not sure if there is a need for specific additional (configurable) logic depending on which table is picked, but we can let that ride for now.

> +1 on brokers recognizing immediately when the mapping is changed. To that effect, maybe the mapping should be stored in zookeeper, away from TableConfig. Maybe it can be under PROPERTYSTORE/CONFIGS/CLUSTER? It is OK if the brokers do not set a watch (perhaps preferred that way). The mapping can be updated via a controller API, and the brokers informed by the controller.

Yes, the mapping needs to be stored in ZK under the property store. We can discuss the path in the detailed design. The broker doesn't need a watch; when the mapping is updated, the controller can send a message to the broker to refresh the routing (same as the current table config update).

> • The logical table should not be a Helix resource (intuitively). Let me know if there is a problem with this, and we can discuss further.

The logical table should be a partition under BrokerResource, so that the broker can load the mapping.

> • As a consequence, the logical table cannot have a logicalTableName_REALTIME physical table, ever.

I don't follow this. A logical table is just a mapping from a logical table name to n table resources, and it can contain any table type.

> • How will table metrics be emitted? Ideally, all table-level metrics should be emitted under the logical table name. Code may become a bit messy in places (emit physical-table, logical-table, and global metrics).

Good point. We need to design this properly. Physical computation metrics can be associated with the physical table; query stats can be associated with the logical table.

> • Operational tools need to be examined: if a logical table maps to a different physical table, then some of the table APIs should be modified to reflect that there is a different physical table. Not sure how this will work if there is more than one physical table.

I'd imagine adding a new set of APIs for logical tables, where we only allow modifying the mapping. Physical table management might remain the same. E.g. we don't really have any API associated with the hybrid table.

> • At least for a start, let us assume that all physical tables have the same schema. This can throw a wrench into having multiple copies of the same schema, since we insist now that the schema name is the same as the table name. Either the restriction should be relaxed, or some way provided so that schema changes are updated for all physical tables at the same time (e.g. a schema change is allowed only on the logical table).

This actually depends on the design. If the logical table is simply a mapping, then there is no schema associated with it. A schema is only associated with a physical table.

> Instead of thinking this through piecemeal, I strongly suggest we start writing a design doc, with at least the requirements part clearly identified.

Yes. We are just brainstorming and putting some random thoughts here to be covered in the design. This should be carefully designed.

mcvsubbu commented 1 year ago

> > My suggestion would be to NOT change the hybrid table definition. Instead, keep it the same. The logical table binding should happen before we branch between realtime/offline.
>
> This is basically the idea. Currently the hybrid table is an implicit logical table, where it always consists of 2 physical tables - one offline and one real-time. We can keep it implicit, connecting them by the raw table name. I'm thinking we may also introduce an explicit hybrid logical table concept in addition to the implicit one, where we allow connecting multiple offline tables with multiple real-time tables, but we need to design a way to represent the time boundary.

What is the use case driving this requirement of needing to maintain a time boundary across multiple tables like you describe? I am inclined to say that we let the current REALTIME/OFFLINE be as it is, call it a HYBRID "physical" table, and not let this leak into the logical table concept we are trying to build (unless there is a use case that cannot be solved otherwise). This may end up complicating the logical table design, just to shoehorn it into the current hybrid table mechanisms.

Jackie-Jiang commented 1 year ago

> What is the use case driving this requirement of needing to maintain a time boundary across multiple tables like you describe? I am inclined to say that we let the current REALTIME/OFFLINE be as it is, call it a HYBRID "physical" table, and not let this leak into the logical table concept we are trying to build (unless there is a use case that cannot be solved otherwise). This may end up complicating the logical table design, just to shoehorn it into the current hybrid table mechanisms.

Pinot has a scale problem with the number of segments in a single physical table due to a ZK limit, which can also be solved with the logical table. Thinking about the current hybrid table, it is 2 physical tables with a time boundary. IMO modeling it as a physical table can be more confusing because it is not the basic management unit on the server, but only a routing concept on the broker, which is actually aligned with the proposed logical table concept.

mcvsubbu commented 1 year ago

Can you provide some more thoughts on how the segment count problem in a single physical table can be solved by the concept of logical table (and how it is connected to the current hybrid table)?

egalpin commented 1 year ago

@mcvsubbu I think the idea is that if multiple physical OFFLINE tables and multiple physical REALTIME tables could be composed into a single logical table, including maintaining the notion of a timeBoundary, then the segment count limitation could be mitigated by adding additional physical OFFLINE or REALTIME tables and including them in the "logical hybrid" table.

I think it would be a great idea to eventually have hybrid tables represented using the same concept of physical/logical table so that hybrid tables would be less of a special case. That said, I think it might be worth separating that work into a distinct follow-up outside the scope of initially introducing physical/logical table feature.

I'll try to capture the conversation so far here into a design doc to better allow organization of requirements, intentional punts/anti-features/follow-up features, as well as a place to share design ideas once we settle requirements.

mcvsubbu commented 1 year ago

OK, let me try it another way: What limits of ZK are we (you) hitting? Number of nodes? Read/write throughput? Both? And how will your proposal of managing the time boundary across multiple realtime and offline tables help here?

Jackie-Jiang commented 1 year ago

@mcvsubbu The main issue is the ZNode size limit (1M by default), which can be hit by the table's ideal state, external view, or current state when a physical table contains too many segments (~10K). One workaround is to increase the jute buffer size, but that can potentially cause memory issues on ZK, and it is not a thorough fix of the scale problem. By introducing the logical table, we can split the segments into multiple physical tables and query them all together. This way we can horizontally scale the segments.

mcvsubbu commented 1 year ago

Ah, got it. So, you are trying to limit the number of segments for a single table.

So, in that case, I would think that you will need one realtime table and multiple offline tables for the same logical table. I got confused by the requirement of multiple realtime tables and managing time boundaries across those. Let me know if this captures the requirement (that we need to be able to manage multiple time boundaries of the logical table, spanning the physical tables -- a single realtime table and multiple offline tables).

Jackie-Jiang commented 1 year ago

Managing multiple time boundaries within the hybrid table can be confusing, so I'd prefer to still only allow a single time boundary (at least initially). The time boundary is extracted from the offline table, so we can probably pick the highest timestamp across all offline tables (can be configurable). I don't see a blocker for supporting multiple real-time tables within a single hybrid table as long as the time boundary is properly managed. Basically, the hybrid table can be modeled as a logical table with a single time boundary attached to the offline tables and real-time tables.
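A sketch of that default rule, assuming the broker can see a maximum end time per offline table; `OfflineTableView` is a hypothetical stand-in for per-table metadata:

```java
import java.util.List;

// Hypothetical per-table view exposing the highest segment end time.
record OfflineTableView(String tableName, long maxEndTimeMs) {}

final class LogicalTimeBoundary {
  // Default rule sketched here: the logical hybrid table's single time
  // boundary is the highest timestamp across all member offline tables.
  static long computeTimeBoundaryMs(List<OfflineTableView> offlineTables) {
    return offlineTables.stream()
        .mapToLong(OfflineTableView::maxEndTimeMs)
        .max()
        .orElseThrow(() -> new IllegalStateException("Logical table has no offline tables"));
  }
}
```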

mcvsubbu commented 1 year ago

@Jackie-Jiang and I had a discussion yesterday. It seems like he has in mind to tackle two problems with this.

In both of these use cases, it seems to me that the logical table abstraction should also be extended to data push.

On the offline side, this will mean some functionality in Pinot that will take the data pushed to the logical table, and distribute it correctly across the physical tables.

On the realtime side, it is more murky; maybe physical tables need to be created to consume from a single topic, and a logical table created to declare that these be combined.

What I was looking for, on the other hand, was just to be able to substitute the table name in the query with a physical table name (i.e. only for query purposes). The use cases I have in mind have duplicate data across the tables, so it does not make sense to route queries to both tables. Some mechanism needs to be provided in Pinot to choose between the tables when a query comes in. From a data push standpoint, each table may push its own data (with the data being a bit different between them).

TBD whether the three requirements should be (or can be) combined into one feature (while still keeping it simple)

Jackie-Jiang commented 1 year ago

@mcvsubbu I'm not planning to change the ingestion flow at least for the first stage. The destination of the ingestion should remain the physical table (OFFLINE or REALTIME), and logical table is just a routing concept, which is managed on the broker.

With current Pinot, we can think of the raw table as the logical table, where the mapping is implicit: the raw table name maps to the corresponding _OFFLINE and _REALTIME physical tables.

The proposed logical table is a mapping from a logical table name to multiple physical tables, plus an optional time boundary. In the future we may relax this constraint to allow a logical table to map to another logical table. All the changes should be on the broker side, and there should be no change needed in the ingestion flow.
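For concreteness, one possible, purely illustrative shape for that metadata (the real representation is left to the design doc):

```java
import java.util.List;

// Illustrative model only: a logical table is a name, the physical tables it
// routes to, and an optional time boundary for hybrid-style logical tables.
record LogicalTableConfig(
    String logicalTableName,
    List<String> physicalTableNames,
    Long timeBoundaryMs) { // null when no time boundary applies
}
```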

egalpin commented 11 months ago

I've added a design doc to this issue description. It's not so much a clean and clear proposed solution, but rather a starting point in order to get to a solution. I've tried to capture the above conversation as best I can, though I've focused on my own requirements and may not have captured all proposals.

I'd very much appreciate your ongoing input into the design doc!

hpvd commented 5 months ago

Probably worth its own issue, but since the problem came up in this discussion, this seems like a fitting place to start:

Regarding the scalability problems of ZooKeeper, we may want to look into Oxia, which is on the way to solving this metadata scalability problem for Apache Pulsar. Oxia is advertised as "a genuinely horizontally scalable metadata framework. It empowers distributed messaging systems to seamlessly handle hundreds of millions of topics, all while removing the intricacies and operational burdens associated with ZooKeeper".

https://github.com/streamnative/oxia-java

Background: see Pulsar.