confluentinc / ksql

The database purpose-built for stream processing applications.
https://ksqldb.io
Other
73 stars 1.04k forks source link

Define Access Pattern for Metadata Fields #3734

Closed big-andy-coates closed 4 years ago

big-andy-coates commented 4 years ago

Background: https://github.com/confluentinc/ksql/pull/3722#issuecomment-549401032

Meta columns, i.e. currently only ROWTIME, are the only columns which aren't driven by the data, i.e. they are implicit columns. As KSQL is enhanced we want to expose more metadata, e.g. the partition number, the message offset and, eventually, headers. However, doing so using the current pattern of one column per metadata field, will result in breaking changes: if a user has data with a column named partition then their query will break if we release a new version of KSQL with a partition meta column.

To avoid this, we should consider moving all the metadata under a single METADATA column of type STRUCT. We are then free to add new fields to the struct without risk of breaking peoples queries.

Initially, it would just be STRUCT<timestamp BIGINT>, but can quickly become STRUCT<timestamp BIGINT, partition INT, offset BIGINT> etc.

blueedgenick commented 4 years ago

a random, but related!, thought is that the 'meta-columns' may be better exposed as scalar functions rather than masquerading as columns - e.g. ROWTIME(), PARTITION() and so on. C.f. how the ROWNUM() function works in Oracle and derivative systems (although not quite the same, cos that's scoped to a resultset). I guess this goes back to the age-old argument of one person's metadata being another person's data...

On Mon, Nov 4, 2019 at 7:25 AM Andy Coates notifications@github.com wrote:

Background: #3722 (comment) https://github.com/confluentinc/ksql/pull/3722#issuecomment-549401032

Meta columns, i.e. currently only ROWTIME, are the only columns which aren't driven by the data, i.e. they are implicit columns. As KSQL is enhanced we want to expose more metadata, e.g. the partition number, the message offset and, eventually, headers. However, doing so using the current pattern of one column per metadata field, will result in breaking changes: if a user has data with a column named partition then their query will break if we release a new version of KSQL with a partition meta column.

To avoid this, we should consider moving all the metadata under a single METADATA column of type STRUCT. We are then free to add new fields to the struct without risk of breaking peoples queries.

Initially, it would just be STRUCT, but can quickly become STRUCT<timestamp BIGINT, partition INT, offset BIGINT> etc.

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/confluentinc/ksql/issues/3734?email_source=notifications&email_token=ABCXJIH4VR3ABH436E6SRV3QSA5HDA5CNFSM4JIVFJT2YY3PNVWWK3TUL52HS4DFUVEXG43VMWVGG33NNVSW45C7NFSM4HWUHFDA, or unsubscribe https://github.com/notifications/unsubscribe-auth/ABCXJIBF4POMO4QKOHTABDTQSA5HDANCNFSM4JIVFJTQ .

agavra commented 4 years ago

I much prefer having them as scalar functions - it makes it so much easier to reason about these columns! They're not in the schema but they're always accessible. This basically models them exactly as we wanted them, and if a user wants them in the schema they could always have SELECT ROWTIME() as ROWTIME or whatever they want.

big-andy-coates commented 4 years ago

Interesting. This is certainly worth considering.

Related discussion: https://github.com/confluentinc/ksql/pull/3722#issuecomment-549401032

cc @MichaelDrogalis @derekjn what do you think to the idea of using system functions to access metadata like ROWTIME, partition and offset, rather than them being implicitly part of the schema.

MichaelDrogalis commented 4 years ago

That's a pretty interesting idea. If we project out what a stream or table's schema will look like when we support all the pieces of data that a Kafka record has, the schema will be pretty wide by default. I think scalar function sense here, but @derekjn has the best experience to know the pros/cons.

derekjn commented 4 years ago

I think that storing row-level metadata in a more flexible format is a really good idea that will undoubtedly pay dividends moving forward as we evolve the system. It also seems to me that metadata storage and access are somewhat orthogonal though. We can store metadata as a Struct and access that metadata using a number of different approaches, e.g.:

And we can also just use the existing column reference approach if we simply recognized a ROWTIME reference (or any other metadata column) as a special column that results in a METADATA Struct lookup.

As far as I know, ROWNUM, ROWID etc. in Oracle are actually more like psuedocolumns that are referenced as columns and not necessarily function calls (could be wrong though!). I'm sure they map to a function call internally but cosmetically they look and feel like columns to a user. PostgreSQL's version of this pattern is system columns, where every table has a bunch of metadata columns that may be referenced but aren't shown as part of a table's schema. System columns in PostgreSQL are also referenced like regular columns and not function calls.

I think both the scalar function call approach and the psuedocolumn approach are pretty good, although I'm not sure that using function call syntax adds much value for users. Are the extra () bytes helpful?

What does everyone think about simply mapping metadata column references (e.g. ROWTIME) to a METADATA Struct lookup internally? That approach has the additional benefit of being congruent with how KSQL currently feels, and it's arguably more consistent with what's happening internally (row-level value retrieval).

big-andy-coates commented 4 years ago

Makes sense @derekjn

I think what's key here is that the meta columns should not be returned if a user does SELECT * FROM X. If they want the metadata they can via SELECT *, ROWTIME FROM X;.

If we go down this route, then it's probably ok to keep all the meta columns flat, i.e. no nesting in a STRUCT, just ROWTIME ROWOFFSET ROWPARTITION. Headers might need more thought... here a set of functions may make more sense, but we don't need to worry about that now.

derekjn commented 4 years ago

If we go down this route, then it's probably ok to keep all the meta columns flat, i.e. no nesting in a STRUCT

I think the key thing is storing metadata columns in such a way that adding new meta columns doesn't break the existing storage layout.

blueedgenick commented 4 years ago

I think when we say "storage" of the record metadata we need to be careful to qualify what we mean - the metadata is, after all, actually stored in a Kafka record, in a very specific manner. What I think we are discussing here is the "model" of these specific metadata items, and there are 2 users of that model: the human ksql user, who cares about how to access and manipulate it; and also the rest of the ksql codebase, which wants to understand the implications of how it should manipulate that metadata under certain circumstances.

It's a long time ago now (2.5 years already? gulp!) since we first decided to have ROWTIME in it's current form and, if memory serves, I was responsible for that decision. The intent at that time was for it to be a pseudocolumn, just like Oracle ROWNUM, primarily so that (a) it doesn't have to be declared in every DDL to be accessible; and (b) we wanted to be able, in a future enhancement, to allow some "pseudocolumns" to be explicitly updated by a query (obviously more applicable to timestamp and headers than to offset or partition number!). I say this because we thought at the time that making them accessible via scalar functions would make value-changing more syntactically difficult (what 'column' do you assign the result of the manipulation to?).

I mention all this because it seems that the driving factor in how we represent these metadata values to the user depends largely on what we want a user to be able to do with them.... If they are read-only in every circumstance, then pseudocolumns (whether presented as being in a pseudo-struct or not) are pretty much equivalent to scalar functions for access, with the exception of potential naming clashes. I think we created confusion around what happens in projections, especially through joins and groupings, by taking a shortcut for short-term expediency and making the ROWKEY accessible in a similar way as ROWTIME (and that was definitely my fault too). I've always said that ROWKEY should go away as soon as we can properly represent a subset of the normal fields of a message as forming the key, and the confusion this has caused about what pseudocolumns are for and how they should behave has unfortunately compounded as it took us much longer than anticipated to fix up the handling of key fields.

So in conclusion I want to advocate that: 1 - ROWKEY should burn in fire asap (on which I think we all agree); 2 - other metadata fields (time, headers, offset, partition) should be accessible through ubiquitous pseudocolumns much like ROWTIME today. Namespacing them into a special struct may reduce the potential for name conflicts but not remover it, and that suggests to me that the added complexity isn't worth it. There are better ways to deal with naming conflicts via aliasing. 3 - in a future version, some of them should be capable of being assigned (i.e. updated), which is slightly different than the way something like ROWNUM works but will be handy for certain usecases. (Imagine an aggregate function which can combine the entries of multiple maps, and how you could then use this to combine the header values of multiple messages flowing through a groupby, and then you could assign the result of that udaf to be the headers of the output record. I've been asked for this a few times already).

On Thu, Nov 7, 2019, 10:36 PM Derek Nelson notifications@github.com wrote:

If we go down this route, then it's probably ok to keep all the meta columns flat, i.e. no nesting in a STRUCT

I think the key thing is storing metadata columns in such a way that adding new meta columns doesn't break the existing storage layout.

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/confluentinc/ksql/issues/3734?email_source=notifications&email_token=ABCXJIHCXPPDWXHXQF35DH3QSSJ6DA5CNFSM4JIVFJT2YY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOEDOCH5Y#issuecomment-551298039, or unsubscribe https://github.com/notifications/unsubscribe-auth/ABCXJIAKEUEDNDNYG43XYATQSSJ6DANCNFSM4JIVFJTQ .