Closed mightyguava closed 2 years ago
- v., k., t., and h. prefixes for selecting value, key, timestamp, and header are arbitrary and unintuitive
I agree very much with this.
- each column's selector is a string evaluated by a separate parser
This specific issue could be solved by pulling that sub-parser into the CREATE SOURCE parser directly, e.g.:
columnselectors = (
v.transaction.token,
v.customer_token,
v.transaction.total_balance_impact.amount,
v.transaction.total_balance_impact.currency_code,
v.occurred_at,
k
);
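For anyone skimming the thread, here's a rough Python sketch of what splitting one of these selector strings into a message part and a field path could look like. It is not Prana's actual sub-parser: `parse_selector` and `PREFIXES` are made-up names, and only the four prefixes themselves come from this issue.

```python
# Hypothetical sketch: split a prefixed column selector into (source, path).
# Only the v/k/t/h prefixes come from the issue; the function name and the
# return shape are assumptions made for illustration.
PREFIXES = {"v": "value", "k": "key", "t": "timestamp", "h": "header"}


def parse_selector(selector: str) -> tuple[str, list[str]]:
    """Return the message part a selector reads from and the field path within it."""
    prefix, _, rest = selector.partition(".")
    if prefix not in PREFIXES:
        raise ValueError(f"unknown selector prefix: {prefix!r}")
    # A bare prefix such as "k" selects the whole key, so its path is empty.
    path = rest.split(".") if rest else []
    return PREFIXES[prefix], path
```

Once selectors are parsed by the main grammar, a step like this would presumably become part of the CREATE SOURCE parser rather than a second pass over a string.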
We may have quite a lot of uses for CASE in Prana. It's limiting that it can only return a single column.
I think this is a good point, particularly with `oneof`s as you've illustrated.
A few additional data points to consider:
I do wonder if a more domain-specific language might be better than a normal SELECT, for the reasons you've stated, but I don't know what that would be.
One issue I noticed while merging the parsers is that SQL is case insensitive, while the selector is necessarily case sensitive.
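One way to picture that tension is a tokenizer that folds case only for keywords and leaves identifiers untouched. The sketch below is an assumption about how this could be handled, not a description of the merged Prana parser; `KEYWORDS`, `TOKEN_RE`, and `tokenize` are hypothetical names, and the keyword set is deliberately tiny.

```python
import re

# Hypothetical keyword set; a real grammar would have many more.
KEYWORDS = {"SELECT", "FROM", "AS", "CASE", "WHEN", "THEN", "ELSE", "END"}

# A word (identifier or keyword), or any single non-space character.
TOKEN_RE = re.compile(r"[A-Za-z_][A-Za-z0-9_]*|\S")


def tokenize(sql: str) -> list[tuple[str, str]]:
    """Classify words as keywords (case-folded) or identifiers (case preserved)."""
    tokens = []
    for text in TOKEN_RE.findall(sql):
        if text.upper() in KEYWORDS:
            # SQL keywords match case-insensitively, so normalize them.
            tokens.append(("keyword", text.upper()))
        elif text[0].isalpha() or text[0] == "_":
            # Selector field names stay exactly as written.
            tokens.append(("ident", text))
        else:
            tokens.append(("punct", text))
    return tokens
```

With this approach a message field whose name happens to spell a keyword (say `end`) would still be folded into a keyword, so some quoting rule would be needed on top.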
LGTM. For what it's worth, CashEventing does implement the CloudEvent idea, so a few header fields are populated. Some teams extract these fields from the payload, or let them default to auto-generated values 😬. All that said, I can't really imagine writing a Prana query based on the headers that we have.
Items 1 and 2 are complete. I'm going to leave 3 alone for now, leaving this ticket open.
Closing this, as I don't think there's a strong case for making major changes to the create source syntax at this point.
We'll need to extend `columnselectors` in CREATE SOURCE significantly to support rather complicated topics like `ledger_events_2`. I'd like to propose a slightly different syntax to make the final result slightly closer to SQL spec and hopefully easier to use.

The existing syntax:
A few issues I have with it are, in no particular order:

1. `v.`, `k.`, `t.`, and `h.` prefixes for selecting value, key, timestamp, and header are arbitrary and unintuitive
2. each column's selector is a string evaluated by a separate parser
3. `columnselector` is embedded as part of the topic configuration, when it should probably be standalone

Here's a draft syntax that tries to solve these problems:
This borrows from the `CREATE TABLE <tablename> AS SELECT .... FROM <tablename>` syntax, already used for `CREATE MATERIALIZED VIEW`, and common to most popular dialects like MySQL and PostgreSQL. In reference to the previous issues mentioned:

1. `v.`, `k.`, `t.`, and `h.` prefixes for selecting value, key, timestamp, and header are arbitrary and unintuitive

The "value" (message body) is going to be the most commonly selected from. It is promoted to the top level here, allowing top-level fields to be selected without the `v.` prefix. Keys, headers, and message timestamp will be provided by the special `meta` function, using `meta('key')`, `meta('headers')`, and `meta('timestamp')`. Function name suggestions welcome.

On Cash, these meta fields are unlikely to ever get used. `key` is usually some arbitrary partition key (sometimes random/incoherent) that is always available in the message body. We don't use headers AFAIK. And finally, the timestamp of the message is usually unimportant, while the timestamp of the underlying triggering event is usually in the message body, like `occurred_at` above.

2. each column's selector is a string evaluated by a separate parser
Column selectors are now just projections in the SELECT statement, becoming part of the main grammar. This is important as now we can reuse SQL functions and other language features here. For the `ledger_events_2` topic specifically, we can use the `CASE` statement, e.g.

In the above, `which_one_of` is a function that returns the name of the populated proto `one_of` field. We use a standard SQL `CASE` statement to switch on the result of `which_one_of` to get the amount. (This is just for demonstration purposes... I have no idea what the fields in the message actually mean.)

We may have quite a lot of uses for `CASE` in Prana. It's limiting that it can only return a single column. We may want to extend it to support returning multiple columns, for example:

Aside: this gets quite wordy, so it might be useful in the future to support defining temporary variables or multiple select pipeline stages, like `def x = transaction.total_balance_impact`, or jq-like filters.

3. `columnselector` is embedded as part of the topic configuration

Topic configuration now goes after the `FROM` keyword, as the `kafka` function, which returns a "virtual table" that is the source of the data. This feels cleaner, and we can have other functions for defining different data sources orthogonal to the SELECT syntax... though SQL functions aren't supposed to have a named argument syntax, so that needs some rethinking.
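To make the evaluation model from points 1 and 2 a bit more tangible, here is a small Python sketch of how bare field paths, `meta(...)`, and `which_one_of` might behave against a single message. Everything in it is an assumption for illustration: the message shape, the `field` helper, the extra `candidates` argument to `which_one_of`, and the `event`/`payment`/`refund` field names are all made up and do not reflect the real `ledger_events_2` schema or Prana's implementation.

```python
from typing import Any, Optional

# Hypothetical in-memory shape of one Kafka message; field names are made up.
Message = dict[str, Any]


def field(msg: Message, path: str) -> Any:
    """Resolve a dotted path against the message value (no `v.` prefix needed)."""
    current = msg["value"]
    for part in path.split("."):
        current = current[part]
    return current


def meta(msg: Message, name: str) -> Any:
    """Return envelope data: 'key', 'headers', or 'timestamp'."""
    if name not in ("key", "headers", "timestamp"):
        raise ValueError(f"unknown meta field: {name!r}")
    return msg[name]


def which_one_of(msg: Message, path: str, candidates: list[str]) -> Optional[str]:
    """Return the name of the populated oneof member under `path`, if any."""
    container = field(msg, path)
    for name in candidates:
        if container.get(name) is not None:
            return name
    return None


msg = {
    "key": "partition-123",
    "headers": {},
    "timestamp": 1700000000,
    "value": {"event": {"payment": {"amount": 500}}},
}

kind = which_one_of(msg, "event", ["payment", "refund"])
# Python's stand-in for CASE ... WHEN ... THEN ... END on the oneof name.
if kind == "payment":
    amount = field(msg, "event.payment.amount")
elif kind == "refund":
    amount = field(msg, "event.refund.amount")
else:
    amount = None
```

The `if`/`elif` chain is also where the single-column limitation shows up: each branch can only produce one value, so pulling out both an amount and a currency would mean repeating the whole switch.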