airbnb / chronon

Chronon is a data platform for serving AI/ML applications.
Apache License 2.0

CHIP-2: Fluent API for Chronon #762

Open nikhilsimha opened 1 month ago

nikhilsimha commented 1 month ago

Context

Our API for creating objects currently takes all arguments at once. We use this style to create objects such as Source, GroupBy, and Join. In effect, we expose functions with tens of arguments, which places a cognitive burden on both the author and the reader.

This doc outlines a fluent API inspired in part by PRQL. A PRQL example is shown below.

from invoices
filter invoice_date >= @1970-01-16
filter income > 1
group customer_id (
  aggregate {
    average total,
    sum_income = sum income,
    ct = count total,
  }
)

Goals

Non-Goals

Why not directly adopt PRQL?

Like SQL, PRQL doesn't support the following features that Chronon supports

Approach

We outline examples that build a Source, a GroupBy and a Join using a fluent API. The subsequent sections have examples for each object type.

Building Sources

From.fact(table, topic).in_range(start, end) 
    .with_timestamp("ts") # optional, goes anywhere - defaults to ts
    .where(clause1, clause2) # can use derive aliases in clause 1 or 2
    .select(col1=expr1, col2=expr2)

Basic Source - without any transformations.

In SQL, the first clause that runs is the from clause. The root of the API is From, which describes the raw source without any transformations.

From has 5 methods:

Chronon also allows connecting a join as a source to another groupBy, which is referred to as chaining. We also allow passing StagingQueries directly when they are annotated with the data model using .as_fact(), .as_dim() or .as_scd(). At this point, we have a class that defines which tables and topics to read, and the data model: fact, dim or scd.
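The chained style above could be backed by an immutable builder, where each method returns a new object instead of mutating the old one. The following is a minimal sketch under stated assumptions, not the proposed implementation: `SourceSketch` and its fields are hypothetical names, and only the methods that appear in the example above (`fact`, `in_range`, `with_timestamp`, `where`, `select`) are modelled.

```python
from dataclasses import dataclass, replace
from typing import Optional, Tuple


@dataclass(frozen=True)
class SourceSketch:
    """Hypothetical stand-in for the object that From.fact(...) returns."""
    table: str
    topic: Optional[str] = None
    model: str = "fact"            # data model: fact, dim or scd
    start: Optional[str] = None
    end: Optional[str] = None
    timestamp: str = "ts"          # default taken from the example above
    wheres: Tuple[str, ...] = ()
    selects: Tuple[Tuple[str, str], ...] = ()

    # Each method returns a modified copy, so partial chains can be reused.
    def in_range(self, start: str, end: str) -> "SourceSketch":
        return replace(self, start=start, end=end)

    def with_timestamp(self, column: str) -> "SourceSketch":
        return replace(self, timestamp=column)

    def where(self, *clauses: str) -> "SourceSketch":
        return replace(self, wheres=self.wheres + clauses)

    def select(self, **projections: str) -> "SourceSketch":
        return replace(self, selects=self.selects + tuple(projections.items()))


class From:
    """Root of the chain; only the fact entry point is sketched here."""

    @staticmethod
    def fact(table: str, topic: Optional[str] = None) -> SourceSketch:
        return SourceSketch(table=table, topic=topic, model="fact")


# Table, topic and column names below are made up for illustration.
base = From.fact("data.invoices", "invoices_topic").in_range("2024-01-01", "2024-02-01")
src = base.where("income > 1").select(customer_id="customer_id", income="income")
```

Because `base` is never mutated, the same partially built source can feed several different `where` / `select` chains, which is one reason to prefer immutable builders for a fluent API.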

Filtered Source - applying where clause

In SQL, the second clause that runs, when present, is the where clause. PRQL chooses to call this filter; we retain SQL's where. Where has a more precise connotation than filter (where = filter in, where + not = filter out).

.where(clause1, clause2)

Clauses are joined together with AND, as in the base Chronon API.
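As an illustration of the AND semantics, the sketch below renders a list of clauses into a single predicate string. The helper name `combine_clauses` is hypothetical, and wrapping each clause in parentheses is an assumption made here so that a clause containing OR keeps its meaning after joining.

```python
def combine_clauses(*clauses: str) -> str:
    """Join where clauses with AND (hypothetical helper, not Chronon API)."""
    # Parenthesize each clause so "a OR b" does not change meaning once joined.
    return " AND ".join(f"({clause})" for clause in clauses)


predicate = combine_clauses("income > 1", "country = 'US' OR country = 'CA'")
```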

Why we won't need derive clauses

Very often you want to transform a table column a certain way and reuse it in both the where and select clauses. PRQL supports a derive method that substitutes an expression into the select and where clauses immediately downstream.

Since we have direct access to Python, we can instead use Python variables and f-strings to achieve more powerful substitutions.
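A minimal sketch of that substitution, using only plain Python strings. The column names and expressions are made up for illustration; the resulting strings are what would be passed to `.where(...)` and `.select(...)`.

```python
# A reusable expression, defined once as an ordinary Python variable.
# Parenthesized so it stays correct when embedded in larger expressions.
amount_usd = "(amount_cents / 100.0)"

# Reuse it in a filter clause ...
where_clauses = [f"{amount_usd} > 1"]

# ... and in projections, without a dedicated derive step.
select_exprs = {
    "amount_usd": amount_usd,
    "is_large": f"{amount_usd} > 1000",
}
```

Since these are ordinary Python values, they can also be produced by functions or loops, which a fixed derive clause cannot express.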

Select clause

As mentioned, in Chronon the select clause is reserved exclusively for projections. It works the same as before.

Group By

At this point we have a fully formed source. We can now support union, groupBy and join on it.


src
    .groupBy(key1, key2)
    .defaultWindows([1d, 7d])
    .agg(
        last_k(col1, k = 10, windows = [1d, 7d], buckets = [bucket1, bucket2]),
        approx_uniq(col2, lg_k = 20),
        last(col3, buckets = [bucket1, bucket2])
    )
    .temporalAccuracy() # optional
    .derive(alias1 = expr1, alias2 = expr2)
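One plausible reading of `defaultWindows` is that it supplies windows only to aggregations that do not declare their own. The sketch below illustrates that reading; it is an assumption, not confirmed behavior. `AggSketch` and `apply_default_windows` are hypothetical names, and windows are written as strings such as `"1d"` because a bare `1d` is not a valid Python literal.

```python
from dataclasses import dataclass, replace
from typing import List, Optional, Sequence, Tuple


@dataclass(frozen=True)
class AggSketch:
    """Hypothetical record of one aggregation inside .agg(...)."""
    operation: str
    column: str
    windows: Optional[Tuple[str, ...]] = None  # None means "not specified"


def apply_default_windows(
    aggs: Sequence[AggSketch], default_windows: Sequence[str]
) -> List[AggSketch]:
    """Fill in windows for aggregations that did not specify any (assumed semantics)."""
    defaults = tuple(default_windows)
    return [
        agg if agg.windows is not None else replace(agg, windows=defaults)
        for agg in aggs
    ]


aggs = [
    AggSketch("last_k", "col1", windows=("1d",)),  # explicit windows are kept
    AggSketch("approx_uniq", "col2"),              # falls back to the defaults
]
resolved = apply_default_windows(aggs, ["1d", "7d"])
```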