apache / pinot

Apache Pinot - A realtime distributed OLAP datastore
https://pinot.apache.org/
Apache License 2.0

Make DateTimeFieldSpecs mainstream, deprecate TimeFieldSpec #2756

Open npawar opened 6 years ago

npawar commented 6 years ago

In the Pinot schema, stop using TimeFieldSpec in favor of DateTimeFieldSpec.

TimeFieldSpec is not adequate for the following reasons:

  1. A schema can have only one TimeFieldSpec. As a result, we can have only one time column in the schema.
  2. Users get around the above limitation by adding the other time columns as dimensions. However, dimension columns can only have the attributes name and dataType. TimeFieldSpec has attributes such as time unit (HOURS, MINUTES), time format (EPOCH/SDF), and pattern (yyyyMMdd), which describe the time column in more detail. Declaring the additional time columns as dimensions loses these attributes.
  3. TimeFieldSpec has attributes that describe the format of the time column, but it conveys nothing about the bucketing of the time column. DateTimeFieldSpec has a richer spec that also captures the bucketing information. Bucketing matters when the format a time column is expressed in differs from the granularity its values are bucketed to. For example, if the time column is expressed in epoch millis, but every value has been rounded to the nearest hour, we can set the DateTimeFieldSpec as:
    "dateTimeFieldSpecs": [
      {
        "name": "timestamp",
        "dataType": "LONG",
        "format": "1:MILLISECONDS:EPOCH",
        "granularity": "1:HOURS"
      }
    ]
  4. The conversion mechanism in TimeFieldSpec is inflexible: it can only convert from an incoming TimeGranularitySpec to an outgoing TimeGranularitySpec.
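To make the bucketing idea in point 3 concrete, here is a minimal Python sketch, not Pinot code. It assumes a granularity string has the shape `size:UNIT`, as in `1:HOURS` above; the helper names (`granularity_millis`, `bucket_epoch_millis`, `is_bucketed`) and the unit table are hypothetical. It floors values to the start of their bucket, whereas the producer in the example above may round to the nearest hour instead.

```python
# Hypothetical helpers for illustration only; these are not part of Pinot.
MILLIS_PER_UNIT = {
    "MILLISECONDS": 1,
    "SECONDS": 1_000,
    "MINUTES": 60_000,
    "HOURS": 3_600_000,
    "DAYS": 86_400_000,
}


def granularity_millis(granularity: str) -> int:
    """Parse a 'size:UNIT' string such as '1:HOURS' into a bucket width in millis."""
    size, unit = granularity.split(":")
    return int(size) * MILLIS_PER_UNIT[unit]


def bucket_epoch_millis(value: int, granularity: str) -> int:
    """Floor an epoch-millis value to the start of its bucket (assumed semantics)."""
    width = granularity_millis(granularity)
    return value - value % width


def is_bucketed(value: int, granularity: str) -> bool:
    """Check whether an epoch-millis value already sits on a bucket boundary."""
    return value % granularity_millis(granularity) == 0
```

A column described by the spec above would then hold only values for which `is_bucketed(value, "1:HOURS")` is true, even though each value is still stored in epoch millis.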

npawar commented 4 years ago

Design doc: https://docs.google.com/document/d/1SU1jCjfsIDSA960fD5YWQbD72p8UdGF0c7CroFNt9Ho/edit?usp=sharing

snleee commented 4 years ago

I'd like to discuss enforcing a uniform type & format for the primary time column (the one configured in the table config), and only allowing users to pick its granularity.

I think that having a uniform type & format for the primary time column can simplify a lot of components, because we won't need to support every type and format combination whenever we need to manipulate the time. For example, the segment roll-up config can be very simple: instead of type, format, and granularity, only granularity will be needed.

Another huge benefit will be easier integration with external UIs such as Superset. I believe the Pinot-Superset connector has some extra logic to convert the time column values into a format that Superset understands. Since Pinot's time column format/type differs for each table, the external system currently needs to know about Pinot's table configuration & schema (or a similar time conversion config needs to be specified for each table in Superset) to parse the time column values correctly.
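As a rough illustration of the per-table conversion an external tool has to do today, the sketch below normalizes a stored value to epoch milliseconds from a format string shaped like the `1:MILLISECONDS:EPOCH` example earlier in this issue. It is a Python sketch under stated assumptions, not the Superset connector's logic: `to_epoch_millis` and `UNIT_MILLIS` are hypothetical names, a value is assumed to count `size`-sized units since the epoch, and non-EPOCH formats (for example SDF patterns) are deliberately rejected.

```python
# Hypothetical helper for illustration only; not taken from Pinot or Superset.
UNIT_MILLIS = {
    "MILLISECONDS": 1,
    "SECONDS": 1_000,
    "MINUTES": 60_000,
    "HOURS": 3_600_000,
    "DAYS": 86_400_000,
}


def to_epoch_millis(value: int, fmt: str) -> int:
    """Convert a value stored as 'size:UNIT:EPOCH' into epoch milliseconds.

    Assumes the value counts size-sized units since the epoch.
    """
    size, unit, kind = fmt.split(":")
    if kind != "EPOCH":
        # SDF-style formats need a pattern and are out of scope for this sketch.
        raise ValueError(f"unsupported time format: {fmt}")
    return value * int(size) * UNIT_MILLIS[unit]
```

With a uniform format for the primary time column, a client would not need the per-table `fmt` argument at all.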

Since we are changing the time column interface anyway, it would be great if we could discuss this now.

@kishoreg @npawar @mayankshriv @Jackie-Jiang What do you think?

npawar commented 4 years ago

I agree this would be nice to have, and I see the benefits we gain for merge/rollup and integrations. I would prefer that we first make dateTimeFieldSpec the default, and then focus on this. Once we have only dateTimeFieldSpec, it'll be easier to add tooling and validations for achieving this. I feel this change is kinda tricky as it is, and I don't want to mix up goals.