apache / pinot

Apache Pinot - A realtime distributed OLAP datastore
https://pinot.apache.org/
Apache License 2.0
5.48k stars 1.28k forks source link

Add Support for DATE datatype in Pinot #9575

Open jatin5251 opened 2 years ago

jatin5251 commented 2 years ago

We have use case where we need to store date datatype in pinot.

xiangfu0 commented 2 years ago

Can you elaborate more on the use case and curious if TIMESTAMP can be an alternative for it?

cc: @Jackie-Jiang

jatin5251 commented 2 years ago

We are using tableau for visualization and due to its licensing cost we are in process to deprecate it with superset But tableau has extract feature in which we save output of sql query in tableau server as in-memory. To add this functionality in superset we build one service that will take sql query from superset, run it on trino then trino-pinot connector will insert data into pinot(insert support added by @elonazoulay in trino )

In our service we first take the query from user, get its schema from trino and convert the schema in a way that will be compatible to pinot like bigint to long

As hive tables has support for date datatype but not in pinot then we need to ask users to cast date type to varchar in query.

As we have all primitive data type support in pinot except Date. We want to add this support in pinot so that users will have good experience. please let me know if you need more information Thanks

Jackie-Jiang commented 2 years ago

For the DATE type, do you need time zone info, or the plain text such as 2022-10-13? To support it, internally we can store the value as either long (millis since epoch) or int (days since epoch). We need to consider whether to support the comparison across different time types, e.g. comparing DATE with TIMESTAMP. We also want to introduce DATE_TIME in the future, which support custom date time pattern and time zone.

jatin5251 commented 2 years ago

we are only looking for plain text at the moment. i think we can store as long (millis since epoch) that will help in future once we introduce DATE_TIME in the future.

xiangfu0 commented 2 years ago

Actually, in PrestoDB, there is a config to infer DATE type based on Pinot Schema: pinot.infer-date-type-in-schema. https://prestodb.io/docs/current/connector/pinot.html#map-pinot-schema-to-presto-schema

I feel maybe we can implement this in Trino as a temporary workaround before this type is supported in Pinot.

kishoreg commented 2 years ago

+1 to add date as a first class data type. Will be useful as we integrate with tools other than presto/Trino

jatin5251 commented 2 years ago

@xiangfu0 we are inserting data through trino and reading directly from superset. i think this property will not help much in this case but thanks to your suggestion.

Jackie-Jiang commented 2 years ago

Related to #9546

walterddr commented 2 years ago

is this a duplicate of https://github.com/apache/pinot/issues/8045 or do we expect more / different from a DateTime type?

specifically is it the binary compatibility with Hive's date datatype that we desire here?

some of the observation, i can offer by unifying the 2 issues and the timezone PR based on: https://www.postgresql.org/docs/current/datatype-datetime.html

  1. byte format:

    • 8 bytes are generally used for timestamp type, storing "both date and time";
    • 4 bytes used for date type, storing: "date but no time of day";
    • i am not sure what's the best byte format for time type, as storing: "time of day, no date";
    • hive date / time didn't specify the byte format of the underlying storage.
  2. timezone:

    • postgres timestamp doesn't save timezone info; but some other system like presto/trino does save timezone;
    • timezone info can be stored as either a timezone string or the offset from UTC; however the format provides more info, see: tz_database
  3. output format:

    • postgres always store data in UTC timezone with the precision following the byte format, it doesn't store date format info, most other systems also follow the same - require users to provide what's the expected stringify format for the output

we might want to ask the question on what to support in each of the 3 categories, but my intuition is to follow postgresQL as they are the most documented

jatin5251 commented 2 years ago

@walterddr we are looking to have DATE type in pinot, we can insert date type from trino and read from python connector (https://pypi.org/project/pinotdb/ ).

jatin5251 commented 2 years ago

@Jackie-Jiang

Related to #9546

we are looking for date type support only in which we don't have any time part. let me know your thoughts

jatin5251 commented 10 months ago

@Jackie-Jiang @walterddr @xiangfu0 @kishoreg Guys, can we support this in near future?

Jackie-Jiang commented 9 months ago

@jatin5251 It is not on our top priority list, but contribution is very welcome.