apache / pinot

Apache Pinot - A realtime distributed OLAP datastore
https://pinot.apache.org/
Apache License 2.0

API SQL not following format field configuration for timestamp/datetime fields #12734

Open estebanz01 opened 7 months ago

estebanz01 commented 7 months ago

Hello! 👋

So, playing around with Pinot, I found a small issue that bothers me a bit, and I thought it would be good to raise it here. I'm dealing with high-frequency data, so we need to timestamp points with microsecond precision. We could even have a case for nanoseconds, but at that point it's better to just deal with it at the edge. Anyhow, here's what I have as a schema:

Microseconds schema:

```json
{
  "schemaName": "data_counting_temp",
  "dimensionFieldSpecs": [
    {
      "name": "device_uuid",
      "dataType": "STRING"
    }
  ],
  "metricFieldSpecs": [
    {
      "name": "noise",
      "dataType": "LONG"
    }
  ],
  "dateTimeFieldSpecs": [
    {
      "name": "__key",
      "dataType": "TIMESTAMP",
      "format": "1:MICROSECONDS:EPOCH",
      "granularity": "1:MICROSECONDS"
    },
    {
      "name": "__metadata$eventTime",
      "dataType": "TIMESTAMP",
      "format": "1:MICROSECONDS:EPOCH",
      "granularity": "1:MICROSECONDS"
    }
  ]
}
```

The data comes from Pulsar, which is why I have the __metadata$eventTime and __key fields.

The granularity is supported, since it uses the time unit enums from https://docs.oracle.com/javase/8/docs/api/java/util/concurrent/TimeUnit.html. The problem comes when querying through the API. If I look at the table data in the Pinot UI, I get the following:

[image: a Pinot SQL table result showing the wrong timestamps]

which seemed fine at first, as I thought the values were converted on the frontend for styling purposes, but it turns out this is exactly what the backend returns:

```json
{
  "resultTable": {
    "dataSchema": {
      "columnNames": [
        "__key",
        "__metadata$eventTime"
      ],
      "columnDataTypes": [
        "TIMESTAMP",
        "TIMESTAMP"
      ]
    },
    "rows": [
      [
        "56202-05-11 12:33:13.431",
        "56202-05-11 12:33:13.431"
      ],
      [
        "56202-05-11 12:34:14.949",
        "56202-05-11 12:34:14.949"
      ],
      [
        "56202-05-11 12:34:37.971",
        "56202-05-11 12:34:37.971"
      ],
      [
        "56202-05-11 12:34:44.0",
        "56202-05-11 12:34:44.0"
      ],
      [
        "56202-05-11 12:34:54.065",
        "56202-05-11 12:34:54.065"
      ],
      [
        "56202-05-11 12:35:55.881",
        "56202-05-11 12:35:55.881"
      ],
      [
        "56202-05-11 12:36:19.978",
        "56202-05-11 12:36:19.978"
      ],
      [
        "56202-05-11 12:36:24.787",
        "56202-05-11 12:36:24.787"
      ],
      [
        "56202-05-11 12:36:36.25",
        "56202-05-11 12:36:36.25"
      ],
      [
        "56202-05-11 12:37:36.541",
        "56202-05-11 12:37:36.541"
      ]
    ]
  },
  "requestId": "1093678062000000000",
  "brokerId": "<broker-id>",
  "exceptions": [],
  "numServersQueried": 4,
  "numServersResponded": 4,
  "numSegmentsQueried": 2632,
  "numSegmentsProcessed": 337,
  "numSegmentsMatched": 4,
  "numConsumingSegmentsQueried": 1,
  "numConsumingSegmentsProcessed": 1,
  "numConsumingSegmentsMatched": 0,
  "numDocsScanned": 40,
  "numEntriesScannedInFilter": 0,
  "numEntriesScannedPostFilter": 80,
  "numGroupsLimitReached": false,
  "totalDocs": 42393394,
  "timeUsedMs": 365,
  "offlineThreadCpuTimeNs": 0,
  "realtimeThreadCpuTimeNs": 0,
  "offlineSystemActivitiesCpuTimeNs": 0,
  "realtimeSystemActivitiesCpuTimeNs": 0,
  "offlineResponseSerializationCpuTimeNs": 0,
  "realtimeResponseSerializationCpuTimeNs": 0,
  "offlineTotalCpuTimeNs": 0,
  "realtimeTotalCpuTimeNs": 0,
  "brokerReduceTimeMs": 13,
  "segmentStatistics": [],
  "traceInfo": {},
  "minConsumingFreshnessTimeMs": 1711547747582,
  "numSegmentsPrunedByBroker": 0,
  "numSegmentsPrunedByServer": 2295,
  "numSegmentsPrunedInvalid": 0,
  "numSegmentsPrunedByLimit": 0,
  "numSegmentsPrunedByValue": 2295,
  "explainPlanNumEmptyFilterSegments": 0,
  "explainPlanNumMatchAllFilterSegments": 0,
  "numRowsResultSet": 10
}
```

Basically, when querying these two fields, I'm not getting the epoch number that I expect, but rather a naive conversion to yyyy-MM-dd hh:mm:ss.sssssZ. Shouldn't the SQL API check the format and, based on that, either apply transformations or return the raw value?

My workaround is ugly, but oh well. Use timeConvert as follows:

```sql
SELECT timeConvert(__key, 'MICROSECONDS', 'MICROSECONDS') AS epoch FROM table
```

which gives me the expected info:

[image: another table result showing the correct epoch values]

Let me know if I should close this. Thanks in advance!

Jackie-Jiang commented 7 months ago

Pinot's TIMESTAMP data type takes millis since epoch, and 56202-05-11 12:33:13.431 is the corresponding timestamp if we treat the micros as millis. The workaround is to store the value as LONG.
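To make that concrete, here is a minimal standalone sketch (the literal is just the minConsumingFreshnessTimeMs from the response above scaled to microseconds, and the class name is made up for illustration) showing how a microsecond epoch fed into a millisecond-based timestamp ends up around the year 56200:

```java
import java.sql.Timestamp;

public class MicrosAsMillis {
    public static void main(String[] args) {
        // ~2024-03-27, expressed as MICROseconds since epoch (illustrative value)
        long micros = 1711547747582000L;

        // java.sql.Timestamp, like Pinot's TIMESTAMP type, expects MILLIseconds.
        // Feeding it the microsecond value pushes the date ~1000x into the future,
        // which is why the rows above render around the year 56200.
        System.out.println(new Timestamp(micros));

        // Dividing by 1000 first recovers the intended instant in March 2024
        // (the exact string depends on the JVM's default timezone).
        System.out.println(new Timestamp(micros / 1000));
    }
}
```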

hpvd commented 6 months ago

@estebanz01 does this hint solve your issue so that this can be closed?

estebanz01 commented 6 months ago

Hi @hpvd, it did solve my issue, but I don't think this issue should be closed. Shouldn't the format field tell the engine how to render the info properly?

Jackie-Jiang commented 6 months ago

The TIMESTAMP data type has an implicit format of yyyy-MM-dd HH:mm:ss.SSS or millis since epoch.

I think the confusion comes from the fact that both the format and granularity fields within DateTimeFieldSpec are for explanation purposes rather than conversion purposes, i.e. if the value doesn't align with the spec, Pinot won't automatically convert it. We should consider adding a feature to automatically convert the value when it doesn't align with the spec. cc @snleee @swaminathanmanish
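As a rough illustration of what such a feature could do (this helper is hypothetical, not an existing Pinot API), the conversion would essentially use the TimeUnit declared in the format string to normalize the raw value to the millis that TIMESTAMP stores:

```java
import java.util.concurrent.TimeUnit;

public class EpochNormalizerSketch {
    // Hypothetical helper: normalize a raw epoch value whose unit is declared
    // in the field spec (e.g. "1:MICROSECONDS:EPOCH") to millis since epoch.
    static long toEpochMillis(long rawValue, TimeUnit declaredUnit) {
        return declaredUnit.toMillis(rawValue);
    }

    public static void main(String[] args) {
        long micros = 1711547747582000L; // microsecond epoch, as in the schema above
        // Prints 1711547747582, i.e. the millis value TIMESTAMP actually expects.
        System.out.println(toEpochMillis(micros, TimeUnit.MICROSECONDS));
    }
}
```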