apache / pinot

Apache Pinot - A realtime distributed OLAP datastore
https://pinot.apache.org/
Apache License 2.0

Support for Avro logical types in realtime and offline tables #7357

Closed ddcprg closed 2 years ago

ddcprg commented 3 years ago

It seems like Avro logical types are not supported in the Avro message decoders.

I have been browsing through the code and checked past issues, and I have not found any mention of Avro logical types.

I will be happy to open a PR when I work out the best place and how to add these changes.

More details about this change:

Avro supports logical types (more details in the Avro spec docs). A subject schema can therefore be defined as:

{
...
    {
      "name": "amount",
      "type": {
        "logicalType": "decimal",
        "precision": 64,
        "scale": 2,
        "type": "bytes"
      }
    },
...
}

The Pinot Avro decoder does not support logical types at the moment. Therefore, representing this value as a String column results in a wrong representation of the value, and representing it as a Float/Double column results in an exception:

Table schema:

{
  "schemaName": "some_table",
  ...
  "metricFieldSpecs": [
    ...
    {
      "name": "amount",
      "dataType": "DOUBLE",
      "defaultNullValue": 0
    },
    ...
  ],
  ...
}

Exception:

Caught exception while transforming the record: {
  "fieldToValueMap" : {
    ...
    "amount" : "abc=",
    ...
  },
  "nullValueFields" : [ ]
}
java.lang.RuntimeException: Caught exception while transforming data type for column: amount
    at org.apache.pinot.segment.local.recordtransformer.DataTypeTransformer.transform(DataTypeTransformer.java:95) ~[pinot-all-0.8.0-SNAPSHOT-jar-with-dependencies.jar:0.8.0-SNAPSHOT-bb34406aaa205f5c85b88c928d477fd267eda1b4]
    at org.apache.pinot.segment.local.recordtransformer.CompositeTransformer.transform(CompositeTransformer.java:83) ~[pinot-all-0.8.0-SNAPSHOT-jar-with-dependencies.jar:0.8.0-SNAPSHOT-bb34406aaa205f5c85b88c928d477fd267eda1b4]
    at org.apache.pinot.core.data.manager.realtime.LLRealtimeSegmentDataManager.processStreamEvents(LLRealtimeSegmentDataManager.java:514) [pinot-all-0.8.0-SNAPSHOT-jar-with-dependencies.jar:0.8.0-SNAPSHOT-bb34406aaa205f5c85b88c928d477fd267eda1b4]
    at org.apache.pinot.core.data.manager.realtime.LLRealtimeSegmentDataManager.consumeLoop(LLRealtimeSegmentDataManager.java:417) [pinot-all-0.8.0-SNAPSHOT-jar-with-dependencies.jar:0.8.0-SNAPSHOT-bb34406aaa205f5c85b88c928d477fd267eda1b4]
    at org.apache.pinot.core.data.manager.realtime.LLRealtimeSegmentDataManager$PartitionConsumer.run(LLRealtimeSegmentDataManager.java:564) [pinot-all-0.8.0-SNAPSHOT-jar-with-dependencies.jar:0.8.0-SNAPSHOT-bb34406aaa205f5c85b88c928d477fd267eda1b4]
    at java.lang.Thread.run(Thread.java:829) [?:?]
Caused by: java.lang.UnsupportedOperationException: Cannot convert value from BYTES to DOUBLE
    at org.apache.pinot.common.utils.PinotDataType$12.toDouble(PinotDataType.java:617) ~[pinot-all-0.8.0-SNAPSHOT-jar-with-dependencies.jar:0.8.0-SNAPSHOT-bb34406aaa205f5c85b88c928d477fd267eda1b4]
    at org.apache.pinot.common.utils.PinotDataType$8.convert(PinotDataType.java:429) ~[pinot-all-0.8.0-SNAPSHOT-jar-with-dependencies.jar:0.8.0-SNAPSHOT-bb34406aaa205f5c85b88c928d477fd267eda1b4]
    at org.apache.pinot.common.utils.PinotDataType$8.convert(PinotDataType.java:386) ~[pinot-all-0.8.0-SNAPSHOT-jar-with-dependencies.jar:0.8.0-SNAPSHOT-bb34406aaa205f5c85b88c928d477fd267eda1b4]
    at org.apache.pinot.segment.local.recordtransformer.DataTypeTransformer.transform(DataTypeTransformer.java:90) ~[pinot-all-0.8.0-SNAPSHOT-jar-with-dependencies.jar:0.8.0-SNAPSHOT-bb34406aaa205f5c85b88c928d477fd267eda1b4]
    ... 5 more

Ideally Pinot should be able to perform automatic logical type conversion if the type details are provided.
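
To illustrate the conversion being asked for, here is a minimal sketch that uses only the Java standard library. It relies on the Avro spec's definition of the `decimal` logical type: the bytes hold the big-endian two's-complement unscaled integer, and the scale comes from the schema. The class and method names (`AvroDecimalSketch`, `decodeDecimal`, `encodeDecimal`) are hypothetical and are not part of Pinot or Avro. This is not how Pinot implements it, only an outline of the decoding step.

```java
import java.math.BigDecimal;
import java.math.BigInteger;
import java.nio.ByteBuffer;

// Hypothetical helper, for illustration only (not Pinot or Avro code).
public class AvroDecimalSketch {

    // Decodes an Avro "decimal" payload: the bytes are the big-endian
    // two's-complement unscaled value; the scale is taken from the schema.
    public static BigDecimal decodeDecimal(ByteBuffer payload, int scale) {
        byte[] unscaled = new byte[payload.remaining()];
        // duplicate() leaves the caller's buffer position untouched
        payload.duplicate().get(unscaled);
        return new BigDecimal(new BigInteger(unscaled), scale);
    }

    // Encodes a value the way a producer would for the schema above.
    // setScale(scale) throws ArithmeticException if rounding would be needed.
    public static ByteBuffer encodeDecimal(BigDecimal value, int scale) {
        return ByteBuffer.wrap(value.setScale(scale).unscaledValue().toByteArray());
    }

    public static void main(String[] args) {
        int scale = 2; // matches "scale": 2 in the example schema
        ByteBuffer payload = encodeDecimal(new BigDecimal("12.34"), scale);
        BigDecimal amount = decodeDecimal(payload, scale);
        System.out.println(amount);               // prints 12.34
        System.out.println(amount.doubleValue()); // value usable for a DOUBLE column
    }
}
```

Without the schema's scale, the raw bytes cannot be interpreted as a number, which is why treating the field as plain BYTES and then converting to DOUBLE fails.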

xiangfu0 commented 3 years ago

You can find the Avro reader here:

https://github.com/apache/pinot/blob/c907dca917b208cade9d46d2e0804c335901d9b2/pinot-plugins/pinot-input-format/pinot-avro-base/src/main/java/org/apache/pinot/plugin/inputformat/avro/AvroSchemaUtil.java#L29

Looking forward to it, and thanks for your contribution!

ddcprg commented 3 years ago

Thank you for your prompt response! I'm looking into this issue now.

mayankshriv commented 3 years ago

Thanks for taking this up @ddcprg. Do you mind sharing why you would want to limit this to Realtime tables only? IMHO, we should do this in a generic way at record reader level?

ddcprg commented 3 years ago

Hi @mayankshriv, I initially thought of realtime tables and not the import jobs because I'm not very familiar with the source code yet. I can extend the PR to include the record reader as well.

ddcprg commented 3 years ago

@mayankshriv, my PR should cover both realtime tables and import jobs. Please let me know if this is not the case.