apache / pinot

Apache Pinot - A realtime distributed OLAP datastore
https://pinot.apache.org/
Apache License 2.0
5.27k stars 1.23k forks source link

Stream ingestion Kafka to Pinot error invalid utf-8 start byte #10931

Open AnhQuanTran opened 1 year ago

AnhQuanTran commented 1 year ago

Hi everyone,

Pinot version: 0.12.1 Java JDK: 11

I tried to ingestion data realtime from kafka to pinot with 2 cases: Case 1: It work like charm {"event":{"header": "v1","body":{"gender":"M", "region":"London"}}}

Case 2: Error But when i use data contain some special character like this "Â" (Vietnamese word)

[{"event":{"header": "v1","body":{"gender":"M", "region":"Âmerica"}}}](url)

I faced an error

org.apache.pinot.shaded.com.fasterxml.jackson.core.JsonParseException: Invalid UTF-8 start byte 0x96

It's seem problem cause by com.fasterxml.jackson library. How to resolve it? Thank you so much

Detail exeption

org.apache.pinot.shaded.com.fasterxml.jackson.core.JsonParseException: Invalid UTF-8 start byte 0x96
at [Source: (ByteArrayInputStream); line: 1, column: 410]
    at org.apache.pinot.shaded.com.fasterxml.jackson.core.JsonParser._constructError(JsonParser.java:2337) ~[pinot-all-0.12.1-jar-with-dependencies.jar:0.12.1-6e235a4ec2a16006337da04e118a435b5bb8f6d8]
    at org.apache.pinot.shaded.com.fasterxml.jackson.core.base.ParserMinimalBase._reportError(ParserMinimalBase.java:710) ~[pinot-all-0.12.1-jar-with-dependencies.jar:0.12.1-6e235a4ec2a16006337da04e118a435b5bb8f6d8]
    at org.apache.pinot.shaded.com.fasterxml.jackson.core.json.UTF8StreamJsonParser._reportInvalidInitial(UTF8StreamJsonParser.java:3607) ~[pinot-all-0.12.1-jar-with-dependencies.jar:0.12.1-6e235a4ec2a16006337da04e118a435b5bb8f6d8]
    at org.apache.pinot.shaded.com.fasterxml.jackson.core.json.UTF8StreamJsonParser._reportInvalidChar(UTF8StreamJsonParser.java:3603) ~[pinot-all-0.12.1-jar-with-dependencies.jar:0.12.1-6e235a4ec2a16006337da04e118a435b5bb8f6d8]
    at org.apache.pinot.shaded.com.fasterxml.jackson.core.json.UTF8StreamJsonParser._finishString2(UTF8StreamJsonParser.java:2545) ~[pinot-all-0.12.1-jar-with-dependencies.jar:0.12.1-6e235a4ec2a16006337da04e118a435b5bb8f6d8]
    at org.apache.pinot.shaded.com.fasterxml.jackson.core.json.UTF8StreamJsonParser._finishAndReturnString(UTF8StreamJsonParser.java:2471) ~[pinot-all-0.12.1-jar-with-dependencies.jar:0.12.1-6e235a4ec2a16006337da04e118a435b5bb8f6d8]
    at org.apache.pinot.shaded.com.fasterxml.jackson.core.json.UTF8StreamJsonParser.getText(UTF8StreamJsonParser.java:302) ~[pinot-all-0.12.1-jar-with-dependencies.jar:0.12.1-6e235a4ec2a16006337da04e118a435b5bb8f6d8]
    at org.apache.pinot.shaded.com.fasterxml.jackson.databind.deser.std.BaseNodeDeserializer.deserializeObject(JsonNodeDeserializer.java:286) ~[pinot-all-0.12.1-jar-with-dependencies.jar:0.12.1-6e235a4ec2a16006337da04e118a435b5bb8f6d8]
    at org.apache.pinot.shaded.com.fasterxml.jackson.databind.deser.std.JsonNodeDeserializer.deserialize(JsonNodeDeserializer.java:69) ~[pinot-all-0.12.1-jar-with-dependencies.jar:0.12.1-6e235a4ec2a16006337da04e118a435b5bb8f6d8]
    at org.apache.pinot.shaded.com.fasterxml.jackson.databind.deser.std.JsonNodeDeserializer.deserialize(JsonNodeDeserializer.java:16) ~[pinot-all-0.12.1-jar-with-dependencies.jar:0.12.1-6e235a4ec2a16006337da04e118a435b5bb8f6d8]
    at org.apache.pinot.shaded.com.fasterxml.jackson.databind.deser.DefaultDeserializationContext.readRootValue(DefaultDeserializationContext.java:322) ~[pinot-all-0.12.1-jar-with-dependencies.jar:0.12.1-6e235a4ec2a16006337da04e118a435b5bb8f6d8]
    at org.apache.pinot.shaded.com.fasterxml.jackson.databind.ObjectReader._bindAsTree(ObjectReader.java:2070) ~[pinot-all-0.12.1-jar-with-dependencies.jar:0.12.1-6e235a4ec2a16006337da04e118a435b5bb8f6d8]
    at org.apache.pinot.shaded.com.fasterxml.jackson.databind.ObjectReader._bindAndCloseAsTree(ObjectReader.java:2044) ~[pinot-all-0.12.1-jar-with-dependencies.jar:0.12.1-6e235a4ec2a16006337da04e118a435b5bb8f6d8]
    at org.apache.pinot.shaded.com.fasterxml.jackson.databind.ObjectReader.readTree(ObjectReader.java:1739) ~[pinot-all-0.12.1-jar-with-dependencies.jar:0.12.1-6e235a4ec2a16006337da04e118a435b5bb8f6d8]
    at org.apache.pinot.spi.utils.JsonUtils.bytesToJsonNode(JsonUtils.java:211) ~[pinot-all-0.12.1-jar-with-dependencies.jar:0.12.1-6e235a4ec2a16006337da04e118a435b5bb8f6d8]
    at org.apache.pinot.plugin.inputformat.json.JSONMessageDecoder.decode(JSONMessageDecoder.java:61) [pinot-all-0.12.1-jar-with-dependencies.jar:0.12.1-6e235a4ec2a16006337da04e118a435b5bb8f6d8]
    at org.apache.pinot.plugin.inputformat.json.JSONMessageDecoder.decode(JSONMessageDecoder.java:73) [pinot-all-0.12.1-jar-with-dependencies.jar:0.12.1-6e235a4ec2a16006337da04e118a435b5bb8f6d8]
    at org.apache.pinot.plugin.inputformat.json.JSONMessageDecoder.decode(JSONMessageDecoder.java:37) [pinot-all-0.12.1-jar-with-dependencies.jar:0.12.1-6e235a4ec2a16006337da04e118a435b5bb8f6d8]
    at org.apache.pinot.spi.stream.StreamDataDecoderImpl.decode(StreamDataDecoderImpl.java:47) [pinot-all-0.12.1-jar-with-dependencies.jar:0.12.1-6e235a4ec2a16006337da04e118a435b5bb8f6d8]
    at org.apache.pinot.core.data.manager.realtime.LLRealtimeSegmentDataManager.processStreamEvents(LLRealtimeSegmentDataManager.java:549) [pinot-all-0.12.1-jar-with-dependencies.jar:0.12.1-6e235a4ec2a16006337da04e118a435b5bb8f6d8]
    at org.apache.pinot.core.data.manager.realtime.LLRealtimeSegmentDataManager.consumeLoop(LLRealtimeSegmentDataManager.java:434) [pinot-all-0.12.1-jar-with-dependencies.jar:0.12.1-6e235a4ec2a16006337da04e118a435b5bb8f6d8]
    at org.apache.pinot.core.data.manager.realtime.LLRealtimeSegmentDataManager$PartitionConsumer.run(LLRealtimeSegmentDataManager.java:629) [pinot-all-0.12.1-jar-with-dependencies.jar:0.12.1-6e235a4ec2a16006337da04e118a435b5bb8f6d8]
    at java.lang.Thread.run(Thread.java:834) [?:?]
AnhQuanTran commented 1 year ago

@mayankshriv Can u help me, bro?

xiangfu0 commented 1 year ago

I think the issue is that your json is not utf-8 encoded correctly and pinot is using utf-8 to read the bytes. image