
ORC-1697: Fix IllegalArgumentException when reading json timestamp type in benchmark #1902

Open cxzl25 opened 2 months ago

cxzl25 commented 2 months ago

What changes were proposed in this pull request?

This PR aims to fix IllegalArgumentException when reading the JSON timestamp type in the benchmark.

Why are the changes needed?

ORC-1191 switched the taxi dataset from CSV to Parquet, so the benchmark now reads the Parquet timestamp columns. Those values are stored in microseconds, while Java's java.sql.Timestamp constructor expects milliseconds.

Taxi source Parquet metadata:

  optional int64 tpep_pickup_datetime (TIMESTAMP(MICROS,false));
  optional int64 tpep_dropoff_datetime (TIMESTAMP(MICROS,false));

When we write the data to JSON and then run the scan command, we get the following error.

java -jar core/target/orc-benchmarks-core-*-uber.jar scan data -format json
Exception in thread "main" java.lang.IllegalArgumentException: Timestamp format must be yyyy-mm-dd hh:mm:ss[.fffffffff]
    at java.sql/java.sql.Timestamp.valueOf(Timestamp.java:224)
    at org.apache.orc.bench.core.convert.json.JsonReader$TimestampColumnConverter.convert(JsonReader.java:175)
    at org.apache.orc.bench.core.convert.json.JsonReader.nextBatch(JsonReader.java:86)
    at org.apache.orc.bench.core.convert.ScanVariants.run(ScanVariants.java:92)
    at org.apache.orc.bench.core.Driver.main(Driver.java:64)
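
As a side note, here is a minimal, self-contained illustration (not the benchmark's actual code path) of how strict java.sql.Timestamp.valueOf is about its input: anything outside the JDBC escape format yyyy-mm-dd hh:mm:ss[.fffffffff], for example an ISO-8601 string with a 'T' separator, fails with exactly this exception.

import java.sql.Timestamp;

public class TimestampValueOfDemo {
  public static void main(String[] args) {
    // Accepted: the JDBC escape format "yyyy-mm-dd hh:mm:ss[.fffffffff]".
    System.out.println(Timestamp.valueOf("2015-11-01 09:24:39"));

    // Rejected: there is no space between the date and time parts, so valueOf throws
    // IllegalArgumentException: Timestamp format must be yyyy-mm-dd hh:mm:ss[.fffffffff]
    System.out.println(Timestamp.valueOf("2015-11-01T09:24:39"));
  }
}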

If we use orc-tools to dump the generated ORC file metadata, the timestamp data is also incorrect.

    Column 2: count: 2053120 hasNull: false bytesOnDisk: 8113763 min: 47802-07-26 08:00:00.0 max: 47817-09-26 23:43:20.0
    Column 3: count: 2053120 hasNull: false bytesOnDisk: 8461151 min: 47802-07-26 08:00:00.0 max: 48731-09-12 15:43:20.0

If we use parquet-cli to dump the metadata of the generated Parquet file, we see the same problem.

  optional int64 tpep_pickup_datetime (TIMESTAMP(MILLIS,true));
  optional int64 tpep_dropoff_datetime (TIMESTAMP(MILLIS,true));

tpep_pickup_datetime   INT64     Z _ R_ F  9170100   3.04 B     0       "+47802-07-26T00:00:00.000..." / "+47867-01-07T15:43:20.000..."
tpep_dropoff_datetime  INT64     Z _ R_ F  9170100   3.20 B     0       "+47802-07-26T00:00:00.000..." / "+48750-04-12T01:40:00.000..."
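
The five-digit years are consistent with a factor-of-1000 error: treating an epoch value in microseconds as if it were milliseconds stretches the elapsed time by 1000, so a timestamp from late 2015 (about 45.8 years after 1970) is pushed roughly 45,800 years past 1970, landing around year 47800, which matches the corrupted min/max values above.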

https://github.com/apache/orc/blob/952b4792f20eaf1bb63c0eb7319e03b9c3d7a3f1/java/bench/core/src/java/org/apache/orc/bench/core/convert/avro/AvroSchemaUtils.java#L92-L95


System.out.println(new Timestamp(1446341079000000L));        // microseconds passed as milliseconds
System.out.println(new Timestamp(1446341079000000L / 1000)); // converted to milliseconds first

output

47802-09-23 02:50:00.0
2015-11-01 09:24:39.0
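
For reference, here is a minimal sketch (not the exact change in this PR; the helper name fromMicros is made up for illustration) of converting an epoch value in microseconds into a java.sql.Timestamp without dropping the sub-millisecond part.

import java.sql.Timestamp;

public class MicrosToTimestampSketch {
  // Hypothetical helper, not this PR's actual diff: build a Timestamp from
  // microseconds since the epoch while preserving microsecond precision.
  static Timestamp fromMicros(long micros) {
    long millis = Math.floorDiv(micros, 1_000L);                            // whole milliseconds
    int nanosOfSecond = (int) (Math.floorMod(micros, 1_000_000L) * 1_000L); // micros within the second, as nanos
    Timestamp ts = new Timestamp(millis);
    ts.setNanos(nanosOfSecond);
    return ts;
  }

  public static void main(String[] args) {
    // Prints 2015-11-01 09:24:39.0 when the JVM default time zone is UTC+8;
    // Timestamp.toString() renders in the default time zone.
    System.out.println(fromMicros(1446341079000000L));
  }
}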

How was this patch tested?

Local test:

java -jar core/target/orc-benchmarks-core-*-uber.jar scan data -format json

output

data/generated/taxi/json.snappy rows: 22758236 batches: 22225
data/generated/taxi/json.gz rows: 22758236 batches: 22225
data/generated/sales/json.snappy rows: 25000000 batches: 24415
data/generated/sales/json.gz rows: 25000000 batches: 24415
data/generated/github/json.snappy rows: 10489642 batches: 10244
data/generated/github/json.gz rows: 10489642 batches: 10244

Was this patch authored or co-authored using generative AI tooling?

No

cxzl25 commented 2 months ago

@dongjoon-hyun @wgtmac Could you help review this PR? Thanks in advance!