apache / orc

Apache ORC - the smallest, fastest columnar storage for Hadoop workloads
https://orc.apache.org/
Apache License 2.0

ORC-1697: Fix IllegalArgumentException when reading json timestamp type in benchmark #1930

Open cxzl25 opened 1 month ago

cxzl25 commented 1 month ago

What changes were proposed in this pull request?

This PR aims to fix the IllegalArgumentException thrown when the benchmark reads JSON data containing timestamp columns.

When writing and reading JSON, timestamp values are now converted to a long instead of a string.

Why are the changes needed?

ORC-1191 switched the taxi data set from CSV to Parquet, so the benchmark now reads timestamps in Parquet's format. Those values are stored in microseconds, whereas Java's java.sql.Timestamp constructor expects milliseconds.

Parquet metadata of the taxi source file:

  optional int64 tpep_pickup_datetime (TIMESTAMP(MICROS,false));
  optional int64 tpep_dropoff_datetime (TIMESTAMP(MICROS,false));
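
For illustration, a minimal sketch of how an epoch value in microseconds could be turned into a java.sql.Timestamp without being misread as milliseconds. The class name `MicrosToTimestamp` and the helper `fromMicros` are hypothetical and are not part of the benchmark code or of this patch.

```java
import java.sql.Timestamp;

public class MicrosToTimestamp {
    // Hypothetical helper: converts epoch microseconds to a java.sql.Timestamp.
    // floorDiv/floorMod keep the arithmetic correct for values before 1970.
    static Timestamp fromMicros(long micros) {
        long seconds = Math.floorDiv(micros, 1_000_000L);
        int nanos = ((int) Math.floorMod(micros, 1_000_000L)) * 1_000;
        // The constructor takes milliseconds; the sub-second part is set separately.
        Timestamp ts = new Timestamp(seconds * 1_000L);
        ts.setNanos(nanos);
        return ts;
    }

    public static void main(String[] args) {
        long micros = 1446341079000000L; // sample value used in this PR description
        Timestamp ts = fromMicros(micros);
        System.out.println(ts.getTime());   // epoch milliseconds: 1446341079000
        System.out.println(ts.toInstant()); // time-zone independent: 2015-11-01T01:24:39Z
    }
}
```

Printing the result through `toInstant()` rather than `toString()` keeps the output independent of the JVM's default time zone.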

Writing this data to JSON and then running the scan command produces the following error:

java -jar core/target/orc-benchmarks-core-*-uber.jar scan data -format json
Exception in thread "main" java.lang.IllegalArgumentException: Timestamp format must be yyyy-mm-dd hh:mm:ss[.fffffffff]
    at java.sql/java.sql.Timestamp.valueOf(Timestamp.java:224)
    at org.apache.orc.bench.core.convert.json.JsonReader$TimestampColumnConverter.convert(JsonReader.java:175)
    at org.apache.orc.bench.core.convert.json.JsonReader.nextBatch(JsonReader.java:86)
    at org.apache.orc.bench.core.convert.ScanVariants.run(ScanVariants.java:92)
    at org.apache.orc.bench.core.Driver.main(Driver.java:64)

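The cause is that JSON timestamp values are written via java.sql.Timestamp#toString and read back via java.sql.Timestamp#valueOf. Because the microsecond value is interpreted as milliseconds, the resulting year has five digits, which valueOf rejects: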

    Timestamp ts = new Timestamp(1446341079000000L);
    System.out.println(ts);
    System.out.println(Timestamp.valueOf(ts.toString()));
47802-09-23 02:50:00.0
Exception in thread "main" java.lang.IllegalArgumentException: Timestamp format must be yyyy-mm-dd hh:mm:ss[.fffffffff]
    at java.sql.Timestamp.valueOf(Timestamp.java:237)
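
The difference between the two encodings can be sketched as follows. This is an illustrative comparison only, not the code changed by this patch; the class `TimestampRoundTrip` and both helper methods are hypothetical names. It assumes the long encoding is epoch milliseconds as returned by `getTime()`, which drops any sub-millisecond precision.

```java
import java.sql.Timestamp;

public class TimestampRoundTrip {
    // Round trip through the string form, as the JSON path did before this change.
    static boolean stringRoundTripWorks(Timestamp ts) {
        try {
            return Timestamp.valueOf(ts.toString()).equals(ts);
        } catch (IllegalArgumentException e) {
            // valueOf only accepts yyyy-mm-dd hh:mm:ss[.fffffffff]
            return false;
        }
    }

    // Round trip through a long (assumed here to be epoch milliseconds).
    static boolean longRoundTripWorks(Timestamp ts) {
        long encoded = ts.getTime();
        return new Timestamp(encoded).equals(ts);
    }

    public static void main(String[] args) {
        // Microsecond value passed where milliseconds are expected,
        // which yields a year far beyond 9999.
        Timestamp ts = new Timestamp(1446341079000000L);
        System.out.println("string round trip: " + stringRoundTripWorks(ts));
        System.out.println("long round trip:   " + longRoundTripWorks(ts));
    }
}
```

With the sample value, the string round trip fails while the long round trip succeeds, because a long carries no assumption about the number of digits in the year.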

How was this patch tested?

Tested locally with the following commands:

java -jar core/target/orc-benchmarks-core-*-uber.jar generate data -format json -data taxi -compress snappy
java -jar core/target/orc-benchmarks-core-*-uber.jar scan data -format json -data taxi -compress snappy

Was this patch authored or co-authored using generative AI tooling?

No