apache / orc

Apache ORC - the smallest, fastest columnar storage for Hadoop workloads
https://orc.apache.org/
Apache License 2.0

[Java] Different semantic of lengths for CHAR(n) with C++ #1973

Open SiasDoming opened 3 months ago

SiasDoming commented 3 months ago

I'm migrating from Core-C++ to Core-Java. While reading data of type CHAR(n), I found that BytesColumnVector.length in Java has different semantics from StringVectorBatch.length in C++. In Java, because of the following code, it is the number of bytes after the trailing padding blanks are trimmed, while length in C++ is the total number of bytes including the padding blanks. For example, reading the value 'ABC' from a CHAR(10) column yields a length of 3 in Java but 10 in C++. I'm wondering why trimmed lengths are preferred in Java.

PS: Either implementation may be acceptable to you, as long as the semantics are the same across the APIs of the different programming languages, but I have to say that the 'redundant' processing in Java did annoy me. I have to reallocate a byte array and pad the bytes again manually for further use. The trimmed lengths also prevent me from using a direct memory copy (although this is still achievable if I'm willing to depend on the internal implementation).

  public static class CharTreeReader extends StringTreeReader {
  ...
    @Override
    public void nextVector(ColumnVector previousVector,
                           boolean[] isNull,
                           final int batchSize,
                           FilterContext filterContext,
                           ReadPhase readPhase) throws IOException {
      ...
        // TreeReaderFactory.java:2474
        // TreeReaderFactory.java:2483
        // TreeReaderFactory.java:2493
        adjustedDownLen = StringExpr
            .rightTrimAndTruncate(result.vector[i], result.start[i], result.length[i], maxLength);
        if (adjustedDownLen < result.length[i]) {
          result.setRef(i, result.vector[i], result.start[i], adjustedDownLen);
        }
      ...
    }
  }

ffacs commented 3 months ago

Hi @SiasDoming, it seems there was a proposal in 2015 to provide an option for this, but it has not been implemented yet. FYI: https://issues.apache.org/jira/browse/ORC-35
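
Until such an option exists, the padded form can be rebuilt on the caller's side. The following is only a minimal sketch of the manual re-padding described in the issue, not part of the ORC API: the class name `CharPadding` and the helper `padToCharLength` are hypothetical, and it assumes that CHAR(n) counts characters (UTF-8 code points), so the number of blanks to append is n minus the number of code points in the trimmed value.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical helper, not part of ORC: restores the blank padding that the
// Java reader trims from CHAR(n) values.
public class CharPadding {

    // Counts UTF-8 code points in buf[start, start + length) by skipping
    // continuation bytes (those of the form 10xxxxxx).
    public static int codePointCount(byte[] buf, int start, int length) {
        int count = 0;
        for (int i = start; i < start + length; i++) {
            if ((buf[i] & 0xC0) != 0x80) {
                count++;
            }
        }
        return count;
    }

    // Copies the trimmed value into a new array and appends ASCII blanks
    // until it holds maxLength characters. Assumes maxLength is the n of CHAR(n).
    public static byte[] padToCharLength(byte[] buf, int start, int length, int maxLength) {
        int pad = Math.max(0, maxLength - codePointCount(buf, start, length));
        byte[] out = new byte[length + pad];
        System.arraycopy(buf, start, out, 0, length);
        Arrays.fill(out, length, out.length, (byte) ' ');
        return out;
    }

    public static void main(String[] args) {
        // Simulates a shared buffer where the value 'ABC' starts at offset 2.
        byte[] raw = "xxABCyy".getBytes(StandardCharsets.UTF_8);
        byte[] padded = padToCharLength(raw, 2, 3, 10);
        System.out.println(padded.length); // prints 10
        System.out.println("[" + new String(padded, StandardCharsets.UTF_8) + "]");
    }
}
```

With a real reader, the arguments would presumably come from `result.vector[i]`, `result.start[i]`, and `result.length[i]` of the `BytesColumnVector`, with n taken from the column's `TypeDescription.getMaxLength()`. This still allocates a new array per value, which is exactly the extra cost the issue complains about.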