apache / parquet-java

Apache Parquet Java
https://parquet.apache.org/
Apache License 2.0

GH-2994: Optimize string to binary conversion in AvroWriteSupport #2995

Closed sschepens closed 2 months ago

sschepens commented 3 months ago

Rationale for this change

Binary.fromCharSequence is an order of magnitude slower than Binary.fromString when the input is a String (roughly 12x in the benchmark below):

Benchmarks.fromCharSequence  thrpt   25   5885347.328 ±  186669.738  ops/s
Benchmarks.fromString        thrpt   25  71335979.492 ± 8800704.044  ops/s

Here is the code for the benchmarks:

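// Package for RandomStringUtils assumed to be commons-lang3; adjust if the project uses another version.
import org.apache.commons.lang3.RandomStringUtils;
import org.apache.parquet.io.api.Binary;
import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.BenchmarkMode;
import org.openjdk.jmh.annotations.Mode;
import org.openjdk.jmh.infra.Blackhole;

public class Benchmarks {
    // A single 100-character alphanumeric input shared by both benchmarks.
    private static final String string = RandomStringUtils.randomAlphanumeric(100);

    // Encodes the String through the generic CharSequence entry point.
    @Benchmark
    @BenchmarkMode(Mode.Throughput)
    public void fromCharSequence(Blackhole blackhole) {
        blackhole.consume(Binary.fromCharSequence(string));
    }

    // Encodes the same String through the String-specific entry point.
    @Benchmark
    @BenchmarkMode(Mode.Throughput)
    public void fromString(Blackhole blackhole) {
        blackhole.consume(Binary.fromString(string));
    }
}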

What changes are included in this PR?

Change AvroWriteSupport.fromAvroString() to use Binary.fromString when operating with string inputs.
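
To illustrate the idea, here is a minimal, stdlib-only sketch of dispatching on the runtime type before encoding to UTF-8. It is not the code in AvroWriteSupport or Binary: the class name Utf8Dispatch and the helper toUtf8 are hypothetical. It also rests on an assumption, that the speedup comes from String having a dedicated UTF-8 encoding path while other CharSequence inputs go through a generic CharsetEncoder.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical illustration; not part of parquet-java.
class Utf8Dispatch {

    // Encodes a CharSequence as UTF-8, taking a String-specific path when possible.
    static byte[] toUtf8(CharSequence value) {
        if (value instanceof String) {
            // String path: encode directly with the JDK's String.getBytes.
            return ((String) value).getBytes(StandardCharsets.UTF_8);
        }
        try {
            // Generic path: wrap the CharSequence and run a CharsetEncoder over it.
            ByteBuffer buf = StandardCharsets.UTF_8.newEncoder().encode(CharBuffer.wrap(value));
            byte[] out = new byte[buf.remaining()];
            buf.get(out);
            return out;
        } catch (CharacterCodingException e) {
            // The strict encoder rejects malformed input such as unpaired surrogates.
            throw new IllegalArgumentException("input is not valid UTF-16", e);
        }
    }

    public static void main(String[] args) {
        byte[] fromString = toUtf8("parquet");
        byte[] fromBuilder = toUtf8(new StringBuilder("parquet"));
        // Both paths produce the same bytes for well-formed input.
        System.out.println(Arrays.equals(fromString, fromBuilder)); // prints "true"
    }
}
```

For well-formed input the two branches return identical bytes, so the type check only changes how the encoding is done, not its result.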

Are these changes tested?

The existing tests should cover the change.

Are there any user-facing changes?

No

Closes #2994