apache / parquet-java

Apache Parquet Java
https://parquet.apache.org/

Inconsistency between the vectorized-processing boundary calculation in the benchmark and the readBatchUsing512Vector calculation in the ParquetReadRouter class? #3073

Open 1111nit opened 1 week ago

1111nit commented 1 week ago

Hello, I have recently become interested in using the Vector API for Parquet bit-packing decoding. While researching the code, I found that ByteBitPackingVectorBenchmarks.java, the official benchmark in the parquet-plugins-benchmarks folder, computes the vectorized boundary as totalByteCountVector = totalBytesCount - inputByteCountPerVector; once this boundary is exceeded, unpack8Values() decodes the remaining data, which guarantees there is enough room for a full vector operation at the end. But readBatchUsing512Vector() uses totalByteCountVector = totalBytesCount - BYTES_PER_VECTOR_512;. I am wondering whether this affects throughput and the choice of decoding method for different bit widths.
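To make the difference concrete, here is a minimal sketch of the decode-loop shape shared by both; the method names and loop structure are illustrative only, not the actual parquet-java code:

```java
// Sketch only: the loop shape shared by the benchmark and the router,
// with the single difference being the boundary computation.
class DecodeBoundarySketch {
  static final int BYTES_PER_VECTOR_512 = 64; // one 512-bit load reads 64 bytes

  static void decode(byte[] in, int totalBytesCount, int bitWidth,
                     int inputByteCountPerVector, boolean benchmarkBoundary) {
    int totalByteCountVector = benchmarkBoundary
        ? totalBytesCount - inputByteCountPerVector  // benchmark: reserve one input step
        : totalBytesCount - BYTES_PER_VECTOR_512;    // router: reserve a whole 64-byte load

    int byteIndex = 0;
    while (byteIndex < totalByteCountVector) {
      // unpackValuesUsingVector(in, byteIndex, ...);  // 512-bit SIMD path
      byteIndex += inputByteCountPerVector;
    }
    while (byteIndex < totalBytesCount) {
      // unpack8Values(in, byteIndex, ...);            // scalar tail, 8 values per call
      byteIndex += bitWidth;                           // 8 values occupy bitWidth bytes
    }
  }
}
```

With that picture in mind, my questions are as follows: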

  1. With the same amount of data, the number of vectorized decode calls is reduced; won't this weaken the optimization effect when combined with Spark? For example, take bitWidth = 3 and outputValues = 2048. With the benchmark boundary (totalByteCountVector = totalBytesCount - inputByteCountPerVector), decoding takes 63 unpackValuesUsingVector calls and 4 unpack8Values calls. With totalByteCountVector = totalBytesCount - BYTES_PER_VECTOR_512, decoding takes 59 unpackValuesUsingVector calls and 20 unpack8Values calls. (A worked calculation of these counts appears after this list.)

  2. Is the 64-byte reservation here meant to guard against out-of-bounds reads and similar data-safety issues? Would it be possible to use totalByteCountVector = totalBytesCount - inputByteCountPerVector; in the readBatchUsing512Vector boundary calculation instead?

  3. Also, if readBatchUsing512Vector keeps totalByteCountVector = totalBytesCount - BYTES_PER_VECTOR_512;, the vectorization boundary already covers the bounds-safety case, so is loading a vector from a byte[] at an offset with a mask still necessary? As I understand it, out-of-bounds reads during vectorized processing are already prevented once BYTES_PER_VECTOR_512 is reserved, and the performance cost of masked loads for boundary safety may exceed that of static ByteVector fromArray(VectorSpecies species, byte[] a, int offset), which uses no mask and simply leaves the excess loaded lanes unused. Would it be possible to load the vector from the byte[] at an offset without a mask, even though each load would then bring in some extra bytes? Or is there some other reason the mask cannot be eliminated? (The two load flavors are sketched after this list.) I'm eagerly awaiting an answer; sorry for my poor English.
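For concreteness, the call counts in question 1 can be reproduced with simple arithmetic. This assumes each 512-bit call emits 32 output values at bitWidth = 3, so inputByteCountPerVector = 32 * 3 / 8 = 12 bytes (an assumption stated here rather than taken from the code):

```java
// Reproduces the 63/4 vs 59/20 call counts from question 1.
// Assumption: one vector call emits 32 values at bitWidth = 3,
// i.e. inputByteCountPerVector = 32 * 3 / 8 = 12 bytes.
public class BoundaryCounts {
  public static void main(String[] args) {
    int bitWidth = 3, outputValues = 2048;
    int totalBytesCount = outputValues * bitWidth / 8;  // 768 bytes
    int inputByteCountPerVector = 12;

    for (int reserve : new int[] {inputByteCountPerVector, 64}) {
      int boundary = totalBytesCount - reserve;
      int vectorCalls = 0, byteIndex = 0;
      while (byteIndex < boundary) {
        vectorCalls++;
        byteIndex += inputByteCountPerVector;
      }
      int scalarCalls = (totalBytesCount - byteIndex) / bitWidth; // 8 values per call
      System.out.printf("reserve=%2d bytes -> %d vector calls, %d unpack8Values calls%n",
          reserve, vectorCalls, scalarCalls);
    }
    // Prints:
    // reserve=12 bytes -> 63 vector calls, 4 unpack8Values calls
    // reserve=64 bytes -> 59 vector calls, 20 unpack8Values calls
  }
}
```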
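And for question 3, here is a minimal sketch of how the two load flavors from jdk.incubator.vector differ, assuming ByteVector.SPECIES_512 (run with --add-modules jdk.incubator.vector):

```java
import jdk.incubator.vector.ByteVector;
import jdk.incubator.vector.VectorMask;
import jdk.incubator.vector.VectorSpecies;

// Sketch of the two load flavors discussed in question 3.
public class LoadSketch {
  private static final VectorSpecies<Byte> SPECIES = ByteVector.SPECIES_512;

  // Unmasked load: always reads 64 bytes at offset and throws
  // IndexOutOfBoundsException unless offset + 64 <= in.length.
  // Reserving BYTES_PER_VECTOR_512 at the boundary is exactly what
  // makes this precondition hold without a mask.
  static ByteVector loadUnmasked(byte[] in, int offset) {
    return ByteVector.fromArray(SPECIES, in, offset);
  }

  // Masked load: lanes whose index falls past in.length are disabled,
  // so it is safe near the end of the array, at the cost of building
  // and applying the mask on every load.
  static ByteVector loadMasked(byte[] in, int offset) {
    VectorMask<Byte> m = SPECIES.indexInRange(offset, in.length);
    return ByteVector.fromArray(SPECIES, in, offset, m);
  }
}
```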

Component(s)

Core, Benchmark