apache / parquet-java

Apache Parquet Java
https://parquet.apache.org/
Apache License 2.0

AzureBlobFileSystem.open() should return a sub-class of FSDataInputStream that overrides readVectored() much more efficiently for small reads #3077

Closed Arnaud-Nauwynck closed 2 days ago

Arnaud-Nauwynck commented 2 days ago

Describe the enhancement requested

In hadoop-azure, there are severe performance problems when reading a file in an overly fragmented way: reading many small file fragments, even through the readVectored() Hadoop API, results in distinct HTTPS requests (TCP connection establishment + TLS handshake + request). Internally, at the lowest level, hadoop-azure uses the HttpURLConnection class from JDK 1.0, and the read-ahead threads do not sufficiently solve the problem. The hadoop-azure implementation of readVectored() should strike a compromise between reading extra discarded data (holes) and establishing too many HTTPS connections.

Currently, AzureBlobFileSystem#open() returns a default, inefficient implementation of readVectored():

  private FSDataInputStream open(final Path path,
      final Optional<OpenFileParameters> parameters) throws IOException {
    ...
    InputStream inputStream = getAbfsStore().openFileForRead(qualifiedPath, parameters, statistics, tracingContext);
    return new FSDataInputStream(inputStream);  // <== FSDataInputStream does not override readVectored() efficiently!
  }

See the default implementation of FSDataInputStream.readVectored():

    public void readVectored(List<? extends FileRange> ranges, IntFunction<ByteBuffer> allocate) throws IOException {
        ((PositionedReadable)this.in).readVectored(ranges, allocate);
    }

It delegates to the underlying AbfsInputStream, which does not override the default from PositionedReadable:

    default void readVectored(List<? extends FileRange> ranges, IntFunction<ByteBuffer> allocate) throws IOException {
        VectoredReadUtils.readVectored(this, ranges, allocate);
    }

AbfsInputStream should override this method and internally accept doing fewer HTTPS calls, with merged ranges, discarding some of the returned data (the holes within the merged range).
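
A minimal sketch of what such an override could do, assuming Hadoop's vector IO types (FileRange, PositionedReadable) from hadoop-common 3.3.5+; the class name, the minSeek parameter, and the blocking readFully() approach are illustrative only, and a real implementation would issue the merged reads asynchronously:

    import java.io.IOException;
    import java.nio.ByteBuffer;
    import java.util.ArrayList;
    import java.util.Comparator;
    import java.util.List;
    import java.util.concurrent.CompletableFuture;
    import java.util.function.IntFunction;
    import org.apache.hadoop.fs.FileRange;
    import org.apache.hadoop.fs.PositionedReadable;

    /** Sketch only, not the actual ABFS code. */
    final class CoalescingVectoredRead {

      /**
       * Reads all ranges, merging any neighbours whose gap is below minSeek
       * into a single readFully() call (i.e. a single HTTPS request on ABFS),
       * then slices the merged buffer back into the caller's ranges,
       * discarding the holes.
       */
      static void readVectored(PositionedReadable in,
          List<? extends FileRange> ranges,
          IntFunction<ByteBuffer> allocate,
          long minSeek) throws IOException {
        List<FileRange> sorted = new ArrayList<>(ranges);
        sorted.sort(Comparator.comparingLong(FileRange::getOffset));
        int i = 0;
        while (i < sorted.size()) {
          // Grow a merged window [start, end) while gaps stay under minSeek.
          long start = sorted.get(i).getOffset();
          long end = start + sorted.get(i).getLength();
          int j = i + 1;
          while (j < sorted.size() && sorted.get(j).getOffset() - end <= minSeek) {
            end = Math.max(end, sorted.get(j).getOffset() + sorted.get(j).getLength());
            j++;
          }
          // One remote read for the whole merged window, holes included.
          byte[] merged = new byte[(int) (end - start)];
          in.readFully(start, merged, 0, merged.length);
          // Slice the merged buffer back into each original range.
          for (int k = i; k < j; k++) {
            FileRange r = sorted.get(k);
            ByteBuffer buf = allocate.apply(r.getLength());
            buf.put(merged, (int) (r.getOffset() - start), r.getLength());
            buf.flip();
            r.setData(CompletableFuture.completedFuture(buf));
          }
          i = j;
        }
      }
    }

An AbfsInputStream override could then delegate to such a helper with its preferred minSeek, and FSDataInputStream would pick it up automatically through its existing delegation.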

This amounts to honouring this parameter of Hadoop's PositionedReadable interface (implemented by FSDataInputStream):

  /**
   * What is the smallest reasonable seek?
   * @return the minimum number of bytes
   */
  default int minSeekForVectorReads() {
    return 4 * 1024;
  }

Even this 4096 value is very conservative; it should be redefined by the ABFS input stream to be 4 MB or even 8 MB.
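
As a sketch, using the numbers from this discussion (the exact values are assumptions, and maxReadSizeForVectorReads() is the companion PositionedReadable knob that caps how large a merged read may grow):

  // Hypothetical overrides in AbfsInputStream (sketch, not committed code):
  @Override
  public int minSeekForVectorReads() {
    return 4 * 1024 * 1024;   // 4 MB: over-reading a gap is cheaper than a new HTTPS connection
  }

  @Override
  public int maxReadSizeForVectorReads() {
    return 16 * 1024 * 1024;  // hypothetical cap on the size of a merged read
  }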

Asking ChatGPT: "on Azure Storage, what is the speed of getting 8 MB of a page blob, compared to the time to establish an HTTPS TLS handshake?" The (untrusted) answer: an HTTPS/TLS handshake takes ~100–300 ms, which is generally slower than downloading 8 MB from a page blob (~100–200 ms on the Standard tier, ~30–50 ms on the Premium tier). Taking these numbers at face value, fetching ten scattered column chunks over ten separate connections costs on the order of 10 × 100–300 ms = 1–3 s of pure handshake overhead, while a single merged 8 MB read pays one handshake plus ~100–200 ms of transfer.

The Azure ABFS client already sets up, by default, a number of read-ahead prefetch threads that prefetch 4 MB of data, but this is NOT sufficient, and it is less efficient than simply implementing correctly what is already in the Hadoop API: readVectored(). It also has the drawback of reading tons of useless data (past the requested Parquet blocks) that is never used.
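
For reference, this read-ahead behaviour is driven by configuration; a sketch of the relevant knobs follows (the keys are real hadoop-azure settings, but the values shown are assumptions that may differ across Hadoop versions). They tune how much speculative data is prefetched, but none of them merges the caller's ranges the way readVectored() could:

    import org.apache.hadoop.conf.Configuration;

    class AbfsReadAheadTuning {
      static Configuration tuned() {
        Configuration conf = new Configuration();
        conf.setBoolean("fs.azure.enable.readahead", true);         // prefetch enabled
        conf.setInt("fs.azure.readaheadqueue.depth", 2);            // prefetch queue depth (assumed default)
        conf.setInt("fs.azure.read.request.size", 4 * 1024 * 1024); // 4 MB per read request
        return conf;
      }
    }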

Component(s)

No response

steveloughran commented 2 days ago

That's an Apache JIRA matter, not Parquet; please file it on https://issues.apache.org/jira/ under project HADOOP, component fs/azure, tagged with the version of the Hadoop binaries.

HADOOP-18884, "[ABFS] Support VectorIO in ABFS Input Stream", is actually what you want; although Anmol @ Microsoft has assigned it to himself, we've not seen any code yet.

Arnaud-Nauwynck commented 2 days ago

Sorry, I misclicked into the GitHub parquet project instead of hadoop. I had recreated it as https://issues.apache.org/jira/browse/HADOOP-19345, as I had not seen your comment. Will mark it as a duplicate in Jira.