Closed by Arnaud-Nauwynck 2 days ago
That's an Apache Hadoop issue, not Parquet; please file it on https://issues.apache.org/jira/ under project HADOOP, component fs/azure, and tag it with the version of the Hadoop binaries.
HADOOP-18884 ([ABFS] Support VectorIO in ABFS Input Stream) is actually what you want; although Anmol @ Microsoft has assigned it to himself, we've not seen any code yet.
Sorry, I misclicked and filed this in the GitHub project parquet instead of hadoop. I have recreated it as https://issues.apache.org/jira/browse/HADOOP-19345, as I had not seen your comment. I will mark it as a duplicate in Jira.
Describe the enhancement requested
In hadoop-azure, there are huge performance problems when a file is read in a too fragmented way: reading many small file fragments, even via the readVectored() Hadoop API, results in distinct HTTPS requests (TCP connection establishment + TLS handshake + request). Internally, at the lowest level, hadoop-azure uses the class HttpURLConnection from JDK 1.0, and the read-ahead threads do not sufficiently solve the problem. The hadoop-azure implementation of readVectored() should strike a compromise between reading extra ignored data (holes) and establishing too many HTTPS connections.
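For context (not part of the original report), here is a minimal caller-side sketch of the readVectored() API being discussed, using the Hadoop vectored-IO entry points FileRange.createFileRange and FSDataInputStream.readVectored; the file path, offsets and lengths are made-up placeholders.

```java
import java.nio.ByteBuffer;
import java.util.Arrays;
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileRange;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class VectoredReadExample {
  public static void main(String[] args) throws Exception {
    // Placeholder path: any abfs:// URI would do.
    Path file = new Path("abfs://container@account.dfs.core.windows.net/data/file.parquet");
    FileSystem fs = file.getFileSystem(new Configuration());

    // Several small, non-contiguous ranges, as a Parquet reader would request for column chunks.
    List<FileRange> ranges = Arrays.asList(
        FileRange.createFileRange(1_000_000L, 64_000),
        FileRange.createFileRange(1_200_000L, 32_000),
        FileRange.createFileRange(9_500_000L, 128_000));

    try (FSDataInputStream in = fs.open(file)) {
      // On ABFS today this falls through to the generic fallback,
      // i.e. one HTTPS request per range.
      in.readVectored(ranges, ByteBuffer::allocate);
      for (FileRange r : ranges) {
        ByteBuffer data = r.getData().get();   // each range completes as a CompletableFuture<ByteBuffer>
        System.out.println("offset " + r.getOffset() + " -> " + data.remaining() + " bytes");
      }
    }
  }
}
```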
Currently, AzureBlobFileSystem#open() returns a stream with the default, inefficient implementation of readVectored().
See the default implementation of FSDataInputStream.readVectored():
it calls through to the underlying AbfsInputStream, which does not override it.
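Roughly, that fallback behaves like the following (a simplified paraphrase for illustration, not the actual Hadoop source): each FileRange is read on its own, so on ABFS each range turns into its own HTTPS request with its own connection and TLS-handshake cost.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.function.IntFunction;

import org.apache.hadoop.fs.FileRange;
import org.apache.hadoop.fs.PositionedReadable;

/** Simplified paraphrase of the fallback behaviour, for illustration only. */
final class VectoredReadFallbackSketch {
  static void readVectoredFallback(PositionedReadable stream,
                                   List<? extends FileRange> ranges,
                                   IntFunction<ByteBuffer> allocate) {
    for (FileRange range : ranges) {
      // One independent positioned read per range: on ABFS, one HTTPS GET each.
      range.setData(CompletableFuture.supplyAsync(() -> {
        byte[] buf = new byte[range.getLength()];
        try {
          stream.readFully(range.getOffset(), buf, 0, buf.length);
        } catch (IOException e) {
          throw new UncheckedIOException(e);
        }
        ByteBuffer b = allocate.apply(range.getLength());
        b.put(buf);
        b.flip();
        return b;
      }));
    }
  }
}
```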
AbfsInputStream should override this method and internally accept making fewer HTTPS calls over merged ranges, ignoring some of the returned data (the holes between the requested ranges).
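A possible shape for such an override, purely as an illustration (hypothetical class and constant names, not a tested patch): sort the ranges, coalesce those whose gap is under a threshold, issue one positioned read per coalesced group, then slice the merged buffer back into the caller's ranges and drop the holes.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.function.IntFunction;

import org.apache.hadoop.fs.FileRange;
import org.apache.hadoop.fs.PositionedReadable;

/** Illustrative sketch only: coalesce nearby ranges so one HTTPS request serves several of them. */
final class CoalescingVectoredRead {

  /** Gap below which two ranges are read in a single request (holes are downloaded and discarded). */
  static final long MAX_GAP = 4 * 1024 * 1024;   // e.g. 4 MB, to be tuned for ABFS

  static void readVectoredCoalesced(PositionedReadable stream,
                                    List<? extends FileRange> ranges,
                                    IntFunction<ByteBuffer> allocate) {
    // 1. Sort the requested ranges by file offset.
    List<FileRange> sorted = new ArrayList<>(ranges);
    sorted.sort(Comparator.comparingLong(FileRange::getOffset));

    // 2. Group ranges whose gap to the previous group is below MAX_GAP.
    List<List<FileRange>> groups = new ArrayList<>();
    List<FileRange> current = new ArrayList<>();
    long currentEnd = -1;
    for (FileRange r : sorted) {
      if (!current.isEmpty() && r.getOffset() - currentEnd > MAX_GAP) {
        groups.add(current);
        current = new ArrayList<>();
      }
      current.add(r);
      currentEnd = Math.max(currentEnd, r.getOffset() + r.getLength());
    }
    if (!current.isEmpty()) {
      groups.add(current);
    }

    // 3. One positioned read per group, then slice the merged buffer into the caller's ranges.
    for (List<FileRange> group : groups) {
      long start = group.get(0).getOffset();
      long end = group.stream()
          .mapToLong(r -> r.getOffset() + r.getLength()).max().getAsLong();
      int mergedLen = (int) (end - start);

      CompletableFuture<byte[]> merged = CompletableFuture.supplyAsync(() -> {
        byte[] buf = new byte[mergedLen];
        try {
          stream.readFully(start, buf, 0, mergedLen);   // one HTTPS GET on ABFS per group
        } catch (IOException e) {
          throw new UncheckedIOException(e);
        }
        return buf;
      });

      for (FileRange r : group) {
        r.setData(merged.thenApply(buf -> {
          ByteBuffer slice = allocate.apply(r.getLength());
          slice.put(buf, (int) (r.getOffset() - start), r.getLength());
          slice.flip();
          return slice;
        }));
      }
    }
  }
}
```

The threshold is exactly the compromise described above: downloading and discarding a hole of a few MB is cheaper than paying for another TLS handshake.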
This amounts to honouring the tuning parameter that Hadoop's FSDataInputStream already exposes through PositionedReadable: the minimum seek distance below which neighbouring ranges should be merged, which defaults to 4096 bytes.
Even this 4096 value is very conservative and should be redefined by the ABFS implementation to 4 MB or even 8 MB.
Asking ChatGPT: "on Azure Storage, what is the speed of getting 8 MB of a page blob, compared to the time to establish an HTTPS TLS handshake?" The (untrusted) answer: an HTTPS/TLS handshake takes ~100–300 ms, which is generally slower than downloading 8 MB from a page blob: ~100–200 ms on the Standard tier, ~30–50 ms on the Premium tier.
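If I read the API correctly, the knobs for this already exist on PositionedReadable (minSeekForVectorReads / maxReadSizeForVectorReads); a sketch of widening them on the ABFS side could look like this, with illustrative values only and not as a tested patch against AbfsInputStream:

```java
import org.apache.hadoop.fs.PositionedReadable;

/**
 * Illustrative only: how an ABFS stream could widen the vectored-read hints
 * defined on PositionedReadable. The 4/8 MB values follow the latency argument
 * above and would need benchmarking.
 */
abstract class AbfsVectoredReadHints implements PositionedReadable {

  /** Ranges closer than this are worth merging: the hole is cheaper than a new request. */
  @Override
  public int minSeekForVectorReads() {
    return 4 * 1024 * 1024;   // 4 MB instead of the conservative 4 KB default
  }

  /** Upper bound on the size of a single coalesced request. */
  @Override
  public int maxReadSizeForVectorReads() {
    return 8 * 1024 * 1024;   // 8 MB
  }
}
```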
The Azure ABFS client already sets up, by default, a number of threads for read-ahead prefetching of 4 MB of data, but this is NOT sufficient, and it is less efficient than simply implementing correctly what already exists in the Hadoop API: readVectored(). It also has the drawback of reading tons of useless data (past the Parquet blocks) that is never used.
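For comparison, these are the prefetch-related settings I am aware of (key names recalled from hadoop-azure and worth double-checking against your Hadoop version); they only tune blind read-ahead and cannot know which byte ranges the reader will actually need:

```java
import org.apache.hadoop.conf.Configuration;

// Read-ahead knobs only control blind prefetching, not range-aware vectored reads.
public class AbfsReadAheadConfigSketch {
  public static Configuration configure() {
    Configuration conf = new Configuration();
    conf.setBoolean("fs.azure.enable.readahead", true);                       // prefetch on/off
    conf.set("fs.azure.readaheadqueue.depth", "2");                           // queued prefetch blocks
    conf.set("fs.azure.read.request.size", String.valueOf(4 * 1024 * 1024));  // 4 MB per GET
    return conf;
  }
}
```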
Component(s)
No response