Open cxzl25 opened 9 months ago
Gentle ping, @cxzl25.
This is a high-risk approach that the Apache Hadoop community also does not recommend.
May I ask whether you have any reference in ASF communities that adopts this hack?
@Hexiaoqiao Can you help review this PR?
Thanks for involving me here. I have left my concerns here and would like to reassert them: I believe this could reduce the load on the NameNode when using ORC on HDFS, but I am worried that it could introduce potential issues from the HDFS point of view.
Back to this PR: IIUC, this is used only to extract the footer, right? Technically, for an immutable file, the results of `DFSInputStream#getFileLength()` and `FileStatus#getLen()` will be the same, but this is not a recommended usage, as @dongjoon-hyun mentioned above.
Thanks again.
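To make the footer point concrete, here is a minimal, self-contained sketch of why the reader needs an accurate file length at all. It relies on one fact from the ORC file format (the final byte of the file holds the postscript length). Everything else is an assumption for illustration: the class `TailReadSketch`, its method names, and the 16 KiB tail-read size are hypothetical and are not taken from this PR or from the ORC code base.

```java
/** Sketch: how a reader might locate the ORC tail from the file length alone. */
final class TailReadSketch {
    // Hypothetical speculative tail-read size; an assumption, not a value from this PR.
    static final int TAIL_GUESS_BYTES = 16 * 1024;

    /** Offset at which the tail read would start for a file of the given length. */
    static long tailOffset(long fileLength, int tailGuessBytes) {
        return Math.max(0L, fileLength - tailGuessBytes);
    }

    /** Reads the postscript length from the byte the reader believes is the last one. */
    static int postscriptLength(byte[] contents, long fileLength) {
        return contents[(int) (fileLength - 1)] & 0xff;
    }

    public static void main(String[] args) {
        byte[] contents = new byte[40_000];
        contents[contents.length - 1] = (byte) 23; // pretend the postscript is 23 bytes long

        long correctLength = contents.length;
        long staleLength = contents.length - 100;  // simulates a length that lags the real file

        System.out.println("tail offset:           " + tailOffset(correctLength, TAIL_GUESS_BYTES));
        System.out.println("postscript (correct):  " + postscriptLength(contents, correctLength));
        System.out.println("postscript (stale):    " + postscriptLength(contents, staleLength));
    }
}
```

With a stale length, the byte interpreted as the postscript length is simply the wrong byte, which is why the two length sources must agree before one can be substituted for the other.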
What changes were proposed in this pull request?
If the file in HDFS is in a completed state, avoid calling the HDFS getFileInfo RPC.
Provide the `orc.file.length.fast` configuration to enable this behavior.
Why are the changes needed?
Currently, reading an ORC file in HDFS generates at least one `open` and one `getFileInfo` RPC. This optimization removes the getFileInfo RPC wherever possible, which improves ORC read efficiency and reduces the number of HDFS RPCs.
How was this patch tested?
A production environment running this patch has been stable for several months.
Was this patch authored or co-authored using generative AI tooling?
No
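For illustration only, the proposed behavior can be modeled without any Hadoop dependency. The sketch below is not the patch itself: `FakeNameNode`, `LengthResolver`, and the `fastLength` flag (standing in for `orc.file.length.fast`) are hypothetical names, and the model assumes that the reply to `open` already carries the file length and a completion flag. It only counts how many simulated RPCs each path issues.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical stand-in for the NameNode that counts the RPCs it serves. */
final class FakeNameNode {
    static final class Entry {
        final long length;
        final boolean complete;

        Entry(long length, boolean complete) {
            this.length = length;
            this.complete = complete;
        }
    }

    private final Map<String, Entry> files = new HashMap<>();
    int openCalls;
    int getFileInfoCalls;

    void addFile(String path, long length, boolean complete) {
        files.put(path, new Entry(length, complete));
    }

    /** Models open: assumed to return the length and completion state in its reply. */
    Entry open(String path) {
        openCalls++;
        return files.get(path);
    }

    /** Models getFileInfo: a separate round trip that returns the length. */
    long getFileInfo(String path) {
        getFileInfoCalls++;
        return files.get(path).length;
    }
}

final class LengthResolver {
    /** Skips getFileInfo only when the fast flag is on AND the file is complete. */
    static long resolveLength(FakeNameNode nn, String path, boolean fastLength) {
        FakeNameNode.Entry opened = nn.open(path);
        if (fastLength && opened.complete) {
            return opened.length;     // length taken from the open reply
        }
        return nn.getFileInfo(path);  // fallback: ask the NameNode explicitly
    }

    public static void main(String[] args) {
        FakeNameNode nn = new FakeNameNode();
        nn.addFile("/warehouse/a.orc", 1_000L, true);

        resolveLength(nn, "/warehouse/a.orc", false);
        resolveLength(nn, "/warehouse/a.orc", true);
        System.out.println("open=" + nn.openCalls + " getFileInfo=" + nn.getFileInfoCalls);
    }
}
```

In this model the fast path saves one round trip per completed file, while files still being written always fall back to the explicit length query.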