GH-2990: Only call hsync() and hflush() on supported filesystems

apache / parquet-java

Apache Parquet Java

https://parquet.apache.org/

Apache License 2.0


GH-2990: Only call hsync() and hflush() on supported filesystems #2991

Open CZuegner opened 3 months ago

CZuegner commented 3 months ago

Instead of logging the unsupported call, check the stream's capabilities and call hsync() and hflush() only on filesystems that support them; S3A, for example, does not.

Rationale for this change

When streaming into a HadoopOutputFile on S3A, a warning gets logged: "Application invoked the Syncable API against stream writing to XXX. This is Unsupported". See https://hadoop.apache.org/docs/current/hadoop-aws/tools/hadoop-aws/troubleshooting_s3a.html#UnsupportedOperationException_.E2.80.9CS3A_streams_are_not_Syncable._See_HADOOP-17597..E2.80.9D

What changes are included in this PR?

Instead of logging the unsupported calls (hflush() and hsync()), check the stream's capabilities and make the calls only on filesystems that support them; S3A does not.
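A minimal sketch of the capability check described above. It does not use Hadoop itself: `SyncableStream`, `RecordingStream` and `syncIfSupported` are hypothetical stand-ins invented for illustration, and the capability names "hflush" and "hsync" are assumed to match the values Hadoop's StreamCapabilities uses. It is not the code in this PR.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

final class CapabilityGuardSketch {

    /** Hypothetical stand-in for a Hadoop output stream that can be probed and synced. */
    interface SyncableStream {
        boolean hasCapability(String capability);
        void hflush(); // the real Hadoop method throws IOException; omitted to keep the sketch short
        void hsync();  // same simplification as hflush()
    }

    /** Calls hflush()/hsync() only when the stream advertises the matching capability. */
    static void syncIfSupported(SyncableStream out) {
        if (out.hasCapability("hflush")) {
            out.hflush();
        }
        if (out.hasCapability("hsync")) {
            out.hsync();
        }
    }

    /** Test double that advertises a fixed capability set and counts sync calls. */
    static final class RecordingStream implements SyncableStream {
        private final Set<String> capabilities;
        int hflushCalls;
        int hsyncCalls;

        RecordingStream(String... capabilities) {
            this.capabilities = new HashSet<>(Arrays.asList(capabilities));
        }

        @Override public boolean hasCapability(String capability) {
            return capabilities.contains(capability);
        }

        @Override public void hflush() { hflushCalls++; }

        @Override public void hsync() { hsyncCalls++; }
    }

    public static void main(String[] args) {
        RecordingStream hdfsLike = new RecordingStream("hflush", "hsync"); // supports both calls
        RecordingStream s3aLike = new RecordingStream();                   // supports neither
        syncIfSupported(hdfsLike);
        syncIfSupported(s3aLike);
        System.out.println("hdfs-like hsync calls: " + hdfsLike.hsyncCalls);
        System.out.println("s3a-like hsync calls: " + s3aLike.hsyncCalls);
    }
}
```

With a guard like this, a stream that advertises neither capability never receives the Syncable calls, so the S3A warning would not be triggered.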

Are these changes tested?

Yes

Are there any user-facing changes?

Closes: #2990

wgtmac commented 3 months ago

The CI failures are related:

[INFO] -------------------------------------------------------------
Error:  COMPILATION ERROR : 
[INFO] -------------------------------------------------------------
Error:  /home/runner/work/parquet-java/parquet-java/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/util/HadoopPositionOutputStream.java:[54,16] cannot find symbol
  symbol:   method hasCapability(java.lang.String)
  location: variable wrapped of type org.apache.hadoop.fs.FSDataOutputStream
Error:  /home/runner/work/parquet-java/parquet-java/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/util/HadoopPositionOutputStream.java:[67,15] cannot find symbol
  symbol:   method hasCapability(java.lang.String)
  location: variable fdos of type org.apache.hadoop.fs.FSDataOutputStream

steveloughran commented 1 month ago

that compiler failure means you are running against a very old version of hadoop, 2.8 or earlier as the change is from https://issues.apache.org/jira/browse/HDFS-11644

Keeping hadoop-2/2.7.3 support is really preventing the library from using the modern, especially cloud-friendlier, APIs, including the hadoop 2.9 APIs to probe for capabilities.
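For illustration only: while a build still compiles against a Hadoop release that lacks hasCapability(String), a reflective lookup is one way to use the method when it exists at runtime. The sketch below is an assumption about how such a probe could look, not what this PR or parquet-java does; `OldStream` and `NewStream` are hypothetical test doubles rather than Hadoop classes.

```java
import java.lang.reflect.Method;

final class CapabilityProbeSketch {

    /**
     * Returns true only if the object's class has a public hasCapability(String)
     * method and that method reports the capability as supported.
     */
    static boolean probeCapability(Object stream, String capability) {
        try {
            Method m = stream.getClass().getMethod("hasCapability", String.class);
            return Boolean.TRUE.equals(m.invoke(stream, capability));
        } catch (ReflectiveOperationException e) {
            // Method missing (older API), inaccessible, or it threw: treat as unsupported.
            return false;
        }
    }

    /** Hypothetical stream from an API version without hasCapability. */
    static final class OldStream { }

    /** Hypothetical stream from an API version with hasCapability. */
    static final class NewStream {
        public boolean hasCapability(String capability) {
            return "hsync".equals(capability);
        }
    }

    public static void main(String[] args) {
        System.out.println("old stream, hsync: " + probeCapability(new OldStream(), "hsync"));
        System.out.println("new stream, hsync: " + probeCapability(new NewStream(), "hsync"));
    }
}
```

Raising the minimum Hadoop version, as suggested here, removes the need for this kind of indirection.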

Compare with spark which is on 3.4.0.

Cut it and everyone's life will be much better. Doesn't have to be 3.4.x, but the latest 3.3.x release (3.3.x)

steveloughran commented 1 month ago

commented on this again.

that warning is only printed once per process, though it is potentially a sign of a dangerous mismatch between application code and the apps (hbase, some streaming logs)

what we could do (and I'll take a hadoop PR) is give that warning message a new log name which is only used for this message: org.apache.hadoop.fs.s3a.needless, for example. 😀
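A rough sketch of the two ideas in this comment: warn only once, and route the warning through a dedicated logger name so it can be filtered on its own. It uses java.util.logging rather than Hadoop's logging setup, and the logger name is the hypothetical one suggested above, not an existing Hadoop logger. `WarnOnceSketch` and `CountingHandler` are names made up for this example.

```java
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.logging.Handler;
import java.util.logging.LogRecord;
import java.util.logging.Logger;

final class WarnOnceSketch {

    // Hypothetical dedicated logger name taken from the suggestion; used only for this one message.
    static final String LOGGER_NAME = "org.apache.hadoop.fs.s3a.needless";

    private final Logger log = Logger.getLogger(LOGGER_NAME);
    private final AtomicBoolean warned = new AtomicBoolean(false);

    Logger logger() {
        return log;
    }

    /** Logs the warning the first time it is called on this instance, then stays silent. */
    void warnUnsupportedSync(String path) {
        if (warned.compareAndSet(false, true)) {
            log.warning("Application invoked the Syncable API against stream writing to "
                    + path + ". This is Unsupported");
        }
    }

    /** Handler that only counts the records it receives; handy for checking behaviour. */
    static final class CountingHandler extends Handler {
        int published;

        @Override public void publish(LogRecord record) { published++; }

        @Override public void flush() { }

        @Override public void close() { }
    }

    public static void main(String[] args) {
        WarnOnceSketch warner = new WarnOnceSketch();
        warner.warnUnsupportedSync("s3a://bucket/first");  // logged
        warner.warnUnsupportedSync("s3a://bucket/second"); // suppressed by the once-only flag
    }
}
```

Because the message has its own logger name, raising that logger's level above WARNING would silence it without hiding other S3A warnings.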

you can have it in hadoop 3.4.1 if you do a timely PR

otherwise, #2944 will fix the build problems

davidvoit commented 1 month ago

@steveloughran I'm a colleague of Christian and we worked together on this patch. Adding a dedicated log category just makes it easier to filter out the message, but in the end hsync just doesn't make any sense for Parquet on object storage, or would you disagree?

I think the hasCapability route is the best one here. Object storage writes are always atomic, so they don't need hsync at all. For HDFS the code still does the hsync, as it has the capability. The warning still makes absolute sense for things like HBase, which should not be used as-is with an object storage driver, but Parquet doesn't have this requirement and works fine without hsync.

Sure, we can wait for #2944, or if we should change something, just let us know :-)

steveloughran commented 1 month ago

hsync is overkill, as is flushing. It does make sense for things like streaming logs where you want to be confident it has been persisted even if your app crashes. The applications generally creating Parquet files (hive, spark...) implement failure resilience at a higher level. I wouldn't bother at all.

davidvoit commented 1 month ago

So should we modify the pull request and just remove the hsync and hflush calls?

steveloughran commented 1 month ago

worksforme. the only special case is: is someone using this in any commit algorithm where returning is viewed as a sign that it has been persisted? I'm thinking of stuff like iceberg here

CZuegner commented 1 month ago

I've changed the PR according to your suggestions.