Open CZuegner opened 3 months ago
The CI failures are related:
[INFO] -------------------------------------------------------------
Error: COMPILATION ERROR :
[INFO] -------------------------------------------------------------
Error: /home/runner/work/parquet-java/parquet-java/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/util/HadoopPositionOutputStream.java:[54,16] cannot find symbol
symbol: method hasCapability(java.lang.String)
location: variable wrapped of type org.apache.hadoop.fs.FSDataOutputStream
Error: /home/runner/work/parquet-java/parquet-java/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/util/HadoopPositionOutputStream.java:[67,15] cannot find symbol
symbol: method hasCapability(java.lang.String)
location: variable fdos of type org.apache.hadoop.fs.FSDataOutputStream
that compiler failure means you are running against a very old version of hadoop, 2.8 or earlier as the change is from https://issues.apache.org/jira/browse/HDFS-11644
Keeping the entire hadoop-2/2.7.3 is really preventing the library from using the modern, especially cloud-friendlier APIs -including hadoop 2.9 APIs to probe for capabilities.
Compare with spark which is on 3.4.0.
Cut it and everyone's life will be much better. Doesn't have to be 3.4.x, but the latest 3.3.x release (3.3.x)
commented on this again.
that warning is only printed once per process, though it is potentially a sign of a dangerous mismatch between application code and the apps (hbase, some streaming logs)
what we could do (and I'll take a hadoop PR) to give that warning message a new log name which is only used for this message. org.apache.hadoop.fs.s3a.needless for example. 😀
you can have it in hadoop 3.4.1 if you do a timely PR
otherwise, #2944 will fix the build problems
@steveloughran I'm a colleague of Christian and we worked together on this patch. If we add a own category this makes it just easier to filter out the message, but in the end hsync just don't make any sense for parquet and object storeage, or would you disagree?
I think the hasCapalipty route is the best one here. Object storage are always atomic so don't need hsync at all. For hdfs the code still does do the hsync as it has the capality. The warning still makes absolute sense for stuff like hbase, which should not be used as is together with an object storage driver, but parquet as is doesn't has this requirment, and works fine without hsync.
Sure we can wait for 2944 or if we should change something just let us now :-)
hsync is overkill, as is flushing. It does make sense for things like streaming logs where you want to be confident it has been persisted even if your app crashes. The applications generally creating Parquet files (hive, spark...) implement failure resilience at a higher level. I wouldn't bother at all.
So should we modify the pull request and just remove the hsync and hflush calls?
worksforme. the only special case is: is someone using this in any commit algorithm where returning is viewed as a sign that it has been persisted? I'm thinking of stuff like iceberg here
I've changed the PR according your thoughts.
Instead of log the unsupported call check capabilities and call only on supported filesystems - e.g. S3A does not.
Rationale for this change
When stream into an HadoopOutputFile on S3A a waring gets logged: Application invoked the Syncable API against stream writing to XXX. This is Unsupported https://hadoop.apache.org/docs/current/hadoop-aws/tools/hadoop-aws/troubleshooting_s3a.html#UnsupportedOperationException_.E2.80.9CS3A_streams_are_not_Syncable._See_HADOOP-17597..E2.80.9D
What changes are included in this PR?
Instead of log the unsupported call (hflush() and hsync()) check capabilities and call only on supported filesystems - whereas S3A is not.
Are these changes tested?
Yes
Are there any user-facing changes?
No
Closes: #2990