aws / aws-sdk-java

The official AWS SDK for Java 1.x (In Maintenance Mode, End-of-Life on 12/31/2025). The AWS SDK for Java 2.x is available here: https://github.com/aws/aws-sdk-java-v2/
https://aws.amazon.com/sdkforjava
Apache License 2.0

NoSuchMethodError: SemaphoredDelegatingExecutor while writing files to S3 #2510

Closed · ottobricks closed this 3 years ago

ottobricks commented 3 years ago

Issue writing to AWS S3 via the aws-java-sdk in a Spark context

Describe the bug

For a given DataFrame df in a PySpark environment, the operation df.write.parquet("s3a://some-bucket/test.parquet") starts fine, but it fails once concurrency kicks in and the S3A connector constructs org.apache.hadoop.util.SemaphoredDelegatingExecutor, which raises a java.lang.NoSuchMethodError.

Expected Behavior

Ideally, the files should be written without any issues.

Current Behavior

The write operation fails with the following stack trace in the worker:

Py4JJavaError: An error occurred while calling o8285.parquet.
: org.apache.spark.SparkException: Job aborted.
    at org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:226)
    at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:178)
    at org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult$lzycompute(commands.scala:108)
    at org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult(commands.scala:106)
    at org.apache.spark.sql.execution.command.DataWritingCommandExec.doExecute(commands.scala:131)
    at org.apache.spark.sql.execution.SparkPlan.$anonfun$execute$1(SparkPlan.scala:175)
    at org.apache.spark.sql.execution.SparkPlan.$anonfun$executeQuery$1(SparkPlan.scala:213)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
    at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:210)
    at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:171)
    at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:122)
    at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:121)
    at org.apache.spark.sql.DataFrameWriter.$anonfun$runCommand$1(DataFrameWriter.scala:963)
    at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:100)
    at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:160)
    at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:87)
    at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:764)
    at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64)
    at org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:963)
    at org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:415)
    at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:399)
    at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:288)
    at org.apache.spark.sql.DataFrameWriter.parquet(DataFrameWriter.scala:848)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
    at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
    at py4j.Gateway.invoke(Gateway.java:282)
    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
    at py4j.commands.CallCommand.execute(CallCommand.java:79)
    at py4j.GatewayConnection.run(GatewayConnection.java:238)
    at java.lang.Thread.run(Thread.java:748)
Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Task 155 in stage 18.0 failed 1 times, most recent failure: Lost task 155.0 in stage 18.0 (TID 649, ip-172-16-68-128.ec2.internal, executor driver): java.lang.NoSuchMethodError: org.apache.hadoop.util.SemaphoredDelegatingExecutor.<init>(Ljava/util/concurrent/ExecutorService;IZ)V
    at org.apache.hadoop.fs.s3a.S3AFileSystem.create(S3AFileSystem.java:824)
    at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:1118)
    at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:1098)
    at org.apache.parquet.hadoop.util.HadoopOutputFile.create(HadoopOutputFile.java:74)
    at org.apache.parquet.hadoop.ParquetFileWriter.<init>(ParquetFileWriter.java:248)
    at org.apache.parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:390)
    at org.apache.parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:349)
    at org.apache.spark.sql.execution.datasources.parquet.ParquetOutputWriter.<init>(ParquetOutputWriter.scala:37)
    at org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$$anon$1.newInstance(ParquetFileFormat.scala:150)
    at org.apache.spark.sql.execution.datasources.SingleDirectoryDataWriter.newOutputWriter(FileFormatDataWriter.scala:126)
    at org.apache.spark.sql.execution.datasources.SingleDirectoryDataWriter.<init>(FileFormatDataWriter.scala:111)
    at org.apache.spark.sql.execution.datasources.FileFormatWriter$.executeTask(FileFormatWriter.scala:264)
    at org.apache.spark.sql.execution.datasources.FileFormatWriter$.$anonfun$write$15(FileFormatWriter.scala:205)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
    at org.apache.spark.scheduler.Task.run(Task.scala:127)
    at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:446)
    at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1377)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:449)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)

The relevant part for this issue is:

Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Task 155 in stage 18.0 failed 1 times, most recent failure: Lost task 155.0 in stage 18.0 (TID 649, ip-172-16-68-128.ec2.internal, executor driver): java.lang.NoSuchMethodError: org.apache.hadoop.util.SemaphoredDelegatingExecutor.<init>(Ljava/util/concurrent/ExecutorService;IZ)V

This issue has been reported in Apache's JIRA as HADOOP-16080.

Steps to Reproduce

Environment: pyspark 3.0.1 with hadoop 3.2 and aws-java-sdk 1.11.95x. Provided you have already set up your Spark context (assigned in this example to sc), the code to reproduce the error is:

import pyspark

# `sc` is an existing SparkContext (see the setup function under "Your Environment")
sqlContext = pyspark.sql.SparkSession(sc)

df = sqlContext.createDataFrame(
    [(1, "a"), (2, "b"), (3, "c")],
    ["ID", "Text"]
)

df.write.parquet("s3a://some_bucket/test.parquet")

Possible Solution

From HADOOP-16080:

The problem is that S3AFileSystem.create() looks for SemaphoredDelegatingExecutor(com.google.common.util.concurrent.ListeningExecutorService) which does not exist in hadoop-client-api-3.1.1.jar. What does exist is SemaphoredDelegatingExecutor(org.apache.hadoop.shaded.com.google.common.util.concurrent.ListeningExecutorService).
To work around this issue I created a version of hadoop-aws-3.1.1.jar that relocated references to Guava.
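One way to check which of these constructor signatures is actually on the classpath at runtime is to inspect the class through py4j. This is only a diagnostic sketch: it assumes the active SparkContext sc from the repro above, and sc._jvm is a py4j-internal accessor.

# Diagnostic sketch: list the constructors of SemaphoredDelegatingExecutor that are
# actually loaded, and the jar the class came from. Assumes an active SparkContext
# `sc`; `sc._jvm` is a py4j-internal accessor.
klass = sc._jvm.java.lang.Class.forName(
    "org.apache.hadoop.util.SemaphoredDelegatingExecutor"
)
for ctor in klass.getConstructors():
    print(ctor.toString())

# The jar the class was loaded from often explains the mismatch.
print(klass.getProtectionDomain().getCodeSource().getLocation().toString())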

Context

This issue has been impacting all of my workflows that save DataFrames to S3.

Your Environment

Function I use to set up my spark env locally:

import os
from typing import Tuple
import findspark

findspark.init()

import pyspark
from pyspark.sql import SparkSession
from pyspark.context import SparkContext

def setup_spark(app_name: str) -> Tuple[SparkSession, SparkContext]:
    # Submit args are actually set via the file `jupyter-env.sh`, but I'll leave them here for completeness
    os.environ['PYSPARK_SUBMIT_ARGS'] = "--packages org.apache.hadoop:hadoop-aws:3.2.2,com.microsoft.ml.spark:mmlspark_2.11:1.0.0-rc3,org.apache.spark:spark-avro_2.12:2.4.5,com.amazonaws:aws-java-sdk:1.11.956 --repositories https://mmlspark.azureedge.net/maven pyspark-shell"

    spark = (
        SparkSession.builder.appName(app_name)
        .config("spark.sql.execution.arrow.enabled", "true")
        .config("spark.sql.repl.eagerEval.enabled", "true")
        .getOrCreate()
    )
    sc = spark.sparkContext
    sc.setSystemProperty("com.amazonaws.services.s3.enableV4", "true")
    sc._jsc.hadoopConfiguration().set(
        "fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem"
    )
    return (spark, sc)
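For completeness, the function above is then used to obtain the session and context; the app name here is just a placeholder:

spark, sc = setup_spark("local-s3a-debug")
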
debora-ito commented 3 years ago

Hi @ottok92, the S3AFileSystem class is part of the hadoop-aws library, not the AWS SDK for Java, so we cannot help much with this issue.

According to the JIRA ticket you referenced, the issue was fixed in hadoop-aws:3.2.2, which is the version you're using, so I would check whether the dependency is being resolved to a different version in your environment. For further questions, please reach out to Hadoop support.
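One way to act on this from PySpark is to print the Hadoop version the running Spark installation actually uses and make sure hadoop-aws is pinned to that exact version. A minimal sketch, assuming an active SparkContext sc (sc._jvm is a py4j-internal accessor):

# Minimal sketch: check the Hadoop version Spark is really running against.
# Assumes an active SparkContext `sc`; `sc._jvm` is a py4j-internal accessor.
hadoop_version = sc._jvm.org.apache.hadoop.util.VersionInfo.getVersion()
print("Runtime Hadoop version:", hadoop_version)

# hadoop-aws should match this version exactly, e.g. in --packages:
#   org.apache.hadoop:hadoop-aws:<runtime Hadoop version>
# rather than an unrelated 3.x release.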

github-actions[bot] commented 3 years ago

COMMENT VISIBILITY WARNING

Comments on closed issues are hard for our team to see. If you need more assistance, please open a new issue that references this one. If you wish to keep having a conversation with other community members under this issue feel free to do so.

romj4k3 commented 2 years ago

Hi, I got the same issue with Spark 3.2.0 when I switched to the magic committer. The possible solution doesn't work because it's already in the spark-hadoop-cloud package. Any ideas?

victorvalentee commented 2 years ago

I'm currently having the same issue with SemaphoredDelegatingExecutor. Spark 3.2.2 supposedly solves it, but EMR does not support this version. @ottok92 were you able to find any solutions for this?

ottobricks commented 2 years ago

I did fix the issue, but it was so long ago that I don't remember exactly what it was. The problem is not with Spark; it's with hadoop-aws. I suggest downloading the jars for the supported versions to S3 and pointing to them with --jars (an illustrative sketch follows below). The necessary jars are:

Note that for EMR 6.5.0, supported versions are:

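For illustration only, the --jars approach could look like the sketch below; the bucket, paths, and jar versions here are hypothetical placeholders, not the specific jar list or EMR 6.5.0 versions referred to above:

import os

# Hypothetical sketch: point Spark at pre-staged jars instead of resolving --packages.
# Bucket name, paths, and versions are placeholders -- substitute the ones that match
# your EMR release / Hadoop version.
os.environ["PYSPARK_SUBMIT_ARGS"] = (
    "--jars s3://my-bucket/jars/hadoop-aws-3.2.1.jar,"
    "s3://my-bucket/jars/aws-java-sdk-bundle-1.11.375.jar "
    "pyspark-shell"
)
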
Aliaksandr-Kastenka commented 2 years ago

For me, the issue disappeared when I changed the bucket path from 's3a://...' to 's3://...'.

romj4k3 commented 2 years ago

Hi,

Yes, but when you change s3a to s3 it means you are using S3 without the S3A driver. Anyway, we got EMR version 6.7.0 yesterday. You just need to update EMR to this version and all will work! 🙂

rjy7wb commented 2 years ago

YOU ARE WRONG: https://stackoverflow.com/a/71571625 https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-plan-file-systems.html "Previously, Amazon EMR used the s3n and s3a file systems. While both still work, we recommend that you use the s3 URI scheme for the best performance, security, and reliability."

romj4k3 commented 2 years ago

Actually, nope! You are wrong! We are working with Spark. The s3a scheme is essential for the magic committer; it won't work with the s3 scheme!
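For context, using the magic committer ties you to s3a:// paths and S3A-specific settings along these lines. A minimal sketch, assuming the spark-hadoop-cloud committer bindings and a matching hadoop-aws are on the classpath; the bucket path is a placeholder:

from pyspark.sql import SparkSession

# Minimal sketch of magic committer settings. Assumes the spark-hadoop-cloud
# committer bindings and a matching hadoop-aws are on the classpath.
spark = (
    SparkSession.builder.appName("magic-committer-sketch")
    .config("spark.hadoop.fs.s3a.committer.name", "magic")
    .config("spark.hadoop.fs.s3a.committer.magic.enabled", "true")
    .config("spark.sql.sources.commitProtocolClass",
            "org.apache.spark.internal.io.cloud.PathOutputCommitProtocol")
    .config("spark.sql.parquet.output.committer.class",
            "org.apache.spark.internal.io.cloud.BindingParquetOutputCommitter")
    .getOrCreate()
)

# The committer only works against s3a:// URIs, which is why the scheme matters here.
spark.range(10).write.mode("overwrite").parquet("s3a://some-bucket/magic-committer-test/")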

romj4k3 commented 2 years ago

Hi Victor,

How are you doing? You just need to add the script to the Bootstrap section when you spin up your cluster: spark-patch-s3a-fix_emr-6.6.0.sh =) Amazon provided this fix only for EMR 6.6.0. It's related to the s3a driver. If you use s3 without the driver, then it should work!
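For reference, attaching such a bootstrap action when launching the cluster could look roughly like the boto3 sketch below; the script path, roles, and instance settings are placeholders (the patch script itself is whatever Amazon provides for your EMR release):

import boto3

# Hypothetical sketch: launch an EMR 6.6.0 cluster with the s3a patch script as a
# bootstrap action. Script path, roles, and instance settings are placeholders.
emr = boto3.client("emr", region_name="us-east-1")

emr.run_job_flow(
    Name="spark-s3a-fix-example",
    ReleaseLabel="emr-6.6.0",
    Applications=[{"Name": "Spark"}],
    Instances={
        "InstanceGroups": [
            {"InstanceRole": "MASTER", "InstanceType": "m5.xlarge", "InstanceCount": 1},
            {"InstanceRole": "CORE", "InstanceType": "m5.xlarge", "InstanceCount": 2},
        ],
        "KeepJobFlowAliveWhenNoSteps": True,
    },
    BootstrapActions=[
        {
            "Name": "spark-patch-s3a-fix",
            "ScriptBootstrapAction": {
                # Placeholder: stage the script in your own bucket first.
                "Path": "s3://my-bucket/bootstrap/spark-patch-s3a-fix_emr-6.6.0.sh",
            },
        }
    ],
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
)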
