JahstreetOrg / spark-on-kubernetes-helm

Spark on Kubernetes infrastructure Helm charts repo
Apache License 2.0

Maven packages from spark.jars.packages are not loaded into the executors' classpath #59

Open BenMizrahiPlarium opened 3 years ago

BenMizrahiPlarium commented 3 years ago

Hi

I'm having an issue loading Maven package dependencies while using SparkMagic and this Helm chart for Livy and Spark on Kubernetes.

In the Spark config I set: spark.jars.packages=org.apache.spark:spark-sql-kafka-0-10_2.12:3.0.1

The dependencies are downloaded into /root/.ivy2/jars but are not included in the Spark classpath, and when I try to execute an action I get the following error:

21/01/05 11:22:15 INFO DAGScheduler: ShuffleMapStage 1 (take at :30) failed in 0.197 s due to Job aborted due to stage failure: Task 1 in stage 1.0 failed 4 times, most recent failure: Lost task 1.3 in stage 1.0 (TID 17, 10.4.187.11, executor 2): java.lang.ClassNotFoundException: org.apache.spark.sql.kafka010.KafkaSourceRDDPartition
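
For context, this is the kind of code that hits the error: planning succeeds on the driver, but the first action that ships tasks to the executors fails. This is a hypothetical reproduction, not from the original report; the bootstrap server and topic name are placeholders.

# Hypothetical reproduction inside a Livy/SparkMagic PySpark session.
# "kafka:9092" and "events" are placeholder values, not from the original report.
df = (
    spark.read.format("kafka")
    .option("kafka.bootstrap.servers", "kafka:9092")
    .option("subscribe", "events")
    .load()
)

# Planning works on the driver (it has the connector jar under /root/.ivy2/jars),
# but take() sends tasks to the executors, which cannot load the Kafka connector
# classes and fail with a ClassNotFoundException like the one above.
df.take(1)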

Do you have any suggestions?

Thanks

jahstreet commented 3 years ago

Hi @BenMizrahiPlarium, one question to clarify your setup: do you configure spark.jars.packages in code through SparkConf, or how do you set it?

BenMizrahiPlarium commented 3 years ago

Hi @jahstreet,

I did the setup using the config section of the Livy create-session request. In other setups I believe the jar artifacts downloaded from Maven are shared via HDFS and are available to both the driver and the workers; in this use case they are only available in the driver's local Maven repository.
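
For reference, a minimal sketch of such a create-session request, assuming a hypothetical in-cluster Livy endpoint; the conf block is where spark.jars.packages is set:

import requests

# Hypothetical Livy endpoint; adjust to the service name/port in your cluster.
LIVY_URL = "http://livy:8998"

payload = {
    "kind": "pyspark",
    "conf": {
        "spark.jars.packages": "org.apache.spark:spark-sql-kafka-0-10_2.12:3.0.1",
    },
}

# Livy's REST API creates a new interactive session via POST /sessions.
resp = requests.post(f"{LIVY_URL}/sessions", json=payload)
resp.raise_for_status()
print(resp.json())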

I see that the artifacts are downloaded, but they are not available on the executors' classpath.

If you have any ideas, it would be very helpful :)

brenoarosa commented 3 years ago

Currently running into the same issue.
I'm trying the session parameters below (which work with Spark 2).
I can see in the driver log messages that the package is downloaded from Maven, but it's not passed to the executors.

session_params = {
    "kind": "pyspark",
    "driverCores": 2,
    "driverMemory": "12g",
    "executorCores": 7,
    "executorMemory": "42g",
    "numExecutors": 1,
    "conf": {
        "spark.jars.packages": "mysql:mysql-connector-java",
    },
}

The jar is also not uploaded to spark.kubernetes.file.upload.path (I'm using an S3 bucket), but I don't know if this is the expected behavior.

Passing the HTTP link to the file in the jars parameter also doesn't work.

session_params = {
    "kind": "pyspark",
    "driverCores": 2,
    "driverMemory": "12g",
    "executorCores": 7,
    "executorMemory": "42g",
    "numExecutors": 1,
    "jars": ["https://repo1.maven.org/maven2/mysql/mysql-connector-java/5.1.48/mysql-connector-java-5.1.48.jar"],
}

BenMizrahiPlarium commented 3 years ago

I really think it's because Spark has no shared file system between the workers and the driver.

As a workaround for now, I solved it with gcsfuse: I mount a Google Cloud Storage bucket in the driver and the workers, configure the local Maven repository on both to point to that folder, and add the folder to the Spark extra classpath.

So finally the bucket is mounted at /etc/google in both the driver and the executors, the Maven jars point to /etc/google/jars, and the Spark config includes:

spark.driver.extraClassPath=/etc/google/jars/
spark.executor.extraClassPath=/etc/google/jars/

It works, but my problem with this is that it's not temporary: it persists across all sessions and is not isolated per user.
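
For completeness, a minimal sketch of what the classpath half of that workaround looks like when passed per session through the Livy conf block; the /etc/google/jars path comes from the comment above, and the gcsfuse mount itself still has to exist in the driver and executor pods:

session_params = {
    "kind": "pyspark",
    "conf": {
        # Assumes the gcsfuse mount at /etc/google/jars already exists in both
        # the driver and executor pods (see the workaround described above).
        "spark.driver.extraClassPath": "/etc/google/jars/",
        "spark.executor.extraClassPath": "/etc/google/jars/",
    },
}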

brenoarosa commented 3 years ago

I understood from the documentation that spark.kubernetes.file.upload.path should be used for sharing files between the driver and the executors.
Indeed, this works for the Livy-related jars.

I set my spark.kubernetes.file.upload.path to s3a://somosdigital-datascience/spark3/. These are the Livy logs when creating a new session:

21/01/18 20:26:20 INFO LineBufferedStream: 21/01/18 20:26:20 INFO KubernetesUtils: Uploading file: /opt/livy/rsc-jars/asm-5.0.4.jar to dest: s3a://somosdigital-datascience/spark3//spark-upload-6bb1a42f-082e-4621-8b6c-75e99ac92a76/asm-5.0.4.jar...
21/01/18 20:26:20 INFO LineBufferedStream: 21/01/18 20:26:20 INFO KubernetesUtils: Uploading file: /opt/livy/rsc-jars/livy-api-0.8.0-incubating-SNAPSHOT.jar to dest: s3a://somosdigital-datascience/spark3//spark-upload-b3a531b6-8c66-4e30-9b7a-30ffbf27dd8f/livy-api-0.8.0-incubating-SNAPSHOT.jar...
21/01/18 20:26:21 INFO LineBufferedStream: 21/01/18 20:26:21 INFO KubernetesUtils: Uploading file: /opt/livy/rsc-jars/livy-rsc-0.8.0-incubating-SNAPSHOT.jar to dest: s3a://somosdigital-datascience/spark3//spark-upload-738f5605-e3e0-499c-971a-7f8068be346d/livy-rsc-0.8.0-incubating-SNAPSHOT.jar...
21/01/18 20:26:21 INFO LineBufferedStream: 21/01/18 20:26:21 INFO KubernetesUtils: Uploading file: /opt/livy/rsc-jars/livy-thriftserver-session-0.8.0-incubating-SNAPSHOT.jar to dest: s3a://somosdigital-datascience/spark3//spark-upload-0f275439-d128-4388-b59e-e66da063deb7/livy-thriftserver-session-0.8.0-incubating-SNAPSHOT.jar...
21/01/18 20:26:22 INFO LineBufferedStream: 21/01/18 20:26:22 INFO KubernetesUtils: Uploading file: /opt/livy/rsc-jars/minlog-1.3.0.jar to dest: s3a://somosdigital-datascience/spark3//spark-upload-6d2c2679-8e89-4096-97d0-4dae83ada91c/minlog-1.3.0.jar...
21/01/18 20:26:22 INFO LineBufferedStream: 21/01/18 20:26:22 INFO KubernetesUtils: Uploading file: /opt/livy/rsc-jars/netty-all-4.1.47.Final.jar to dest: s3a://somosdigital-datascience/spark3//spark-upload-7ee64aca-6e94-43c3-8131-36d62bca562a/netty-all-4.1.47.Final.jar...
21/01/18 20:26:22 INFO LineBufferedStream: 21/01/18 20:26:22 INFO KubernetesUtils: Uploading file: /opt/livy/rsc-jars/objenesis-2.5.1.jar to dest: s3a://somosdigital-datascience/spark3//spark-upload-82bb5496-d3aa-47cd-8a4e-58178e3d8b36/objenesis-2.5.1.jar...
21/01/18 20:26:23 INFO LineBufferedStream: 21/01/18 20:26:23 INFO KubernetesUtils: Uploading file: /opt/livy/rsc-jars/reflectasm-1.11.3.jar to dest: s3a://somosdigital-datascience/spark3//spark-upload-4ff31636-583b-485e-b59d-a16006798b17/reflectasm-1.11.3.jar...
21/01/18 20:26:23 INFO LineBufferedStream: 21/01/18 20:26:23 INFO KubernetesUtils: Uploading file: /opt/livy/repl_2.12-jars/commons-codec-1.9.jar to dest: s3a://somosdigital-datascience/spark3//spark-upload-5f9c3fba-8357-4773-b8da-a350f9314f02/commons-codec-1.9.jar...
21/01/18 20:26:24 INFO LineBufferedStream: 21/01/18 20:26:24 INFO KubernetesUtils: Uploading file: /opt/livy/repl_2.12-jars/livy-core_2.12-0.8.0-incubating-SNAPSHOT.jar to dest: s3a://somosdigital-datascience/spark3//spark-upload-6dc3f8a0-147d-4f71-ac57-2cdbb1c76c00/livy-core_2.12-0.8.0-incubating-SNAPSHOT.jar...
21/01/18 20:26:24 INFO LineBufferedStream: 21/01/18 20:26:24 INFO KubernetesUtils: Uploading file: /opt/livy/repl_2.12-jars/livy-repl_2.12-0.8.0-incubating-SNAPSHOT.jar to dest: s3a://somosdigital-datascience/spark3//spark-upload-4ed754bc-55ed-41f5-84df-9a846589ba89/livy-repl_2.12-0.8.0-incubating-SNAPSHOT.jar... 

This is the relevant snippet from the driver ConfigMap. I can also see from the driver and executor logs that both of them download the jar files:

spark.kubernetes.file.upload.path=s3a\://somosdigital-datascience/spark3/
spark.jars=s3a\://somosdigital-datascience/spark3//spark-upload-6bb1a42f-082e-4621-8b6c-75e99ac92a76/asm-5.0.4.jar,s3a\://somosdigital-datascience/spark3//spark-upload-b3a531b6-8c66-4e30-9b7a-30ffbf27dd8f/livy-api-0.8.0-incubating-SNAPSHOT.jar,s3a\://somosdigital-datascience/spark3//spark-upload-738f5605-e3e0-499c-971a-7f8068be346d/livy-rsc-0.8.0-incubating-SNAPSHOT.jar,s3a\://somosdigital-datascience/spark3//spark-upload-0f275439-d128-4388-b59e-e66da063deb7/livy-thriftserver-session-0.8.0-incubating-SNAPSHOT.jar,s3a\://somosdigital-datascience/spark3//spark-upload-6d2c2679-8e89-4096-97d0-4dae83ada91c/minlog-1.3.0.jar,s3a\://somosdigital-datascience/spark3//spark-upload-7ee64aca-6e94-43c3-8131-36d62bca562a/netty-all-4.1.47.Final.jar,s3a\://somosdigital-datascience/spark3//spark-upload-82bb5496-d3aa-47cd-8a4e-58178e3d8b36/objenesis-2.5.1.jar,s3a\://somosdigital-datascience/spark3//spark-upload-4ff31636-583b-485e-b59d-a16006798b17/reflectasm-1.11.3.jar,s3a\://somosdigital-datascience/spark3//spark-upload-5f9c3fba-8357-4773-b8da-a350f9314f02/commons-codec-1.9.jar,s3a\://somosdigital-datascience/spark3//spark-upload-6dc3f8a0-147d-4f71-ac57-2cdbb1c76c00/livy-core_2.12-0.8.0-incubating-SNAPSHOT.jar,s3a\://somosdigital-datascience/spark3//spark-upload-4ed754bc-55ed-41f5-84df-9a846589ba89/livy-repl_2.12-0.8.0-incubating-SNAPSHOT.jar

But the jars passed via the jars session parameter or via the spark.jars.packages config don't get uploaded.

BenMizrahiPlarium commented 3 years ago

I don't think spark.kubernetes.file.upload.path is related to external Maven dependencies; it's related to the Spark application jars being uploaded to S3.

Spark downloads Maven dependencies into the local Maven repository and uses them on the classpath.

As far as I can see, the jars are loaded into the driver and everything done by the driver works fine, but when the actual task is executed on one of the executors, the task fails with a ClassNotFoundException, which means the jar isn't available on the executor classpath.
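
A quick way to confirm that diagnosis is to list the candidate jar locations from inside executor tasks and compare the result with the driver. This is only a sketch using a hypothetical helper; the paths are assumptions about where the jars could live on an executor pod.

import glob

def list_connector_jars(_):
    # Hypothetical helper: paths are assumptions about the executor image layout;
    # adjust them to the actual pod filesystem in use.
    patterns = [
        "/root/.ivy2/jars/*kafka*.jar",
        "/opt/spark/jars/*kafka*.jar",
    ]
    return [path for pattern in patterns for path in glob.glob(pattern)]

# Run the check on a handful of tasks. Empty results on the executors while
# /root/.ivy2/jars is populated on the driver confirms the classpath gap.
print(spark.sparkContext.parallelize(range(4), 4).map(list_connector_jars).collect())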