aws-samples / emr-serverless-samples

Example code for running Spark and Hive jobs on EMR Serverless.
https://aws.amazon.com/emr/serverless/
MIT No Attribution

EMR serverless "java.lang.ClassNotFoundException" #32

Closed suveerrao closed 2 years ago

suveerrao commented 2 years ago

We have a job that uses the MSSQL driver. I am currently supplying the config below as part of "spark properties", but I am getting the error shown after it.

"--conf spark.archives=/artifacts/pyspark/pyspark_ge.tar.gz#environment --conf spark.submit.pyFiles=/package-1.0.0-py3.8.egg --jars /jar/spark-mssql-connector_2.12-1.1.0.jar --conf spark.emr-serverless.driverEnv.PYSPARK_DRIVER_PYTHON=./environment/bin/python --conf spark.emr-serverless.driverEnv.PYSPARK_PYTHON=./environment/bin/python --conf spark.executorEnv.PYSPARK_PYTHON=./environment/bin/python"

Error:

": java.lang.ClassNotFoundException: com.microsoft.sqlserver.jdbc.SQLServerDriver"

dacort commented 2 years ago

Hi @suveerrao - From your spark properties, it looks like all your file paths (spark.archives, pyFiles, and --jars) are not pointing to S3 locations?

I know at least with spark.archives you'll need to provide a location on S3. I see you've also created a virtualenv archive - you might be able to package the egg and jar inside of that, but then the path you'll need to provide will start with ./environment.
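As a rough sketch of the S3 suggestion, the snippet below rebuilds the same properties with every artifact path pointing at S3. The bucket name `my-emr-artifacts` and the helper `build_spark_submit_parameters` are hypothetical, the key layout under the bucket simply mirrors the local paths from the question, and it assumes `--conf spark.jars=...` is an acceptable stand-in for `--jars`.

```python
# Hypothetical bucket name; replace it with the bucket that holds your artifacts.
BUCKET = "my-emr-artifacts"


def build_spark_submit_parameters(bucket: str) -> str:
    """Return a Spark properties string in which every artifact path is on S3."""
    prefix = f"s3://{bucket}"
    # The archive is unpacked under ./environment (the name after '#'),
    # so the interpreter path stays relative to that directory.
    python_bin = "./environment/bin/python"
    conf = {
        "spark.archives": f"{prefix}/artifacts/pyspark/pyspark_ge.tar.gz#environment",
        "spark.submit.pyFiles": f"{prefix}/package-1.0.0-py3.8.egg",
        # Assumption: spark.jars is used here in place of the --jars flag.
        "spark.jars": f"{prefix}/jar/spark-mssql-connector_2.12-1.1.0.jar",
        "spark.emr-serverless.driverEnv.PYSPARK_DRIVER_PYTHON": python_bin,
        "spark.emr-serverless.driverEnv.PYSPARK_PYTHON": python_bin,
        "spark.executorEnv.PYSPARK_PYTHON": python_bin,
    }
    return " ".join(f"--conf {key}={value}" for key, value in conf.items())


if __name__ == "__main__":
    print(build_spark_submit_parameters(BUCKET))
```

The printed string is what you would paste into the "spark properties" field after uploading the three files to the matching S3 keys.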

If you want to use the --packages flag to specify the mssql package, you can do that; you'll just need to create your application in a VPC. There are more details on that here: https://github.com/aws-samples/emr-serverless-samples/tree/main/examples/pyspark/dependencies#pyspark-jobs-with-java-dependencies
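For the --packages route, a minimal sketch of the property might look like the following. It uses Spark's `spark.jars.packages` setting, which takes the same `groupId:artifactId:version` coordinates as `--packages`. The helper `packages_conf` is hypothetical, and the Maven coordinate is an assumption inferred from the jar file name in the question, so please check it against Maven Central before relying on it.

```python
# Assumed Maven coordinate, inferred from spark-mssql-connector_2.12-1.1.0.jar.
MSSQL_CONNECTOR = "com.microsoft.azure:spark-mssql-connector_2.12:1.1.0"


def packages_conf(*coordinates: str) -> str:
    """Build a --conf spark.jars.packages=... entry from Maven coordinates."""
    for coordinate in coordinates:
        parts = coordinate.split(":")
        # Spark expects groupId:artifactId:version with no empty parts.
        if len(parts) != 3 or not all(parts):
            raise ValueError(f"expected groupId:artifactId:version, got {coordinate!r}")
    return "--conf spark.jars.packages=" + ",".join(coordinates)


if __name__ == "__main__":
    print(packages_conf(MSSQL_CONNECTOR))
```

Spark resolves these coordinates when the job starts, which is why the application has to be created in a VPC as mentioned above.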

dacort commented 2 years ago

Closing this for now, feel free to reopen if you're still experiencing an issue. :)