Hi @suveerrao - From your Spark properties, it looks like none of your file paths (`spark.archives`, `spark.submit.pyFiles`, and `--jars`) are pointing to S3 locations. I know that, at least with `spark.archives`, you'll need to provide a location on S3. I see you've also created a virtualenv archive - you might be able to package the egg and jar inside of that, but then the paths you provide will need to start with `./environment`.
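For example, here's a sketch of the same properties pointing at S3 instead (shown one option per line for readability; the bucket name is a placeholder, so substitute your own bucket and key prefixes):

--conf spark.archives=s3://your-bucket/artifacts/pyspark/pyspark_ge.tar.gz#environment
--conf spark.submit.pyFiles=s3://your-bucket/package-1.0.0-py3.8.egg
--jars s3://your-bucket/jar/spark-mssql-connector_2.12-1.1.0.jar
--conf spark.emr-serverless.driverEnv.PYSPARK_DRIVER_PYTHON=./environment/bin/python
--conf spark.emr-serverless.driverEnv.PYSPARK_PYTHON=./environment/bin/python
--conf spark.executorEnv.PYSPARK_PYTHON=./environment/bin/python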
If you want to use the `--packages` flag to specify the mssql package, you can do that, but you'll need to create your application in a VPC. There are more details on that here: https://github.com/aws-samples/emr-serverless-samples/tree/main/examples/pyspark/dependencies#pyspark-jobs-with-java-dependencies
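As a minimal sketch of that approach - assuming the Maven coordinate matching the jar you listed is com.microsoft.azure:spark-mssql-connector_2.12:1.1.0 (worth double-checking the group and version you actually need) - the property would look something like:

--conf spark.jars.packages=com.microsoft.azure:spark-mssql-connector_2.12:1.1.0

Spark has to download the package from Maven at job start, which is why the application needs the VPC (and outbound network access) configuration mentioned above.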
Closing this for now, feel free to reopen if you're still experiencing an issue. :)
We have a job that uses the MSSQL driver. Currently I am supplying the below config as part of "Spark properties", but I am getting the below error.
"--conf spark.archives=/artifacts/pyspark/pyspark_ge.tar.gz#environment --conf spark.submit.pyFiles=/package-1.0.0-py3.8.egg --jars /jar/spark-mssql-connector_2.12-1.1.0.jar --conf spark.emr-serverless.driverEnv.PYSPARK_DRIVER_PYTHON=./environment/bin/python --conf spark.emr-serverless.driverEnv.PYSPARK_PYTHON=./environment/bin/python --conf spark.executorEnv.PYSPARK_PYTHON=./environment/bin/python"
Error:
": java.lang.ClassNotFoundException: com.microsoft.sqlserver.jdbc.SQLServerDriver"