dotnet/spark

.NET for Apache® Spark™ makes Apache Spark™ easily accessible to .NET developers.
https://dot.net/spark
MIT License

Having a UDF in the pipeline breaks consistently on local and on Azure HDInsights/Spark 2.4 setups #494

Closed · mllab-nl closed this issue 4 years ago

mllab-nl commented 4 years ago

Describe the bug: Having a UDF in the pipeline breaks consistently, both locally and on the Azure HDInsights/Spark 2.4 setup.

To Reproduce

1. Follow the 10-minute tutorial: works fine on both the local and the Azure setup.
2. Add a simple UDF that makes words uppercase (see the sketch below): breaks with a FileNotFoundException in both setups.
3. Rerun the app: still breaks in both setups.
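
For reference, a minimal sketch of the kind of UDF described in step 2 (the input path and names are illustrative, not the exact tutorial code):

```csharp
using System;
using Microsoft.Spark.Sql;
using static Microsoft.Spark.Sql.Functions;

namespace MySparkApp
{
    class Program
    {
        static void Main(string[] args)
        {
            SparkSession spark = SparkSession
                .Builder()
                .AppName("udf-repro")
                .GetOrCreate();

            // Any text file with one word per line will do; the path is illustrative.
            DataFrame df = spark.Read().Text("input.txt");

            // The UDF body executes in Microsoft.Spark.Worker, which must be able
            // to locate and load mySparkApp.dll -- the step that fails below.
            Func<Column, Column> toUpper = Udf<string, string>(s => s.ToUpper());

            df.Select(toUpper(df["value"])).Show();
            spark.Stop();
        }
    }
}
```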

Expected behavior: Adding a UDF should not break the app.

Additional context: Worker version Microsoft.Spark.Worker-0.10.0.

```
[2020-04-24T08:56:24.0906298Z] [DESKTOP-XXXXX] [Error] [TaskRunner] [0] ProcessStream() failed with exception:
System.IO.FileNotFoundException: Assembly 'mySparkApp, Version=1.0.0.0, Culture=neutral, PublicKeyToken=null' file not found 'mySparkApp.dll' in 'C:\Users\XXXX\AppData\Local\Temp\spark-5fd960ac-2e4f-4f84-b726-7d7e72aeca23\userFiles-d016a082-5d84-411c-91c5-ec8e859bd516,C:\src\Spark\mySparkApp,C:\bin\Microsoft.Spark.Worker-0.10.0'
   at Microsoft.Spark.Utils.AssemblyLoader.LoadAssembly(String assemblyName, String assemblyFileName) in /_/src/csharp/Microsoft.Spark/Utils/AssemblyLoader.cs:line 122
   at Microsoft.Spark.Utils.UdfSerDe.<>c.<DeserializeType>b__10_0(TypeData td) in /_/src/csharp/Microsoft.Spark/Utils/UdfSerDe.cs:line 260
   at System.Collections.Concurrent.ConcurrentDictionary`2.GetOrAdd(TKey key, Func`2 valueFactory)
   at Microsoft.Spark.Utils.UdfSerDe.DeserializeType(TypeData typeData) in /_/src/csharp/Microsoft.Spark/Utils/UdfSerDe.cs:line 258
   at Microsoft.Spark.Utils.UdfSerDe.Deserialize(UdfData udfData) in /_/src/csharp/Microsoft.Spark/Utils/UdfSerDe.cs:line 160
   at Microsoft.Spark.Utils.CommandSerDe.DeserializeUdfs[T](UdfWrapperData data, Int32& nodeIndex, Int32& udfIndex) in /_/src/csharp/Microsoft.Spark/Utils/CommandSerDe.cs:line 267
   at Microsoft.Spark.Utils.CommandSerDe.Deserialize[T](Stream stream, SerializedMode& serializerMode, SerializedMode& deserializerMode, String& runMode) in /_/src/csharp/Microsoft.Spark/Utils/CommandSerDe.cs:line 243
   at Microsoft.Spark.Worker.Processor.CommandProcessor.ReadSqlCommands(PythonEvalType evalType, Stream stream) in D:\a\1\s\src\csharp\Microsoft.Spark.Worker\Processor\CommandProcessor.cs:line 190
   at Microsoft.Spark.Worker.Processor.CommandProcessor.ReadSqlCommands(PythonEvalType evalType, Stream stream, Version version) in D:\a\1\s\src\csharp\Microsoft.Spark.Worker\Processor\CommandProcessor.cs:line 117
   at Microsoft.Spark.Worker.Processor.CommandProcessor.Process(Stream stream) in D:\a\1\s\src\csharp\Microsoft.Spark.Worker\Processor\CommandProcessor.cs:line 62
   at Microsoft.Spark.Worker.Processor.PayloadProcessor.Process(Stream stream) in D:\a\1\s\src\csharp\Microsoft.Spark.Worker\Processor\PayloadProcessor.cs:line 74
   at Microsoft.Spark.Worker.TaskRunner.ProcessStream(Stream inputStream, Stream outputStream, Version version, Boolean& readComplete) in D:\a\1\s\src\csharp\Microsoft.Spark.Worker\TaskRunner.cs:line 143
```
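
For context, the error message shows the worker probing each directory in a comma-separated search list for mySparkApp.dll and throwing when none of them contains it. A simplified sketch of that probing behavior (an illustration, not the actual AssemblyLoader source; names are made up):

```csharp
using System;
using System.IO;
using System.Reflection;

static class AssemblyProbing
{
    // Simplified illustration: the real worker builds its search list from the
    // Spark userFiles directory, the current directory, and the comma-separated
    // DOTNET_ASSEMBLY_SEARCH_PATHS variable (all visible in the message above).
    public static Assembly LoadFromSearchPaths(string assemblyFileName, string[] searchPaths)
    {
        foreach (string dir in searchPaths)
        {
            string candidate = Path.Combine(dir, assemblyFileName);
            if (File.Exists(candidate))
            {
                return Assembly.LoadFrom(candidate);
            }
        }

        // Corresponds to the FileNotFoundException in the log: the assembly file
        // was found in none of the probed directories.
        throw new FileNotFoundException(
            $"file not found '{assemblyFileName}' in '{string.Join(",", searchPaths)}'");
    }
}
```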

mllab-nl commented 4 years ago

A workaround for the local setup is to run it from bin\Debug\netcoreapp3.1 (with adjusted paths). Is there a workaround available for the Azure HDInsights/Spark 2.4 setup?

mllab-nl commented 4 years ago

It looks like it is described here: https://docs.microsoft.com/en-us/dotnet/spark/how-to-guides/deploy-worker-udf-binaries

elvaliuliuliu commented 4 years ago

@mllab-nl: Did this UDF FAQ guide solve your issue?

mllab-nl commented 4 years ago

Local deployment works with the following cmd script:

```cmd
set DOTNET_ASSEMBLY_SEARCH_PATHS=%cd%\bin\Debug\netcoreapp3.1
echo %DOTNET_ASSEMBLY_SEARCH_PATHS%
%SPARK_HOME%\bin\spark-submit ^
  --class org.apache.spark.deploy.dotnet.DotnetRunner ^
  --master local ^
  bin\Debug\netcoreapp3.1\microsoft-spark-2.4.x-0.10.0.jar ^
  dotnet bin\Debug\netcoreapp3.1\mySparkApp.dll > out 2>&1
```

elvaliuliuliu commented 4 years ago

Can you try with this instruction on the HDI cluster, or use --files to place the DLLs on the executors, as referred to in the parameter options?

mllab-nl commented 4 years ago

Was able to get it working with --files (the #mySparkApp.dll suffix sets the file's name in the executor's working directory):

```bash
$SPARK_HOME/bin/spark-submit --master yarn --num-executors 10 \
  --class org.apache.spark.deploy.dotnet.DotnetRunner \
  --files wasbs://cdotc@cdots.blob.core.windows.net/publish100m/mySparkApp.dll#mySparkApp.dll \
  wasbs://cdotc@cdots.blob.core.windows.net/microsoft-spark-2.4.x-0.10.0.jar \
  wasbs://cdotc@cdots.blob.core.windows.net/publish100m.zip mySparkApp > out 2>&1
```

elvaliuliuliu commented 4 years ago

@mllab-nl : Nice! Good to know. Please let us know if there are any further questions. Thanks!

mllab-nl commented 4 years ago

@elvaliuliuliu I was not able to get the desired --archives option to work. It looks like the environment variables passed via --conf are not set in the executor. Am I doing it wrong?

mllab-nl commented 4 years ago

Not working command:

```bash
$SPARK_HOME/bin/spark-submit --master yarn --num-executors 10 \
  --class org.apache.spark.deploy.dotnet.DotnetRunner \
  --conf spark.yarn.appMasterEnv.DOTNET_ASSEMBLY_SEARCH_PATHS=./udfs/publish \
  --conf spark.yarn.appMasterEnv.XXX=a \
  --archives wasbs://cdotc@cdots.blob.core.windows.net/publish100m.zip#udfs \
  wasbs://cdotc@cdots.blob.core.windows.net/microsoft-spark-2.4.x-0.10.0.jar \
  wasbs://cdotc@cdots.blob.core.windows.net/publish100m.zip mySparkApp > out 2>&1
```

mllab-nl commented 4 years ago

Output of sudo cat /proc/<workerPID>/environ on the executor (NUL separators rendered as line breaks). Note that DOTNET_ASSEMBLY_SEARCH_PATHS points at Spark's userFiles temp directory, not the ./udfs/publish value passed via --conf:

```
SPARK_YARN_STAGING_DIR=wasb://cdotc@cdots.blob.core.windows.net/user/sshuser/.sparkStaging/application_1587824952802_0050
PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/var/lib/ambari-agent
HADOOP_CONF_DIR=/usr/hdp/3.1.2.2-1/hadoop/conf
JAVA_HOME=/usr/lib/jvm/zulu-8-azure-amd64
SPARK_REUSE_WORKER=1
LANG=en_US.UTF-8
SPARK_LOG_URL_STDOUT=http://wn2-cdot3.rlqfaneo4frubgaiq5bzdzdldg.xx.internal.cloudapp.net:30060/node/containerlogs/container_e01_1587824952802_0050_01_000007/sshuser/stdout?start=-4096
NM_HOST=wn2-cdot3.rlqfaneo4frubgaiq5bzdzdldg.xx.internal.cloudapp.net
SPARK_LOCAL_DIRS=/mnt/resource/hadoop/yarn/local/usercache/sshuser/appcache/application_1587824952802_0050/blockmgr-0bb0eefb-98dc-4642-a0d9-547a2482c73c
LD_LIBRARY_PATH=/usr/hdp/current/hadoop-client/lib/native:/usr/hdp/current/hadoop-client/lib/native/Linux-amd64-64:
DOTNET_ASSEMBLY_SEARCH_PATHS=/tmp/spark-b6868dd6-ceb3-4ef7-a88c-c9eb410809d8/userFiles-d2f5ccf9-d46d-49cf-856f-d96925f61cb6
LOGNAME=sshuser
JVM_PID=74656
PWD=/mnt/resource/hadoop/yarn/local/usercache/sshuser/appcache/application_1587824952802_0050/container_e01_1587824952802_0050_01_000007
_=/usr/lib/jvm/zulu-8-azure-amd64/bin/java
LOCAL_DIRS=/mnt/resource/hadoop/yarn/local/usercache/sshuser/appcache/application_1587824952802_0050
PYTHONPATH=/usr/hdp/3.1.2.2-1/spark2/jars/spark-core_2.11-2.4.0.3.1.2.2-1.jar
NM_HTTP_PORT=30060
SPARK_DIST_CLASSPATH=:/usr/hdp/current/spark2-client/jars/*:/usr/lib/hdinsight-datalake/*:/usr/hdp/current/spark_llap/*:/usr/hdp/current/spark2-client/conf:
LOG_DIRS=/mnt/resource/hadoop/yarn/log/application_1587824952802_0050/container_e01_1587824952802_0050_01_000007
PRELAUNCH_OUT=/mnt/resource/hadoop/yarn/log/application_1587824952802_0050/container_e01_1587824952802_0050_01_000007/prelaunch.out
NM_AUX_SERVICE_mapreduce_shuffle=AAA0+gAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA=
NM_PORT=30050
HADOOP_YARN_HOME=/usr/hdp/3.1.2.2-1/hadoop-yarn
USER=sshuser
CLASSPATH=/mnt/resource/hadoop/yarn/local/usercache/sshuser/appcache/application_1587824952802_0050/container_e01_1587824952802_0050_01_000007:/mnt/resource/hadoop/yarn/local/usercache/sshuser/appcache/application_1587824952802_0050/container_e01_1587824952802_0050_01_000007/__spark_conf__:/mnt/resource/hadoop/yarn/local/usercache/sshuser/appcache/application_1587824952802_0050/container_e01_1587824952802_0050_01_000007/__spark_libs__/*:/usr/hdp/current/spark2-client/jars/*:/usr/hdp/3.1.2.2-1/hadoop/conf:/usr/hdp/current/hadoop-client/*:/usr/hdp/current/hadoop-client/lib/*:/usr/hdp/current/hadoop-hdfs-client/*:/usr/hdp/current/hadoop-hdfs-client/lib/*:/usr/hdp/current/hadoop-yarn-client/*:/usr/hdp/current/hadoop-yarn-client/lib/*:/mnt/resource/hadoop/yarn/local/usercache/sshuser/appcache/application_1587824952802_0050/container_e01_1587824952802_0050_01_000007/mr-framework/hadoop/share/hadoop/mapreduce/*:/mnt/resource/hadoop/yarn/local/usercache/sshuser/appcache/application_1587824952802_0050/container_e01_1587824952802_0050_01_000007/mr-framework/hadoop/share/hadoop/mapreduce/lib/*:/mnt/resource/hadoop/yarn/local/usercache/sshuser/appcache/application_1587824952802_0050/container_e01_1587824952802_0050_01_000007/mr-framework/hadoop/share/hadoop/common/*:/mnt/resource/hadoop/yarn/local/usercache/sshuser/appcache/application_1587824952802_0050/container_e01_1587824952802_0050_01_000007/mr-framework/hadoop/share/hadoop/common/lib/*:/mnt/resource/hadoop/yarn/local/usercache/sshuser/appcache/application_1587824952802_0050/container_e01_1587824952802_0050_01_000007/mr-framework/hadoop/share/hadoop/yarn/*:/mnt/resource/hadoop/yarn/local/usercache/sshuser/appcache/application_1587824952802_0050/container_e01_1587824952802_0050_01_000007/mr-framework/hadoop/share/hadoop/yarn/lib/*:/mnt/resource/hadoop/yarn/local/usercache/sshuser/appcache/application_1587824952802_0050/container_e01_1587824952802_0050_01_000007/mr-framework/hadoop/share/hadoop/hdfs/*:/mnt/resource/hadoop/yarn/local/usercache/sshuser/appcache/application_1587824952802_0050/container_e01_1587824952802_0050_01_000007/mr-framework/hadoop/share/hadoop/hdfs/lib/*:/mnt/resource/hadoop/yarn/local/usercache/sshuser/appcache/application_1587824952802_0050/container_e01_1587824952802_0050_01_000007/mr-framework/hadoop/share/hadoop/tools/lib/*:/usr/hdp/3.1.2.2-1/hadoop/lib/hadoop-lzo-0.6.0.3.1.2.2-1.jar:/etc/hadoop/conf/secure::/usr/hdp/current/spark2-client/jars/*:/usr/lib/hdinsight-datalake/*:/usr/hdp/current/spark_llap/*:/usr/hdp/current/spark2-client/conf::/mnt/resource/hadoop/yarn/local/usercache/sshuser/appcache/application_1587824952802_0050/container_e01_1587824952802_0050_01_000007/__spark_conf__/__hadoop_conf__
PRELAUNCH_ERR=/mnt/resource/hadoop/yarn/log/application_1587824952802_0050/container_e01_1587824952802_0050_01_000007/prelaunch.err
PYTHONUNBUFFERED=YES
HADOOP_TOKEN_FILE_LOCATION=/mnt/resource/hadoop/yarn/local/usercache/sshuser/appcache/application_1587824952802_0050/container_e01_1587824952802_0050_01_000007/container_tokens
SPARK_USER=sshuser
LOCAL_USER_DIRS=/mnt/resource/hadoop/yarn/local/usercache/sshuser/
HADOOP_HOME=/usr/hdp/3.1.2.2-1/hadoop
SPARK_LOG_URL_STDERR=http://wn2-cdot3.rlqfaneo4frubgaiq5bzdzdldg.xx.internal.cloudapp.net:30060/node/containerlogs/container_e01_1587824952802_0050_01_000007/sshuser/stderr?start=-4096
DOTNET_WORKER_SPARK_VERSION=2.4.0
PYTHON_WORKER_FACTORY_SECRET=d8bf48a9537342c0995cf5df9cef7fe5208d162b877288a9af7e1efbbf54d530
SHLVL=2
HOME=/home/
NM_AUX_SERVICE_spark2_shuffle=
CONTAINER_ID=container_e01_1587824952802_0050_01_000007
MALLOC_ARENA_MAX=4
```

imback82 commented 4 years ago

@mllab-nl Did you try --deploy-mode cluster? If you are deploying in client mode, you need to set DOTNET_ASSEMBLY_SEARCH_PATHS locally (spark.yarn.appMasterEnv.DOTNET_ASSEMBLY_SEARCH_PATHS is for cluster mode).
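
A debugging sketch that builds on the repro UDF above: reading the variable inside the UDF body shows what the worker process on the executor actually received. It can only run once the UDF assembly itself loads successfully (e.g. in cluster mode), and the helper name is made up:

```csharp
using System;
using Microsoft.Spark.Sql;
using static Microsoft.Spark.Sql.Functions;

namespace MySparkApp
{
    static class EnvProbe
    {
        // Call from Main after building `df` as in the repro sketch above.
        public static void Run(DataFrame df)
        {
            // The lambda executes inside Microsoft.Spark.Worker on the executor,
            // so the value it returns is the one that process actually sees.
            Func<Column, Column> probeEnv = Udf<string, string>(
                _ => Environment.GetEnvironmentVariable("DOTNET_ASSEMBLY_SEARCH_PATHS") ?? "<unset>");

            df.Select(probeEnv(df["value"])).Show(1, 200);
        }
    }
}
```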

mllab-nl commented 4 years ago

Thanks @imback82. Indeed, in cluster mode the variables are set correctly.