dotnet/spark

.NET for Apache® Spark™ makes Apache Spark™ easily accessible to .NET developers.
https://dot.net/spark
MIT License

Having a UDF in the pipeline breaks consistently on local and on Azure HDInsights/Spark 2.4 setups #494

Closed · mllab-nl closed this issue 4 years ago

mllab-nl commented 4 years ago

Describe the bug: Having a UDF in the pipeline breaks consistently, both locally and on the Azure HDInsights/Spark 2.4 setup.

To Reproduce

1. Follow the 10-minute tutorial: works fine on both the local and the Azure setup.
2. Add a simple UDF that makes words uppercase (see the sketch below): breaks with a FileNotFoundException in both setups.
3. Rerun the app: still breaks in both setups.
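
For reference, a minimal sketch of the kind of UDF described in step 2 (the input path and names are illustrative, not the exact tutorial code):

```csharp
using System;
using Microsoft.Spark.Sql;
using static Microsoft.Spark.Sql.Functions;

namespace MySparkApp
{
    class Program
    {
        static void Main(string[] args)
        {
            SparkSession spark = SparkSession
                .Builder()
                .AppName("udf-repro")
                .GetOrCreate();

            // Any text file with one word per line will do; the path is illustrative.
            DataFrame df = spark.Read().Text("input.txt");

            // The UDF body executes in Microsoft.Spark.Worker, which must be able
            // to locate and load mySparkApp.dll -- the step that fails below.
            Func<Column, Column> toUpper = Udf<string, string>(s => s.ToUpper());

            df.Select(toUpper(df["value"])).Show();
            spark.Stop();
        }
    }
}
```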

Expected behavior: Adding a UDF should not break the app.

Additional context: Worker version Microsoft.Spark.Worker-0.10.0.

```
[2020-04-24T08:56:24.0906298Z] [DESKTOP-XXXXX] [Error] [TaskRunner] [0] ProcessStream() failed with exception:
System.IO.FileNotFoundException: Assembly 'mySparkApp, Version=1.0.0.0, Culture=neutral, PublicKeyToken=null' file not found 'mySparkApp.dll' in 'C:\Users\XXXX\AppData\Local\Temp\spark-5fd960ac-2e4f-4f84-b726-7d7e72aeca23\userFiles-d016a082-5d84-411c-91c5-ec8e859bd516,C:\src\Spark\mySparkApp,C:\bin\Microsoft.Spark.Worker-0.10.0'
   at Microsoft.Spark.Utils.AssemblyLoader.LoadAssembly(String assemblyName, String assemblyFileName) in /_/src/csharp/Microsoft.Spark/Utils/AssemblyLoader.cs:line 122
   at Microsoft.Spark.Utils.UdfSerDe.<>c.<DeserializeType>b__10_0(TypeData td) in /_/src/csharp/Microsoft.Spark/Utils/UdfSerDe.cs:line 260
   at System.Collections.Concurrent.ConcurrentDictionary`2.GetOrAdd(TKey key, Func`2 valueFactory)
   at Microsoft.Spark.Utils.UdfSerDe.DeserializeType(TypeData typeData) in /_/src/csharp/Microsoft.Spark/Utils/UdfSerDe.cs:line 258
   at Microsoft.Spark.Utils.UdfSerDe.Deserialize(UdfData udfData) in /_/src/csharp/Microsoft.Spark/Utils/UdfSerDe.cs:line 160
   at Microsoft.Spark.Utils.CommandSerDe.DeserializeUdfs[T](UdfWrapperData data, Int32& nodeIndex, Int32& udfIndex) in /_/src/csharp/Microsoft.Spark/Utils/CommandSerDe.cs:line 267
   at Microsoft.Spark.Utils.CommandSerDe.Deserialize[T](Stream stream, SerializedMode& serializerMode, SerializedMode& deserializerMode, String& runMode) in /_/src/csharp/Microsoft.Spark/Utils/CommandSerDe.cs:line 243
   at Microsoft.Spark.Worker.Processor.CommandProcessor.ReadSqlCommands(PythonEvalType evalType, Stream stream) in D:\a\1\s\src\csharp\Microsoft.Spark.Worker\Processor\CommandProcessor.cs:line 190
   at Microsoft.Spark.Worker.Processor.CommandProcessor.ReadSqlCommands(PythonEvalType evalType, Stream stream, Version version) in D:\a\1\s\src\csharp\Microsoft.Spark.Worker\Processor\CommandProcessor.cs:line 117
   at Microsoft.Spark.Worker.Processor.CommandProcessor.Process(Stream stream) in D:\a\1\s\src\csharp\Microsoft.Spark.Worker\Processor\CommandProcessor.cs:line 62
   at Microsoft.Spark.Worker.Processor.PayloadProcessor.Process(Stream stream) in D:\a\1\s\src\csharp\Microsoft.Spark.Worker\Processor\PayloadProcessor.cs:line 74
   at Microsoft.Spark.Worker.TaskRunner.ProcessStream(Stream inputStream, Stream outputStream, Version version, Boolean& readComplete) in D:\a\1\s\src\csharp\Microsoft.Spark.Worker\TaskRunner.cs:line 143
```
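
For context, the error message shows the worker probing each directory in a comma-separated search list for mySparkApp.dll and throwing when none of them contains it. A simplified sketch of that probing behavior (an illustration, not the actual AssemblyLoader source; names are made up):

```csharp
using System;
using System.IO;
using System.Reflection;

static class AssemblyProbing
{
    // Simplified illustration: the real worker builds its search list from the
    // Spark userFiles directory, the current directory, and the comma-separated
    // DOTNET_ASSEMBLY_SEARCH_PATHS variable (all visible in the message above).
    public static Assembly LoadFromSearchPaths(string assemblyFileName, string[] searchPaths)
    {
        foreach (string dir in searchPaths)
        {
            string candidate = Path.Combine(dir, assemblyFileName);
            if (File.Exists(candidate))
            {
                return Assembly.LoadFrom(candidate);
            }
        }

        // Corresponds to the FileNotFoundException in the log: the assembly file
        // was found in none of the probed directories.
        throw new FileNotFoundException(
            $"file not found '{assemblyFileName}' in '{string.Join(",", searchPaths)}'");
    }
}
```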

mllab-nl commented 4 years ago

A workaround for the local setup is to run it from bin\Debug\netcoreapp3.1 (with adjusted paths). Is there a workaround available for the Azure HDInsights/Spark 2.4 setup?

mllab-nl commented 4 years ago

It looks like it is described here: https://docs.microsoft.com/en-us/dotnet/spark/how-to-guides/deploy-worker-udf-binaries

elvaliuliuliu commented 4 years ago

@mllab-nl: Did this UDF FAQ guide solve your issue?

mllab-nl commented 4 years ago

Local deployment works with the following cmd script:

```cmd
set DOTNET_ASSEMBLY_SEARCH_PATHS=%cd%\bin\Debug\netcoreapp3.1
echo %DOTNET_ASSEMBLY_SEARCH_PATHS%
%SPARK_HOME%\bin\spark-submit ^
  --class org.apache.spark.deploy.dotnet.DotnetRunner ^
  --master local ^
  bin\Debug\netcoreapp3.1\microsoft-spark-2.4.x-0.10.0.jar ^
  dotnet bin\Debug\netcoreapp3.1\mySparkApp.dll > out 2>&1
```

elvaliuliuliu commented 4 years ago

Can you try with this instruction on the HDI cluster, or use --files to place the DLLs on the executors, as referred to in the parameter options?

mllab-nl commented 4 years ago

Was able to get it working with --files (the #mySparkApp.dll suffix sets the file's name in the executor's working directory):

```bash
$SPARK_HOME/bin/spark-submit --master yarn --num-executors 10 \
  --class org.apache.spark.deploy.dotnet.DotnetRunner \
  --files wasbs://cdotc@cdots.blob.core.windows.net/publish100m/mySparkApp.dll#mySparkApp.dll \
  wasbs://cdotc@cdots.blob.core.windows.net/microsoft-spark-2.4.x-0.10.0.jar \
  wasbs://cdotc@cdots.blob.core.windows.net/publish100m.zip mySparkApp > out 2>&1
```

elvaliuliuliu commented 4 years ago

@mllab-nl : Nice! Good to know. Please let us know if there are any further questions. Thanks!

mllab-nl commented 4 years ago

@elvaliuliuliu I was not able to get the desired --archives option to work. It looks like the environment variables passed via --conf are not set in the executor. Am I doing it wrong?

mllab-nl commented 4 years ago

Not working command:

```bash
$SPARK_HOME/bin/spark-submit --master yarn --num-executors 10 \
  --class org.apache.spark.deploy.dotnet.DotnetRunner \
  --conf spark.yarn.appMasterEnv.DOTNET_ASSEMBLY_SEARCH_PATHS=./udfs/publish \
  --conf spark.yarn.appMasterEnv.XXX=a \
  --archives wasbs://cdotc@cdots.blob.core.windows.net/publish100m.zip#udfs \
  wasbs://cdotc@cdots.blob.core.windows.net/microsoft-spark-2.4.x-0.10.0.jar \
  wasbs://cdotc@cdots.blob.core.windows.net/publish100m.zip mySparkApp > out 2>&1
```

mllab-nl commented 4 years ago

Output of sudo cat /proc/<workerPID>/environ on the executor (NUL separators rendered as line breaks). Note that DOTNET_ASSEMBLY_SEARCH_PATHS points at Spark's userFiles temp directory, not the ./udfs/publish value passed via --conf:

```
SPARK_YARN_STAGING_DIR=wasb://cdotc@cdots.blob.core.windows.net/user/sshuser/.sparkStaging/application_1587824952802_0050
PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/var/lib/ambari-agent
HADOOP_CONF_DIR=/usr/hdp/3.1.2.2-1/hadoop/conf
JAVA_HOME=/usr/lib/jvm/zulu-8-azure-amd64
SPARK_REUSE_WORKER=1
LANG=en_US.UTF-8
SPARK_LOG_URL_STDOUT=http://wn2-cdot3.rlqfaneo4frubgaiq5bzdzdldg.xx.internal.cloudapp.net:30060/node/containerlogs/container_e01_1587824952802_0050_01_000007/sshuser/stdout?start=-4096
NM_HOST=wn2-cdot3.rlqfaneo4frubgaiq5bzdzdldg.xx.internal.cloudapp.net
SPARK_LOCAL_DIRS=/mnt/resource/hadoop/yarn/local/usercache/sshuser/appcache/application_1587824952802_0050/blockmgr-0bb0eefb-98dc-4642-a0d9-547a2482c73c
LD_LIBRARY_PATH=/usr/hdp/current/hadoop-client/lib/native:/usr/hdp/current/hadoop-client/lib/native/Linux-amd64-64:
DOTNET_ASSEMBLY_SEARCH_PATHS=/tmp/spark-b6868dd6-ceb3-4ef7-a88c-c9eb410809d8/userFiles-d2f5ccf9-d46d-49cf-856f-d96925f61cb6
LOGNAME=sshuser
JVM_PID=74656
PWD=/mnt/resource/hadoop/yarn/local/usercache/sshuser/appcache/application_1587824952802_0050/container_e01_1587824952802_0050_01_000007
_=/usr/lib/jvm/zulu-8-azure-amd64/bin/java
LOCAL_DIRS=/mnt/resource/hadoop/yarn/local/usercache/sshuser/appcache/application_1587824952802_0050
PYTHONPATH=/usr/hdp/3.1.2.2-1/spark2/jars/spark-core_2.11-2.4.0.3.1.2.2-1.jar
NM_HTTP_PORT=30060
SPARK_DIST_CLASSPATH=:/usr/hdp/current/spark2-client/jars/*:/usr/lib/hdinsight-datalake/*:/usr/hdp/current/spark_llap/*:/usr/hdp/current/spark2-client/conf:
LOG_DIRS=/mnt/resource/hadoop/yarn/log/application_1587824952802_0050/container_e01_1587824952802_0050_01_000007
PRELAUNCH_OUT=/mnt/resource/hadoop/yarn/log/application_1587824952802_0050/container_e01_1587824952802_0050_01_000007/prelaunch.out
NM_AUX_SERVICE_mapreduce_shuffle=AAA0+gAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA=
NM_PORT=30050
HADOOP_YARN_HOME=/usr/hdp/3.1.2.2-1/hadoop-yarn
USER=sshuser
CLASSPATH=/mnt/resource/hadoop/yarn/local/usercache/sshuser/appcache/application_1587824952802_0050/container_e01_1587824952802_0050_01_000007:/mnt/resource/hadoop/yarn/local/usercache/sshuser/appcache/application_1587824952802_0050/container_e01_1587824952802_0050_01_000007/__spark_conf__:/mnt/resource/hadoop/yarn/local/usercache/sshuser/appcache/application_1587824952802_0050/container_e01_1587824952802_0050_01_000007/__spark_libs__/*:/usr/hdp/current/spark2-client/jars/*:/usr/hdp/3.1.2.2-1/hadoop/conf:/usr/hdp/current/hadoop-client/*:/usr/hdp/current/hadoop-client/lib/*:/usr/hdp/current/hadoop-hdfs-client/*:/usr/hdp/current/hadoop-hdfs-client/lib/*:/usr/hdp/current/hadoop-yarn-client/*:/usr/hdp/current/hadoop-yarn-client/lib/*:/mnt/resource/hadoop/yarn/local/usercache/sshuser/appcache/application_1587824952802_0050/container_e01_1587824952802_0050_01_000007/mr-framework/hadoop/share/hadoop/mapreduce/*:/mnt/resource/hadoop/yarn/local/usercache/sshuser/appcache/application_1587824952802_0050/container_e01_1587824952802_0050_01_000007/mr-framework/hadoop/share/hadoop/mapreduce/lib/*:/mnt/resource/hadoop/yarn/local/usercache/sshuser/appcache/application_1587824952802_0050/container_e01_1587824952802_0050_01_000007/mr-framework/hadoop/share/hadoop/common/*:/mnt/resource/hadoop/yarn/local/usercache/sshuser/appcache/application_1587824952802_0050/container_e01_1587824952802_0050_01_000007/mr-framework/hadoop/share/hadoop/common/lib/*:/mnt/resource/hadoop/yarn/local/usercache/sshuser/appcache/application_1587824952802_0050/container_e01_1587824952802_0050_01_000007/mr-framework/hadoop/share/hadoop/yarn/*:/mnt/resource/hadoop/yarn/local/usercache/sshuser/appcache/application_1587824952802_0050/container_e01_1587824952802_0050_01_000007/mr-framework/hadoop/share/hadoop/yarn/lib/*:/mnt/resource/hadoop/yarn/local/usercache/sshuser/appcache/application_1587824952802_0050/container_e01_1587824952802_0050_01_000007/mr-framework/hadoop/share/hadoop/hdfs/*:/mnt/resource/hadoop/yarn/local/usercache/sshuser/appcache/application_1587824952802_0050/container_e01_1587824952802_0050_01_000007/mr-framework/hadoop/share/hadoop/hdfs/lib/*:/mnt/resource/hadoop/yarn/local/usercache/sshuser/appcache/application_1587824952802_0050/container_e01_1587824952802_0050_01_000007/mr-framework/hadoop/share/hadoop/tools/lib/*:/usr/hdp/3.1.2.2-1/hadoop/lib/hadoop-lzo-0.6.0.3.1.2.2-1.jar:/etc/hadoop/conf/secure::/usr/hdp/current/spark2-client/jars/*:/usr/lib/hdinsight-datalake/*:/usr/hdp/current/spark_llap/*:/usr/hdp/current/spark2-client/conf::/mnt/resource/hadoop/yarn/local/usercache/sshuser/appcache/application_1587824952802_0050/container_e01_1587824952802_0050_01_000007/__spark_conf__/__hadoop_conf__
PRELAUNCH_ERR=/mnt/resource/hadoop/yarn/log/application_1587824952802_0050/container_e01_1587824952802_0050_01_000007/prelaunch.err
PYTHONUNBUFFERED=YES
HADOOP_TOKEN_FILE_LOCATION=/mnt/resource/hadoop/yarn/local/usercache/sshuser/appcache/application_1587824952802_0050/container_e01_1587824952802_0050_01_000007/container_tokens
SPARK_USER=sshuser
LOCAL_USER_DIRS=/mnt/resource/hadoop/yarn/local/usercache/sshuser/
HADOOP_HOME=/usr/hdp/3.1.2.2-1/hadoop
SPARK_LOG_URL_STDERR=http://wn2-cdot3.rlqfaneo4frubgaiq5bzdzdldg.xx.internal.cloudapp.net:30060/node/containerlogs/container_e01_1587824952802_0050_01_000007/sshuser/stderr?start=-4096
DOTNET_WORKER_SPARK_VERSION=2.4.0
PYTHON_WORKER_FACTORY_SECRET=d8bf48a9537342c0995cf5df9cef7fe5208d162b877288a9af7e1efbbf54d530
SHLVL=2
HOME=/home/
NM_AUX_SERVICE_spark2_shuffle=
CONTAINER_ID=container_e01_1587824952802_0050_01_000007
MALLOC_ARENA_MAX=4
```

imback82 commented 4 years ago

@mllab-nl Did you try --deploy-mode cluster? If you are deploying in client mode, you need to set DOTNET_ASSEMBLY_SEARCH_PATHS locally (spark.yarn.appMasterEnv.DOTNET_ASSEMBLY_SEARCH_PATHS is for cluster mode).
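
A debugging sketch that builds on the repro UDF above: reading the variable inside the UDF body shows what the worker process on the executor actually received. It can only run once the UDF assembly itself loads successfully (e.g. in cluster mode), and the helper name is made up:

```csharp
using System;
using Microsoft.Spark.Sql;
using static Microsoft.Spark.Sql.Functions;

namespace MySparkApp
{
    static class EnvProbe
    {
        // Call from Main after building `df` as in the repro sketch above.
        public static void Run(DataFrame df)
        {
            // The lambda executes inside Microsoft.Spark.Worker on the executor,
            // so the value it returns is the one that process actually sees.
            Func<Column, Column> probeEnv = Udf<string, string>(
                _ => Environment.GetEnvironmentVariable("DOTNET_ASSEMBLY_SEARCH_PATHS") ?? "<unset>");

            df.Select(probeEnv(df["value"])).Show(1, 200);
        }
    }
}
```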

mllab-nl commented 4 years ago

Thanks @imback82. Indeed, in cluster mode the variables are set correctly.