dotnet/spark

.NET for Apache® Spark™ makes Apache Spark™ easily accessible to .NET developers.
https://dot.net/spark

Spark can't find DLLs specified #912

Open harishukla93 opened 3 years ago

harishukla93 commented 3 years ago

I am new to .NET for Apache Spark and am facing some issues with passing DLLs. Basically, I have some DLL files (from another C# project) that I want to reuse here in my Spark project's UDF.

Error:

[Warn] [AssemblyLoader] Assembly 'Classes, Version=3.0.142.0, Culture=neutral, PublicKeyToken=910ab64095116ac0' file not found 'Classes[.dll,.ni.dll]' in '/tmp/spark-e2e6444a-99fc-42c6-ae15-8a5b328e3038/userFiles-aafb5491-4485-46d9-8e17-0849aed7c57a,/home/ubuntu/project/mySparkApp/bin/Debug/net5.0,/opt/Microsoft.Spark.Worker-1.0.0/'
[2021-04-13T11:16:15.1691280Z] [ubuntu-Vostro] [Error] [TaskRunner] [1] ProcessStream() failed with exception: System.IO.FileNotFoundException: Could not load file or assembly 'Classes, Version=3.0.142.0, Culture=neutral, PublicKeyToken=910ab64095116ac0'. The system cannot find the file specified.

Here I have copied Classes.dll (an external DLL) into /home/ubuntu/project/mySparkApp. Initially I was facing the same error with mySparkApp.dll itself, and copying it into my current directory resolved that; but in the case of this third-party DLL, it still fails to find it.
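For context, this is roughly the shape of the app involved: the UDF lambda below is executed by Microsoft.Spark.Worker on the executor side, so Classes.dll must be resolvable by the worker process, not just by the driver. A minimal sketch (Classes.SomeType.Process is a hypothetical stand-in for the real API in Classes.dll):

using System;
using Microsoft.Spark.Sql;
using static Microsoft.Spark.Sql.Functions;

class Program
{
    static void Main(string[] args)
    {
        SparkSession spark = SparkSession
            .Builder()
            .AppName("mySparkApp")
            .GetOrCreate();

        DataFrame df = spark.Sql("SELECT 'hello' AS value");

        // This lambda is shipped to and executed inside Microsoft.Spark.Worker,
        // so the assembly backing Classes.SomeType must be loadable there.
        Func<Column, Column> transform = Udf<string, string>(
            s => Classes.SomeType.Process(s)); // hypothetical call into Classes.dll

        df.Select(transform(df["value"])).Show();

        spark.Stop();
    }
}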

Here is my .csproj file where I reference Classes.dll:

<Project Sdk="Microsoft.NET.Sdk">

  <PropertyGroup>
    <OutputType>Exe</OutputType>
    <TargetFramework>net5.0</TargetFramework>
  </PropertyGroup>

  <ItemGroup>
    <Reference Include="Classes">
      <HintPath>/home/incs83/project/mySparkApp/Classes.dll</HintPath>
    </Reference>
    <Reference Include="CSharpZip">
      <HintPath>/home/incs83/project/mySparkApp/CSharpZip.dll</HintPath>
    </Reference>
  </ItemGroup>

</Project>

Here is the spark-submit command:

spark-submit \
  --class org.apache.spark.deploy.dotnet.DotnetRunner \
  --master local \
  bin/Debug/net5.0/microsoft-spark-3-0_2.12-1.0.0.jar \
  dotnet bin/Debug/net5.0/mySparkApp.dll

I have spent a lot of time digging into this, still no luck.

clegendre commented 3 years ago

Spark.NET will look for your custom DLLs using the DOTNET_ASSEMBLY_SEARCH_PATHS environment variable. So, just before spark-submit, you can set that variable to the absolute path of the folder containing your DLLs:

set DOTNET_ASSEMBLY_SEARCH_PATHS=absolute_path_to_folder_containing_dlls

(or, on Linux, export DOTNET_ASSEMBLY_SEARCH_PATHS=absolute_path_to_folder_containing_dlls)

You can also copy these DLLs into the Microsoft.Spark.Worker installation folder. (This is what is done in the Databricks environment:)

# Resolve the symlink to the worker binary, then copy the app's
# dependency DLLs into the worker's installation directory.
APP_DEPENDENCIES=/dbfs/apps/dependencies
WORKER_PATH=`readlink $DOTNET_SPARK_WORKER_INSTALLATION_PATH/Microsoft.Spark.Worker`
if [ -f $WORKER_PATH ] && [ -d $APP_DEPENDENCIES ]; then
   sudo cp -fR $APP_DEPENDENCIES/. `dirname $WORKER_PATH`
fi
harishukla93 commented 3 years ago

Thanks for the quick reply.

I copied the DLLs into DOTNET_WORKER_DIR (/opt/Microsoft.Spark.Worker-1.0.0), but it didn't work.

Following the above suggestion, I also tried adding another path: export DOTNET_ASSEMBLY_SEARCH_PATHS="/home/ubuntu/Downloads/NewDLLs"

That adds one more path to the error message, but the error still persists. This is strange; I think I am missing something silly.

Error:

[Warn] [AssemblyLoader] Assembly 'Classes, Version=3.0.142.0, Culture=neutral, PublicKeyToken=910ab64095116ac0' file not found 'Classes[.dll,.ni.dll]' in '/home/ubuntu/Downloads/NewDLLs,/tmp/spark-fa9e5b80-6caa-420f-ad36-1a37f155ba7c/userFiles-3679bf84-7b62-4a29-98fd-218238f3276a,/home/ubuntu/project/mySparkApp/bin/Debug/net5.0,/opt/Microsoft.Spark.Worker-1.0.0/'
[2021-04-13T12:10:24.8533078Z] [incs83-Vostro-3490] [Error] [TaskRunner] [1] ProcessStream() failed with exception: System.IO.FileNotFoundException: Could not load file or assembly 'Classes, Version=3.0.142.0, Culture=neutral, PublicKeyToken=910ab64095116ac0'. The system cannot find the file specified.

harishukla93 commented 3 years ago

I also tried Classes.dll and the other DLLs in a normal C# project with Mono to make sure they are valid. They worked as expected.

suhsteve commented 3 years ago

Have you taken a look at https://docs.microsoft.com/en-us/dotnet/spark/how-to-guides/deploy-worker-udf-binaries? It documents the environment variables you need to set and the Spark configurations you can use.

harishukla93 commented 3 years ago

Yes, this is what I followed. I am running with master=local, and the worker and UDF setup is straightforward with this configuration.

That document also describes some parameters that apply to YARN mode only, so I don't think I am missing anything from it.

suhsteve commented 3 years ago

Looks like you are using .NET 5. Can you try recompiling your app using .NET Core 3.1?

clegendre commented 3 years ago

Good point. I had issues using .NET 5 as well, regarding System.Runtime for example, and I fixed them by downgrading to .NET Core 3.1.

harishukla93 commented 3 years ago

I used this documentation to run my first app: https://dotnet.microsoft.com/learn/data/spark-tutorial/install-dotnet

It directs me to install .NET 5. Anyway, I am installing .NET Core 3.1 now; I have a strong feeling that this will resolve the issue.

harishukla93 commented 3 years ago

Tried with .NET Core 3.1, no luck. This is really strange.

suhsteve commented 3 years ago

@harishukla93 Does the file /home/ubuntu/Downloads/NewDLLs/Classes.dll exist?

harishukla93 commented 3 years ago

Yes, in fact the file is available in all three paths: /home/ubuntu/Downloads/NewDLLs, /home/ubuntu/project/rs-etl-test/bin/Debug/netcoreapp3.1, and /opt/Microsoft.Spark.Worker-1.0.0/.

suhsteve commented 3 years ago

@harishukla93 Was Classes.dll recompiled and copied to /home/ubuntu/Downloads/NewDLLs/ after you recompiled your main app from .NET 5 to .NET Core 3.1?

harishukla93 commented 3 years ago

It seems to be something with this particular Classes.dll. I learned from the source of this DLL that it was built for .NET 4.0 and x86.

This morning I received a new DLL built against .NET Core 3.1, but still no luck.

harishukla93 commented 3 years ago

@harishukla93 Was Classes.dll recompiled and copied to /home/ubuntu/Downloads/NewDLLs/ after you recompiled your main app from .NET 5 to .NET Core 3.1?

I created a new app with 3.1, and with the HintPath mentioned in my .csproj, the build copied the file to bin/Debug/netcoreapp3.1.

So I don't think we even need a separate copy at /home/ubuntu/Downloads/NewDLLs/. I am running spark-submit from bin/Debug/netcoreapp3.1, where I have all the DLLs, and this is also one of the paths where the worker is looking for the DLL.

harishukla93 commented 3 years ago

@suhsteve I have now included the sources directly to get rid of Classes.dll. But I am deserializing some data in the UDF using BinaryFormatter (from System.Runtime.Serialization.Formatters.Binary) and a MemoryStream, and it is giving me the error below:

[Warn] [AssemblyLoader] Assembly 'System.Runtime.Serialization.Formatters.resources, Version=4.0.4.0, Culture=en-IN, PublicKeyToken=b03f5f7f11d50a3a' file not found 'System.Runtime.Serialization.Formatters.resources[.dll,.ni.dll]' in '/tmp/spark-024dfc93-f0fc-4c04-8737-ba0dbc8370bf/userFiles-599198e1-61d3-43f7-b810-c6d5376c2d65,/home/incs83/project/rs-etl-test/bin/Debug/netcoreapp3.1,/opt/Microsoft.Spark.Worker-1.0.0/'
[2021-04-20T06:51:51.5112399Z] [incs83-Vostro-3490] [Warn] [AssemblyLoader] Assembly 'System.Runtime.Serialization.Formatters.resources, Version=4.0.4.0, Culture=en, PublicKeyToken=b03f5f7f11d50a3a' file not found 'System.Runtime.Serialization.Formatters.resources[.dll,.ni.dll]' in '/tmp/spark-024dfc93-f0fc-4c04-8737-ba0dbc8370bf/userFiles-599198e1-61d3-43f7-b810-c6d5376c2d65,/home/incs83/project/rs-etl-test/bin/Debug/netcoreapp3.1,/opt/Microsoft.Spark.Worker-1.0.0/'

@clegendre I learned that BinaryFormatter is obsolete in .NET 5, so I am staying on .NET Core 3.1, but I am still facing this issue.

Please help!!
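For reference, a minimal sketch of the deserialization pattern described above, assuming the serialized payload reaches the UDF as a byte[] (the PayloadCodec wrapper and the types involved are illustrative, not taken from the issue). BinaryFormatter compiles and runs on .NET Core 3.1; it is only marked obsolete starting with .NET 5:

using System;
using System.IO;
using System.Runtime.Serialization.Formatters.Binary;

static class PayloadCodec
{
    // Deserializes a buffer that was produced by BinaryFormatter.Serialize.
    public static object Deserialize(byte[] bytes)
    {
        var formatter = new BinaryFormatter();
        using (var stream = new MemoryStream(bytes))
        {
            // When invoked from inside a UDF, this code runs in
            // Microsoft.Spark.Worker, so the assembly that defines the
            // serialized type must be resolvable on the worker as well.
            return formatter.Deserialize(stream);
        }
    }
}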