dotnet / spark

.NET for Apache® Spark™ makes Apache Spark™ easily accessible to .NET developers.
https://dot.net/spark
MIT License

Connection refused issue #1183

Open isbn390 opened 2 weeks ago

isbn390 commented 2 weeks ago

Hi, I am a beginner working with Spark and .NET. Let me explain my setup first. I have a Spark master/worker setup deployed using the Bitnami Helm chart; the image is custom-built to include Delta Lake, and I have a Delta table created and stored in Azure Data Lake. My requirement is to create an API that takes some arguments and queries that Delta table. In order to do the processing I need Spark, right? So my thinking was: the API calls Spark in my AKS cluster, passes the arguments to Spark, Spark queries the Delta table in Azure and returns the output to me. I created the API, but when I run it I get the error below:

System.Net.Internals.SocketExceptionFactory+ExtendedSocketException (111): Connection refused 127.0.0.1:5567
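For context, the core of what the API is trying to do looks roughly like this (a sketch; the app name, master URL, storage path, column name, and argument are placeholders):

```csharp
using Microsoft.Spark.Sql;

// Minimal sketch of what the API does: build a session, read the Delta table
// from the data lake, filter by the argument from the request.
// App name, master URL, paths, and column names are placeholders.
SparkSession spark = SparkSession
    .Builder()
    .AppName("DeltaQueryApi")
    .Master("spark://spark-master-svc:7077") // also tried an external IP and a local port
    .GetOrCreate();

string apiArgument = "some-id"; // would come from the API request

DataFrame result = spark.Read()
    .Format("delta")
    .Load("abfss://container@myaccount.dfs.core.windows.net/tables/mytable")
    .Filter(Functions.Col("id").EqualTo(apiArgument));

result.Show();
```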

I tried the master URL with an external IP, with the Spark headless service, even with a local port. I presume the host and port above come from the Visual Studio debug configuration, but how can I configure the session builder in the code? Is there any alternative solution for my requirement? Please share some insights; I believe Spark + Kubernetes + Spark .NET is a fairly common setup, so I am hoping someone can help. Thanks.

dbeavon commented 2 weeks ago

If you just want to read a Parquet file or Delta table from a storage account, then Spark may be overkill, especially as a beginner.

Can you start with Parquet.Net from NuGet and point it at the file from your API? That is what I would do. I probably wouldn't even use Delta tables if you can get by with regular Parquet files.
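Something along these lines is usually enough (a sketch, assuming Parquet.Net 4.x; the file name is a placeholder, and for Azure Data Lake you would open a stream over the blob first, e.g. with the Azure.Storage SDK):

```csharp
using System.IO;
using Parquet;
using Parquet.Data;
using Parquet.Schema;

// Read every column of every row group from a local parquet file.
using Stream file = File.OpenRead("data.parquet");
using ParquetReader reader = await ParquetReader.CreateAsync(file);

DataField[] fields = reader.Schema.GetDataFields();
for (int g = 0; g < reader.RowGroupCount; g++)
{
    using ParquetRowGroupReader rowGroup = reader.OpenRowGroupReader(g);
    foreach (DataField field in fields)
    {
        DataColumn column = await rowGroup.ReadColumnAsync(field);
        // column.Data is an Array holding the values for this field/row group.
    }
}
```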

Spark is for massively parallel algorithms and transformations. If you aren't working with millions of rows and you don't have timing constraints, then you probably don't need Spark.

isbn390 commented 2 weeks ago

Thanks @dbeavon, I will check it out. But still, do you have any idea about the connection issue?

Update: For simplicity, the API will only read the file and show the results. I installed Spark locally, started a master and a worker, and built the code manually with dotnet build. Then I used the resulting .dll to run spark-submit from my command line:

```
spark-submit ^
  --packages io.delta:delta-core_2.12:1.2.0,org.apache.hadoop:hadoop-azure:3.2.0 ^
  --class org.apache.spark.deploy.dotnet.DotnetRunner ^
  --master spark://192.168.1.53:7077 ^
  microsoft-spark-3-2_2.12-2.1.1.jar ^
  dotnet ConsoleApp.dll
```

This setup works fine, and it is local. To replicate the entire setup remotely, I created a Docker image from the code and deployed it in my AKS cluster. There is a Swagger interface for testing, and it shows the same connection issue. Is there any way I can set the host and port? Even if I set my Spark cluster address in the session builder, the API still connects to localhost:5567.
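For what it's worth, my understanding from the dotnet/spark debugging docs is that 5567 is the default port of the JVM-side DotnetBackend, which DotnetRunner normally starts before launching the dotnet process; for local testing the backend can be started on its own in debug mode, and the port the .NET side dials can apparently be overridden with the DOTNETBACKEND_PORT environment variable:

```
spark-submit ^
  --class org.apache.spark.deploy.dotnet.DotnetRunner ^
  --master local ^
  microsoft-spark-3-2_2.12-2.1.1.jar ^
  debug
```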