dotnet / spark

.NET for Apache® Spark™ makes Apache Spark™ easily accessible to .NET developers.
https://dot.net/spark

[FEATURE REQUEST]: Add support or documentation/examples on how to connect to common Azure services #287

Open joperezr opened 4 years ago

joperezr commented 4 years ago

One of the advantages of using Spark.NET over Scala/Java/Python is that you get the ecosystem benefits of coding in C# with Visual Studio, an ecosystem that also works very well hand in hand with Azure. Most C# developers will use Azure for cloud services, so it seems to me that Spark.NET should have a great experience when reading/writing data from Azure cloud services (like Event Hubs, Cosmos DB, Azure SQL, Azure Storage, etc.).

Today, connecting from Spark.NET requires passing an extra argument (`--jars`) to `spark-submit` that points to the Java libraries providing this support, and even then you are limited to reading and writing with connection configuration you assemble by hand. Those Java libraries have other functionality built in, such as generating connection configurations for the Azure service and other service-specific features, none of which is reachable from Spark.NET.
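
For illustration, a batch read from Azure Blob Storage with today's `--jars` workflow might look roughly like the sketch below. This is only a minimal example under assumptions: the `hadoop-azure` and `azure-storage` jars (versions matching your Spark/Hadoop distribution) are passed to `spark-submit`, and the account, container, and key values are placeholders you would supply yourself.

```csharp
// Minimal sketch, assuming spark-submit is invoked with something like:
//   spark-submit --jars hadoop-azure-<version>.jar,azure-storage-<version>.jar ...
// Account, container, and key values below are placeholders.
using Microsoft.Spark.Sql;

class AzureStorageExample
{
    static void Main(string[] args)
    {
        SparkSession spark = SparkSession
            .Builder()
            .AppName("AzureStorageReadExample")
            .GetOrCreate();

        // Connection configuration has to be wired up by hand today.
        spark.Conf().Set(
            "fs.azure.account.key.<storage-account>.blob.core.windows.net",
            "<storage-account-key>");

        DataFrame df = spark
            .Read()
            .Option("header", "true")
            .Csv("wasbs://<container>@<storage-account>.blob.core.windows.net/data/people.csv");

        df.Show();
    }
}
```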

This issue tracks either adding native support for connecting to Azure services without the need to pass extra `--jars`, or adding examples and documentation to this repo showing how to connect to them today.

imback82 commented 4 years ago

> This issue tracks either adding native support for connecting to Azure services without the need to pass extra `--jars`,

I don't understand how you want to achieve this without specifying `--jars` if the connector is implemented in a jar. This is similar to referencing a DLL. Am I missing something?

joperezr commented 4 years ago

We could do this if we grab the .jar that adds this functionality and pack it inside our Spark .jar, and then add a better abstraction on the .NET side; for example, connection configuration classes that build and set the right parameters for you, similar to what the regular Azure SDKs do. I get that this would make our .jar larger, but if we believe that most customers working with Spark.NET will be using Azure as either a source or a destination, it might be worth considering. Again, that is only a suggestion; if we decide against it, then I believe we should at least add documentation and examples showing how to do this, which I'm happy to contribute since I already have some working examples.
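
To make the suggestion concrete, the kind of abstraction described above could look roughly like the purely hypothetical sketch below: a small options class that builds the Spark configuration for an Azure Storage account so callers don't hand-assemble keys and URIs. None of these types exist in Spark.NET today; the class and member names are invented for illustration.

```csharp
// Hypothetical sketch only; no such type exists in Spark.NET today.
using Microsoft.Spark.Sql;

public class AzureBlobStorageOptions
{
    public string AccountName { get; set; }
    public string AccountKey { get; set; }
    public string Container { get; set; }

    // Apply the account key to the session so wasbs:// paths resolve.
    public void Apply(SparkSession spark) =>
        spark.Conf().Set(
            $"fs.azure.account.key.{AccountName}.blob.core.windows.net",
            AccountKey);

    // Build a fully qualified wasbs:// URI for a path inside the container.
    public string PathFor(string relativePath) =>
        $"wasbs://{Container}@{AccountName}.blob.core.windows.net/{relativePath}";
}
```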

imback82 commented 4 years ago

> We could do this if we grab the .jar that adds this functionality and pack it inside our Spark .jar,

I don't think this is a good idea: the list of connectors can grow, maintaining them would be hard, and so on.

> Again, that is only a suggestion; if we decide against it, then I believe we should at least add documentation and examples showing how to do this, which I'm happy to contribute since I already have some working examples.

Cool. Please create a PR with your examples. We can even add a section dedicated to Azure SDKs.

imback82 commented 4 years ago

We should have documentation similar to https://docs.databricks.com/spark/latest/structured-streaming/data-sources.html.

cc: @rapoth @bamurtaugh @elvaliuliuliu

Niharikadutta commented 3 years ago

@joperezr Please find below a list of some recently added documentation for connecting .NET for Apache Spark applications to common Azure services:

  1. Connect to Azure Storage
  2. Connect to Azure Event Hubs (an Event Hubs read is sketched below)
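
As a companion to that documentation, a streaming read from Azure Event Hubs might look roughly like the following sketch. Assumptions: the `azure-eventhubs-spark` connector jar is supplied via `--jars`, and the connector version in use accepts a plain connection string (newer connector versions require encrypting it with `EventHubsUtils.encrypt` on the JVM side). The connection string is a placeholder.

```csharp
// Minimal sketch, assuming the azure-eventhubs-spark connector jar is passed via --jars.
// <CONNECTION_STRING> is a placeholder for an Event Hubs connection string that
// includes the EntityPath of the hub to read from.
using Microsoft.Spark.Sql;

class EventHubsExample
{
    static void Main(string[] args)
    {
        SparkSession spark = SparkSession
            .Builder()
            .AppName("EventHubsReadExample")
            .GetOrCreate();

        DataFrame stream = spark
            .ReadStream()
            .Format("eventhubs")
            .Option("eventhubs.connectionString", "<CONNECTION_STRING>")
            .Load();

        // Event Hubs delivers the payload in the binary `body` column; cast it to a string
        // and echo the stream to the console.
        stream
            .SelectExpr("CAST(body AS STRING) AS body")
            .WriteStream()
            .Format("console")
            .Start()
            .AwaitTermination();
    }
}
```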