dotnet / spark

.NET for Apache® Spark™ makes Apache Spark™ easily accessible to .NET developers.
https://dot.net/spark
MIT License

[BUG]: Confused by Project Status #543

Closed dbeavon closed 4 years ago

dbeavon commented 4 years ago

Microsoft Documentation is Inconsistent with GitHub in Regards to Release Status

The Microsoft documentation talks about this project as if it is ready for production use.
https://docs.microsoft.com/en-us/dotnet/spark/what-is-apache-spark-dotnet

But this GitHub project seems to indicate that it is a pre-release, with a version of v0.11.0. I'm curious which of these is true. Should I be comfortable using it in production? Why is the release tagged as a "prerelease"? https://github.com/dotnet/spark/releases

It doesn't appear that there has ever been a release that was not a "pre-release".

Context

I'm using Databricks in Azure and learning Scala. I'd rather not have to learn another programming language just to make use of the Databricks platform. The C# language should be fully capable of handling the overhead/plumbing that is needed to interact with Hive and dataframes.

Additional context

It may be out of context here, but I'm also curious if anyone here has ever used .Net for Spark on Azure Databricks. It is so hard to find Google hits for my searches about this stuff. I'm guessing that this particular combination of technologies (spark/azure/databricks/.net) is still fairly rare. There is a single Microsoft web page on the topic of using .Net for Spark in Databricks. I believe it is telling me that this stuff is ready for production ... there are certainly no warnings about the fact that this project is a pre-release. And it also seems to say that it requires a "Premium" service tier (but doesn't explain why that would be, since this project is a community-supported effort).

Ideally the roadmap for this project would be updated to indicate whether it is indeed production-ready. I was unable to gather that information from the roadmap. I'm still under the impression that this is a pre-release project: https://github.com/dotnet/spark/blob/master/ROADMAP.md

I also reached out to Azure Databricks support, but they don't seem to be aware of this project, or to have ever received any support requests related to the .Net side of things. It raises a red flag when the support technicians have never heard of this project either. Any clarification would be greatly appreciated!

Thanks,
David

rapoth commented 4 years ago

We are sorry for the confusion. We will work with the Microsoft documentation team to add additional details.

Let me make an attempt at answering some of your questions:

Please feel free to ask any more questions. I'd be happy to answer them. Thank you once again for your interest!

dbeavon commented 4 years ago

Thanks for the feedback. That is helpful.

Here is the site that seemed to indicate that a premium tier is needed.
https://docs.microsoft.com/en-us/dotnet/spark/tutorials/databricks-deployment

You will notice that there is a stand-alone notice about creating a trial subscription to "Premium Azure Databricks". I was incorrectly assuming that the .Net side of this requires the "Premium" tier:

When you create your Azure Databricks workspace, you can select the Trial (Premium - 14-Days Free DBUs) pricing tier to give the workspace access to free Premium Azure Databricks DBUs for 14 days.

It is good to hear that Premium is not required. I would love to try this stuff out ... but would probably need some support, as I am new to spark, databricks, azure, and scala (all at the same time). In order to avoid conflating a .Net-specific bug with another generic problem, I would probably need to recreate a problem on the scala side of things first, right?

While I'm at it, I have additional questions for you:

I'm a bit nervous about the prospect of using a technology like ".net for spark" which seems to be on the "bleeding edge". At the same time it seems a bit absurd to me that Microsoft is suggesting that a c#.Net shop like ours should use an unfamiliar language and runtime (scala/JVM) as the primary MPP approach for loading a SQL data warehouse in Azure.

As you are probably aware, the databricks stuff shows up regularly in their "reference architecture" for the "Modern Data Warehouse". E.g., see: https://docs.microsoft.com/en-us/azure/architecture/solution-ideas/articles/modern-data-warehouse The diagram doesn't seem too intimidating at a high level, until you start digging into the underlying pieces. I think databricks has been the biggest surprise to me (... and perhaps it is the reason for the "Polybase" arrow that serves as a detour around databricks?)

rapoth commented 4 years ago

You are right - that paragraph regarding Premium experience is not a 'required' step.

It is good to hear that Premium is not required. I would love to try this stuff out ... but would probably need some support, as I am new to spark, databricks, azure, and scala (all at the same time). In order to avoid conflating a .Net-specific bug with another generic problem, I would probably need to recreate a problem on the scala side of things first, right?

Please feel free to open Issues here on GitHub and we will be glad to help. We do not need you to provide a Scala repro. We just request that you provide us with enough information (e.g., minimum reproducible example) to help us repro the problem on our dev boxes.

If I need support, do I create an azure-databricks support case? Or should I come here for help?

Yes, if you are running on Azure Databricks with a previously successful application and it stopped working, we recommend that you create a ticket with them, since the problem most likely occurred due to a change in their runtime. On the other hand, if you have never been able to get a successful run and are running into issues, please feel free to open issues on this site and we'd be happy to help.

If I'm eventually able to get something working as desired in a pre-production environment, will Microsoft (or databricks) then discourage me from deploying it into production?

It is unlikely since users are allowed to use external libraries on Databricks. However, receiving support from them is a different story - you might want to check with their support to see if they would put you in touch with the appropriate product group so you can request .NET for Apache Spark to be supported officially. From my personal understanding, they already have customer requests, but having more might help them make the right decision.

Is the Synapse version of this stuff (i.e., "out-of-box first-class library") supported for production yet? Or is that in a state of pre-release as well?

Synapse itself is in Public Preview, so the short answer to your question is no. However, we have many customers beginning to build their production workloads on Synapse and the teams are working hard to resolve issues.

Does the Databricks version of this stuff have enough momentum to avoid deprecation in the next couple of years? (And if that happens would the same thing happen on the Synapse side? And would Microsoft potentially force me to migrate my stuff to $ynapse if things don't pan out with .Net-on-Databricks?)

Databricks and Synapse are two entirely different product lines (as you can probably already tell). At the moment, we have invested heavily in making the Synapse .NET for Apache Spark experience rock solid (and have been open sourcing almost all pieces of the tech we are building so our users don't feel locked out). If you have any specific questions related to the roadmaps, the respective support groups are the right place to start. I am sorry I do not have a better answer for you - the GitHub site is home for the open source version of .NET for Apache Spark and I do not have enough information to comment on product roadmaps.

Assuming I were an advanced scala developer and advanced c#.Net developer, what would be the difference in productivity on the spark platform?

None to the best of my knowledge. Scala has IntelliJ as the IDE; .NET has Visual Studio as the IDE. Both require you to submit via command-line. There is an IntelliJ plugin but that's a wrapper around the command-line.

Would the c#.Net side have a great disadvantage when building and deploying solutions to spark?

Not to my knowledge. Please note that writing code in .NET will incur a ~10-15% performance penalty. In return, you get the ability to leverage your C# expertise. If you like the .NET ecosystem, .NET for Apache Spark offers you a way to avoid having to learn another language.

I realize that I won't have "out-of-the-box" support for notebooks in the azure portal and things like that ... but it isn't a huge concern. I'm much more concerned about the iterative inner loop when developing an assembly (i.e., code-compile-deploy-debug).

In Synapse, .NET for Apache Spark comes with a full blown notebook experience. In fact, you can also make it work locally. We don't have instructions yet but we'll write them up in the next month or so.

Have you compared the impact on productivity when developing apps in .Net-vs-scala on spark? I don't want to waste the majority of my day waiting for some esoteric, .Net-specific compile or debug or deployment operation that would not equally affect a scala developer.

Unfortunately, I cannot comment on this. I have used them both extensively and I prefer the .NET side of things (maybe I am biased :)). We encourage you to form your own opinion since this is such a personal question. Within Microsoft, we have worked with large internal teams and they have never complained about productivity. Of course, if you do end up facing issues, please feel free to open an issue here on GitHub and we'd be happy to help.

I'm a bit nervous about the prospect of using a technology like ".net for spark" which seems to be on the "bleeding edge". At the same time it seems a bit absurd to me that Microsoft is suggesting that a c#.Net shop like ours should use an unfamiliar language and runtime (scala/JVM) as the primary MPP approach for loading a SQL data warehouse in Azure.

We understand your concern, and this is the reason for establishing this project (and open sourcing it) in the first place. It may not be perfect, but we are constantly improving based on our users' feedback.

If you require specific guidance on Databricks/Synapse, please get in touch with your Microsoft contact and it is likely they will add me into the thread. I am happy to help if it puts you at ease.

dbeavon commented 4 years ago

Thanks for your feedback. You will probably hear from me again once I get started. I might wait for the "databricks-connect" functionality to become available to us, since that seems like it would give our productivity a significant boost. It would be better if we spent less time waiting between one debug session and the next. Running the driver code locally seems like it would be quite a lot more efficient. While debugging, I would not submit any of the work at all to the remote cluster if possible, in cases where the local environment is sufficient.

I think I have a fairly good understanding of the state of the project. I've also heard back from databricks indirectly and some of them seem to think that .Net is ready for action as well.

It would help if this project's roadmap contained very rough estimates of the ETA for a production release. One year? Five years? If there are closer milestones then it would be helpful to list those as well. When I previously looked at it, the project roadmap hadn't been updated for about a year. That is probably communicating the wrong message.

rapoth commented 4 years ago

I am working on updating it. I'll keep you posted. Thank you for your patience!

dbeavon commented 4 years ago

FYI, I had opened a ticket with Microsoft support. The support ticket was to inquire about the level of support offered for this. They say that .Net for Spark is not supported by Microsoft in Azure Databricks.

Yes, you can use dotnet for Azure Databricks but not supported by Microsoft

Support request ID 120061124005245

They didn't discourage me from using it, however, and suggested I continue working with the open source community for any support issues.

dbeavon commented 4 years ago

I just received a final communication related to my Microsoft support ticket (about Microsoft support for this project). It comes from an operations manager with Microsoft Big Data Team (Chaith. S.).
He says...

Our product group confirmed that there is NO ETA for .Net on Spark with Azure Data Bricks. My support team will be constantly in touch with Product Group and will keep you posted as and when this Feature is made available.

I don't know if I mentioned this before, but the thing that was particularly confusing to me was when I watched a couple of very recent promotional videos on channel 9. Once again, there are no warnings about what is supported and what is not. The videos were very detailed in every other way, and it seems surprising to me that they were leaving out the fact that this stuff is all pre-release (or public preview in the case of Synapse).

I think there must be some deliberate reason why we are getting the mixed messages from the Microsoft Big Data Team. It is clear that on one hand they are NOT ready to support this in the context of “Azure Databricks” but, on the other hand, they don’t want to scare people away from it because there is a derived variation of this that is available as a public preview in "Azure Synapse". In that second context, it is intended to be lucrative/monetized. If all those documentation pages had warning messages about the fact that .Net for Spark is NOT yet supported then it would cause confusion for those VIP customers who are adopting (or had already adopted) Azure Synapse!

Whatever the case may be, I think the Microsoft Big Data Team needs to find a better way to communicate with ALL of their customers about the status of this project. It does not seem like this should be so difficult for such a smart bunch of people!

MikeRys commented 4 years ago

Hi David

Sorry for the mixed message that you are receiving.

The support engineer's answer is saying that we do not have an ETA for having .NET for Spark available out of the box in Azure Databricks. This ETA is mainly owned by Databricks, because Databricks needs to add it; we can only pass the requests along to them (if you haven’t done so, please vote up the user voice request).

Having said that, .NET for Spark works on Databricks Spark for batch mode submission, based on the description provided here. Please note that this usage is supported via the open source support model, e.g., by the community here on GitHub and on Stack Overflow. That includes some Microsoft team members who are monitoring the issues and questions here.

Finally, since Databricks’ own Spark system is not fully API-compatible internally with the open source version of Spark, we cannot guarantee that it works for all versions and scenarios (e.g., notebook integration, out of the box experience). Again, you will have to provide feedback to Databricks, either via the Azure Databricks team or to Databricks directly, that you would like these scenarios to work.

Best regards,
Michael

From: David Beavon notifications@github.com Sent: Wednesday, July 1, 2020 6:50 AM To: dotnet/spark spark@noreply.github.com Cc: Subscribed subscribed@noreply.github.com Subject: Re: [dotnet/spark] [BUG]: Confused by Project Status (#543)
