Open GoEddie opened 1 year ago
Update 30th Aug 2023
1160
but is there any way to bring it back to life?
@GoEddie @GeorgeS2019 since the project is open-source and part of the .NET Foundation is this something the community would be interested in contributing to in order to help move it forward?
@luisquintanilla We need to identify members here who are interested to maintain and merge the e.g. PR
For members here who are interested, please let @luisquintanilla know.
@luisquintanilla I'm definitely interested in helping to keep the project moving forward. I stopped raising and reviewing PRs because they were not getting reviewed and merged, but if there is a committer who is available to do that, or if there was an opportunity for community members to become committers, then I would be interested.
Thanks @GoEddie & @GeorgeS2019. This is definitely something for us to look into how we can unblock you and by extension the project.
@luisquintanilla I can try to help. I've encountered some minor bugs that may be low-hanging fruit.
One thing that might bog us down the most is not having the means of updating the Microsoft nuget package. https://www.nuget.org/packages/Microsoft.Spark/
Can you please explain (or give us links that explain) how the nuget packaging works for community projects, and whether it is still possible to publish new versions of it (even after Microsoft has abandoned the community)? Will someone other than Microsoft need to start publishing a different nuget?
Also I think there are portions of this project that need to be killed as the first order of business (especially if they were done on behalf of stakeholders who have left). For example, I'm pretty eager to kill all the weird cruft related to "Microsoft.Data.Analysis". That is a very minor amount of code that never worked well, and caused a lot of confusion. For example there are critical overloaded class names like "DataFrame" which are part of both namespaces! It was a bad fit for this project. Anyone who still needs to do an integration with "Microsoft.Data.Analysis" can do their own independent work to reintroduce that mess on their own. (That other project isn't even v.1 yet, in any case.)
@GoEddie Since you asked, one reason the activity on the project seems to have stopped is because of employee attrition at Microsoft. Certain full-time developers like Rahul Potharaju @rapoth and Terry Kim @imback82 have left Microsoft and started working at Databricks instead. And the critical PMs like @MikeRys have moved on to other areas within Microsoft. (Hi there Michael! Please come back!)
I am pretty disgruntled about the Synapse side of this story. In 2022 I had migrated all my projects from Databricks to Synapse where Microsoft was trying to monetize .Net for Spark. After spending several months working on this migration of my .Net projects, I encountered a relatively innocent bug that only affected Synapse (not OSS and not Databricks). I reported the bug right around the time that the engineers were exiting the company. Only at the very end of an eight-month support case did they say that they can no longer support .Net, and they are removing it from future versions of Synapse Analytics. It was a pretty painful experience, as you can probably imagine. (As an aside, the bug turned out to be just some stupid DNS configuration issue in their Bionic Beaver image, and wasn't specific to .Net. It would have affected the other language bindings as well.)
It doesn't stop there. After they removed .Net from their 3.3 runtime, this Synapse team started to deliberately spread misinformation to discourage people from using .Net on Spark at all.
See https://learn.microsoft.com/en-us/azure/synapse-analytics/spark/apache-spark-33-runtime
They say this is a "project under the .NET Foundation that currently requires the .NET 3.1 library, which has reached the out-of-support status". Of course, the claim that Spark requires .Net Core 3.1 is categorically false. I'm able to run the project on .Net 6 without any problems. To add insult to injury they tell people "We recommend that users with existing workloads ... migrate to Python or Scala." I'm guessing that they are worried about the possibility that their .Net customers will just leave Synapse and find a better Spark offering. The PG's documentation doesn't even refer people back to this community, in order to pursue an alternative path forward with .Net. I find their communication to be pretty dishonest. And it almost seems like a deliberate attempt to sabotage this project.
I keep hoping we will see first-class support for .Net in Databricks, now that they have scalped some of the smart engineers from Synapse. (I do a regular google search, but it hasn't quite panned out yet.)
I had very high hopes for Synapse when they advertised their first-class .Net language bindings. There were even promotional community sessions where .Net for Spark was discussed, like here:
https://www.youtube.com/watch?v=-VpQheD-vE8 (start watching at 33:00 or so)
In any case, I am not that worried about the future. I am pretty sure .Net isn't going to die any time soon; and neither is Spark. The marriage of this pair is off to a rocky start. But they can't be kept apart forever!
@dbeavon Thanks for filling in the missing details!
I completely agree about getting rid of some of the other components. Probably the first thing to do is to get the core Microsoft.Spark working with the newer versions of Spark, get rid of things like the extensions, and maybe bring them back in the future.
@GoEddie @dbeavon @luisquintanilla
Thanks for not giving up on this project.
The fact that we are having this discussion is somehow sad!
Microsoft cannot advocate the latest AI, large language models, and Copilot without empowering .NET users on Spark.NET, which is the key, and perhaps one of the very few available, big data analytics pipelines for the .NET community to continue supporting Microsoft's latest AI leadership.
As @luisquintanilla pointed out: the project is part of the .NET Foundation, which is open source, and everybody can join to be a part of the .NET success story 😃: Become a member. That Microsoft has currently decided not to invest monetary resources (e.g. paid developers) in this project should not stop it from going forward, IMO. It's just a question of how much the community (= we/everybody) is willing to contribute and take the lead in this. There are other projects like Mono which (I guess) were run by the community for over a decade, bringing .NET to Linux, before Microsoft switched to "Microsoft loves Linux" and started actively working together with the Mono community to merge things together. One great result of that is now .NET MAUI (I hope I get the facts right here).
But nevertheless, I think the current issue this project has is that it is missing a maintainer who is willing to actively review/merge PRs, handle issues, publish new releases etc.
@GoEddie @dbeavon @GeorgeS2019 Could you see yourselves being part of the success story for .NET for Apache Spark?
@leo-schick yes, but I'm not sure what the process would be. At the moment it seems like no maintainers are active. It would be amazing if Microsoft invested paid developers in the project, but even without that we, as a community, could maintain the project. The problem is that we have no access to anything!
As was pointed out earlier by @GeorgeS2019 in his post, a maintainer is missing. If you want to become one, please contact @luisquintanilla
Thanks @leo-schick have you got any contact details for yourself or @luisquintanilla? My email is ed.elliott@outlook.com
I think dotnet-spark can be more liberal about which Apache Spark versions it supports. Instead of erroring out, it can just put out a warning. Apache Spark releases are more frequent, but I have found that existing dotnet-spark code works fine with newer versions of Apache Spark.
But I have to update the dotnet-spark code locally to make it work with new Spark versions (as the nuget packages don't work).
A more liberal version matching policy should reduce some of the ongoing support effort required.
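To make the "warn instead of erroring out" idea concrete, here is a minimal sketch of such a policy. It is written in Python purely for illustration and is not the actual dotnet-spark version check; the names `TESTED_VERSIONS`, `major_minor` and `check_spark_version` are hypothetical, as is the list of tested versions.

```python
import warnings

# Hypothetical list of Spark "major.minor" releases the bindings were tested against.
TESTED_VERSIONS = ("3.0", "3.1", "3.2")


def major_minor(version: str) -> str:
    """Return the 'major.minor' prefix of a version string such as '3.4.1'."""
    return ".".join(version.split(".")[:2])


def check_spark_version(running: str, tested=TESTED_VERSIONS, strict: bool = False) -> bool:
    """Return True if `running` is a tested version.

    For an untested version: raise when `strict` is True (the old behaviour),
    otherwise emit a warning and return False so the caller can carry on.
    """
    if major_minor(running) in tested:
        return True
    message = (
        f"Spark {running} has not been tested with these bindings "
        f"(tested: {', '.join(tested)})."
    )
    if strict:
        raise RuntimeError(message)
    warnings.warn(message + " Continuing anyway.", RuntimeWarning)
    return False
```

With `strict=False` as the default, a job on a newer Spark release would start with a warning in the log rather than failing outright, and anyone who wants the old hard failure could still opt in with `strict=True`.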
So, is there a next step here? I'm also really interested in helping move this project forward @luisquintanilla
Hey @bolcman we are trying, it will take a bit more time but we will get there one way or another!
Let's keep this issue as a way for people to say if they want to contribute. It would be good to get an idea of numbers.
@GoEddie how are things progressing here? I did a chunk of the .NET 6 PR a while back but sadly it was never released. Now .NET 8 is out and .NET 9 is under development...
As well as the publish-to-Nuget problem @dbeavon identified above, I also ran into problems accessing the PR builds on Azure DevOps, which made troubleshooting the tests and other similar issues rather tricky, and I ultimately relied on someone from Microsoft to resolve them (@AFFogarty, if I remember rightly). This will need addressing too.
My company has a small handful of dotnet-spark jobs, but these are in our legacy pile with further investment focussed on Pyspark. dotnet-spark was perfect for us ~3 years ago, but we've now moved on. I think a key part of reviving this project needs to be resurrecting the Spark conversations (perhaps with the assistance of those people who have since moved on to Databricks, if they're willing), to put dotnet-spark on a level with Pyspark in terms of support and documentation.
I'm interested in offering my time to this project if it still has some life in it.
@GoEddie I've been analyzing the .NET market and individual projects for a few years, because .NET has not done so well in the Chinese market. I can give you some detailed evidence about this project here.
I analyzed the major contributors of this project just now and I notice that it is mainly maintained by MSFTs. The founder of this project, @imback82, joined Databricks in April 2022 according to his LinkedIn. He was the most senior developer in the MSFT contributor team, with a principal title. The second major contributor, @suhsteve, is also looking for a new job according to his Linkedin, or he has left Microsoft.
I have no idea if they are from the same team, but it looks like someone at Microsoft obviously made a decision to stop this project.
The major reason may be that .NET is not widely used in the big data market, including using .NET to operate Spark. The nuget download count of this project is even less than 1M, which is very low. ML.NET is not so popular either (the Microsoft.ML package has only reached 6M downloads). I checked with a few data scientists around me over the past few years. Some are my ex-colleagues and some are community friends. No one is using .NET at all.
The problem with this project is a lack of key community contributors like you. And the donation to the .NET Foundation doesn't really help attract new contributors. I've been on the .NET Foundation project committee for a while. To be honest, I don't think this foundation works as expected compared with the Apache foundation and the CNCF. They did promote projects with their social media account. But this kind of promotion only helps developers learn about a new project; it is nowhere near enough to attract developers to join as contributors.
Although I would like to see .NET boom in the market, we have to face the fact that .NET has totally failed to win the data science category. Python and Java are still the leading languages in it. Nothing has changed in the past 5-7 years.
And I did notice that MSFT staff are contributing to some new open source projects like Semantic Kernel and Aspire now. I guess these projects are their new focus. They are changing track from a business perspective.
@GoEddie There is another problem with this project. The project started in July 2019 and maintenance almost stopped at the end of 2022 (or even earlier).
It was maintained for just 3 years. Usually, it takes at least 5 years to attract more contributors, and the major contributors need to keep contributing the whole time. Otherwise, developers may think that the project is dying and will no longer be willing to contribute, although some believe that another hero will fork the project and restart it. I have analyzed a lot of existing .NET open source projects. This kind of fork occasionally happens, but it's very rare.
@tonyqus I appreciate the analysis.
I agree that this project has been stalled a little. I think it is primarily because everyone was waiting to see if Microsoft would come to their senses and try to re-hire some new engineers, like the ones they lost to Databricks (eg. Terry and Rahul). In any case, Microsoft has a lot of issues these days. Based on what I can tell, the Azure Synapse Analytics platform is falling apart, and it has nothing to do with the merits of C#.Net. Microsoft doesn't seem to have a great sense of direction or purpose in the area of big data. The C# bindings for Spark were a very important innovation. But right now Microsoft seems to be losing creativity and they are just dumping all of their customers into a mediocre swamp of tools called "Fabric", whether we like it or not! This approach is not likely to go very well in the long run. The approach is favored by those SaaS customers who were already heavily invested in Power BI. But it seems like an odd strategy for those of us looking for PaaS/IaaS options.
As an aside, one of the major problems I have with Microsoft's big-data platforms is that they are making everyone use some buggy and expensive networking interfaces that are based on "private link". I'm not sure how the other customers are OK with it. My workloads encounter dozens of socket exceptions a day. The SLA claims 99.99% reliability but in practice there are bugs that cause TCP connectivity to fail a LOT more frequently than that. I'm guessing the target audience is mitigating this by increasing the number of retries (which Microsoft would certainly benefit from) ... or perhaps the related users aren't able to distinguish the Microsoft bugs from their own Spark bugs.
Going back to C#.Net.... the language is becoming more popular over time. It won the Tiobe language of the year: https://www.tiobe.com/tiobe-index/
I think it could overtake Java one day (especially since Java is probably being cannibalized by other JVM languages like Kotlin and Scala).
I think it is still too early to say if C# will be adopted as a popular language for Spark workloads. One thing that I've learned about Spark is that you can't just reduce it to a "data science" platform. Or to a "data analyst" tool. It is used for a lot of other types of MPP workloads. It can be found under the covers of many cloud platforms. OSS Spark is a fairly cheap commodity, almost as inexpensive as the VM's that it runs on. I generally think of Spark as a general-purpose "container" for hosting data-oriented software algorithms at scale.
I'm a big fan of Spark but a bigger fan of C#. There are many reasons. The tooling and the nuget ecosystem are both amazing. But I also like the ability to exchange code between a REST API (hosted in an on-premise IIS environment) and a Spark application hosted on an MPP cluster in the cloud. I can re-use the same underlying logic & data, and interact with it via synchronous requests or via asynchronous batches. We can go back and forth very easily without switching between programming languages. There is no need to find a python programmer and ask them to create another copy of the application using a different language, simply for the sake of hosting on a cluster and getting the MPP advantages. I've been building applications using C# for 20 years; yet I'm still finding ways to use it more effectively and efficiently. It is very versatile and there are few applications that would be a bad fit for C#.
I suspect that the audience who would be using C# for their Spark development would not necessarily be data scientists as you imply. They would be software engineers, who are already building software solutions which are hosted in various other containers (eg. in web servers, kubernetes, and so on).
I'm not trying to detract from python. It is a productive language and easy to pick up. But python is never going to take the place of C#.Net. C# is extremely well suited for high-performing applications that need to evolve over a very long period of time. It performs well, has lots of value-type data structures, and even its heap data is very efficient.
On a related note, one of the new performance benefits that I'm very excited about in .Net 8 is the AOT compilation. This should soon be accessible to my Spark jobs as well. What does this mean? It means the UDF's built on C#.net will be so fast that they will probably exceed the performance of the OSS Spark core itself. The only performance implication in selecting C# over Java/Scala is that it will always require Apache Arrow to exchange data between the Spark core and the UDF's.
@GoEddie Are you a committer?
Can you help me review https://github.com/dotnet/spark/pull/1166 ?
That issue (binary serialization) was a concern that was expressed by @AFFogarty
FYI, I really don't think binary serialization is a significant concern, aside from the fact that Microsoft is deprecating their class. We are simply replacing it with another binary serialization library (presumably with the same "vulnerabilities"). In any case the vulnerabilities are greatly overshadowed by the remote code execution that is a key feature of Spark solutions.
Also can you tell me how people become committers or writers on a project like this? I don't have that much prior experience with community github projects. I would love to help in some way. I think it is unlikely that Microsoft will give this project any TLC for another year or two. In the meantime there are obvious things we can do to keep it alive, like keep up with versions of Spark 3.x.x and keep up with .Net 8.
I have spent some time looking at the Spark Connect gRPC API and have put together a new .NET version of DataFrame API that uses the Spark Connect interface, which is actually working pretty well - it works against a local Spark Server as well as Databricks.
If anyone is interested in trying it or contributing, please see: https://github.com/GoEddie/spark-connect-dotnet
It is my hope that one day we can get access to this repo and nuget packages but in the meantime this supports Spark 3.4.0+
Repo:
Nuget:
Hi All,
It is pretty obvious that the project has come to a bit of a halt and I wondered if there was anything that we can do to get it up and running again?
I don't know the reason why it stopped in February, maybe if we knew that we could support it in some way?
I do know that I have had multiple orgs who wanted to use .NET and Spark in Databricks, but in all honesty I couldn't recommend using it as it seems the project is now dead.
What would it take to bring it up to date with both .NET 7 support and support up to Apache Spark 3.4.1?
Is Microsoft willing to invest in the project or is the community able to take it on and progress it?
I'm just hoping to start a discussion really as it is a shame so much work went into this and we were so close to being able to use .NET instead of Python or Scala but it feels frustrating that we can't use .NET.
cc: @MikeRys @AFFogarty @suhsteve
@GeorgeS2019