dotnet / machinelearning

ML.NET is an open source and cross-platform machine learning framework for .NET.
https://dot.net/ml
MIT License

Constructor for Microsoft.Data.Analysis.DataFrame that takes Microsoft.Spark.Sql.DataFrame as param. #6717

Open userr2232 opened 1 year ago

userr2232 commented 1 year ago

It would just be nice to be able to switch between Spark and the analysis package. I'm trying to do some plotting that I'm not sure how to do with Spark.

A constructor like: public DataFrame(Microsoft.Spark.Sql.DataFrame dataframe)

or a static function like: public DataFrame FromSparkDataFrame(Microsoft.Spark.Sql.DataFrame dataframe)

Describe alternatives you've considered
I've been looking for a way to convert a Spark DataFrame into a RecordBatch so I could use the FromRecordBatch method, but I haven't found any relevant documentation for that either.
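In the meantime, a conversion can be done by hand: collect the Spark DataFrame to the driver and rebuild the columns as Microsoft.Data.Analysis columns. A minimal sketch, assuming a small result set that fits in driver memory and an already-created SparkSession in `spark`; the table and column names here are hypothetical:

```csharp
using System.Linq;
using Microsoft.Data.Analysis;
using Microsoft.Spark.Sql;

// Collect a (small!) Spark DataFrame to the driver. This pulls every
// row over the wire, so filter/aggregate on the Spark side first.
Microsoft.Spark.Sql.DataFrame sparkDf = spark.Sql("SELECT name, age FROM people");
Row[] rows = sparkDf.Collect().ToArray();

// Rebuild each column as a Microsoft.Data.Analysis column.
var name = new StringDataFrameColumn("name");
var age = new Int32DataFrameColumn("age");
foreach (Row row in rows)
{
    name.Append(row.GetAs<string>("name"));
    age.Append(row.GetAs<int>("age"));
}

var analysisDf = new Microsoft.Data.Analysis.DataFrame(name, age);
```

This copies every value row by row; if dotnet/spark ever exposed collected results as Arrow record batches, the existing static method Microsoft.Data.Analysis.DataFrame.FromArrowRecordBatch could avoid the per-row copy.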

Additional context
I don't know if it's relevant, but I'm running my code on Synapse Analytics.

michaelgsharp commented 10 months ago

@luisquintanilla @JakeRadMSFT any thoughts here?

luisquintanilla commented 10 months ago

I think this is a great idea, and the goal would be to enable data sharing between the Spark DataFrame and the Microsoft.Data.Analysis.DataFrame.

Unfortunately, the dotnet/spark project is not being actively maintained at the moment. So even if this work is done, it'd be difficult to keep it working with deprecated versions of dotnet/spark.

Also, @asmirnov82 has been making amazing improvements to the Data.Analysis.DataFrame, including work on Arrow, which I think could be the way to enable this scenario.

luisquintanilla commented 10 months ago

@michaelgsharp I've marked this as Future but at the moment we're blocked on dotnet/spark.

dbeavon commented 7 months ago

Hi @luisquintanilla

I'd like to help with the Spark side of things.

However, I do have concerns about the way the ML.NET libraries (Microsoft.Data.Analysis.DataFrame) were introduced into that project.

Overloading the class name "DataFrame" in Spark with the ML.NET variety causes confusion. It makes things more difficult for Spark beginners who are trying to build applications with .NET: they become very uncertain about which version of the DataFrame to pick, and which type of Spark delegates to use, for greenfield development work. Most Spark application developers should be guided toward the "normal" Spark DataFrame, not the ML.NET variety.

I'm constantly having to clarify the code to avoid confusion, like so:

using FxDataFrame = Microsoft.Data.Analysis.DataFrame;
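To illustrate the ambiguity (a minimal, hypothetical example): once both namespaces are imported, the bare name "DataFrame" no longer compiles, and the alias is the only way to keep the code readable:

```csharp
using Microsoft.Data.Analysis;
using Microsoft.Spark.Sql;
// Both namespaces define a type named DataFrame, so an unqualified
// "DataFrame" is ambiguous here and fails to compile.
using FxDataFrame = Microsoft.Data.Analysis.DataFrame;

class Example
{
    // Spark's distributed DataFrame, spelled out in full.
    void TakesSparkFrame(Microsoft.Spark.Sql.DataFrame df) { }

    // The in-memory ML.NET DataFrame, via the alias.
    void TakesAnalysisFrame(FxDataFrame df) { }
}
```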

To be honest, the ML.NET portions of that project should be split out and/or brought back into this community, where people care more about them. It is asking too much of the Spark community to maintain compatibility with two versions of "DataFrame" when we are having trouble maintaining the project as a whole!

I know that you folks have invested a ton of effort into both of these GitHub projects over the years. But I think the other community would benefit from some simplification, considering that Microsoft no longer has any paid maintainers on that project. Any kind of maintenance of the Spark project would be better than none; even removing code and simplifying the project would be better than letting the whole thing wither and die. Hopefully this makes sense.