dotnet / spark

.NET for Apache® Spark™ makes Apache Spark™ easily accessible to .NET developers.
https://dot.net/spark
MIT License

[FEATURE REQUEST]: Deprecate and/or evict Microsoft.Data.Analysis from the Microsoft.Spark assembly #1171

Open dbeavon opened 7 months ago

dbeavon commented 7 months ago

Is your feature request related to a problem? Please describe.

I'd like to deprecate Microsoft.Data.Analysis in this project, or at least move it out of Microsoft.Spark into a distinct assembly that .NET driver projects must reference explicitly.

It can remain in the .NET worker (Microsoft.Spark.Worker) for now.

Describe the solution you'd like

I'm tired of writing the following:

using FxDataFrame = Microsoft.Data.Analysis.DataFrame;

This happens too frequently and it is nuts. This nonsense should not be necessary - especially not for a mission-critical class like DataFrame. The DataFrame class is perhaps the most fundamental component of Spark. The name should not be overloaded to mean two different things within the same Microsoft.Spark assembly.
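
To make the collision concrete, here is a minimal, self-contained sketch. The `Demo.*` namespaces and the `Kind` property are stand-ins invented for illustration so the snippet compiles without any packages; they only mirror the way Microsoft.Spark.Sql.DataFrame and Microsoft.Data.Analysis.DataFrame share one simple name. With both namespaces imported, an unqualified `DataFrame` is rejected by the compiler as an ambiguous reference (CS0104), so each file ends up needing aliases or fully qualified names.

```csharp
using System;
// Aliases are the usual workaround: one per identically named type.
using SparkDataFrame = Demo.Spark.Sql.DataFrame;
using FxDataFrame = Demo.Data.Analysis.DataFrame;

// Stand-in for the Spark DataFrame (hypothetical, for illustration only).
namespace Demo.Spark.Sql
{
    public class DataFrame { public string Kind => "distributed Spark DataFrame"; }
}

// Stand-in for the in-memory DataFrame (hypothetical, for illustration only).
namespace Demo.Data.Analysis
{
    public class DataFrame { public string Kind => "in-memory DataFrame"; }
}

namespace Demo.App
{
    using Demo.Spark.Sql;
    using Demo.Data.Analysis;

    public static class Program
    {
        // DataFrame df = null;  // would not compile: CS0104, ambiguous reference

        // The aliases let both types appear side by side in one file.
        public static string Describe(SparkDataFrame df) => df.Kind;
        public static string Describe(FxDataFrame df) => df.Kind;

        public static void Main()
        {
            Console.WriteLine(Describe(new SparkDataFrame()));
            Console.WriteLine(Describe(new FxDataFrame()));
        }
    }
}
```

The aliases work, but they have to be repeated in every file that touches both types, which is the friction described above.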

It is somewhat unconventional, but marking all the related UDFs as deprecated would still allow that version of DataFrame to be used while cautioning users to stay away.
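
A rough sketch of what such a deprecation could look like, using the standard `System.ObsoleteAttribute`. The class `LegacyVectorUdfs` and its `VectorUdf` method are hypothetical stand-ins, not the actual Microsoft.Spark signatures; the point is only that existing callers keep compiling but get warning CS0618 with the message.

```csharp
using System;

// Hypothetical stand-in for the UDF entry points that accept the
// Microsoft.Data.Analysis flavor of DataFrame.
public static class LegacyVectorUdfs
{
    // Passing false for the error flag keeps existing code compiling;
    // callers only get a compiler warning carrying this message.
    [Obsolete("Relies on Microsoft.Data.Analysis.DataFrame; prefer the Spark DataFrame API.", false)]
    public static Func<int[], int[]> VectorUdf(Func<int[], int[]> udf) => udf;
}

public static class Program
{
    public static void Main()
    {
#pragma warning disable CS0618 // deliberate use of the deprecated entry point
        Func<int[], int[]> doubled =
            LegacyVectorUdfs.VectorUdf(xs => Array.ConvertAll(xs, x => x * 2));
#pragma warning restore CS0618

        Console.WriteLine(string.Join(",", doubled(new[] { 1, 2, 3 }))); // prints 2,4,6
    }
}
```

Projects that really need the deprecated path can suppress the warning locally, as above, while everyone else gets a clear nudge away from it.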

To take it a step further, we could move the related UDF code to a distinct assembly (Microsoft.Spark.Miscellaneous.DataAnalysis). This miscellaneous assembly would be added to Spark projects only if developers really needed to use the "other DataFrame".

Ultimately this change will avoid confusing new Spark engineers. They are often unable to determine which version of the "DataFrame" is the "right" one. That type of confusion is unnecessary. Unfortunately that confusion is encountered all too quickly because of the proximity between the two types of "DataFrames" that are supported by Microsoft.Spark.

I'm willing to grant that the two DataFrames live in different namespaces, and that helps reduce confusion, of course. However, a new Spark engineer will find both classes as soon as they download the GitHub project for the first time. They are likely to believe that the non-Spark version of DataFrame is more "advanced", more "native" to .NET, or "better" in some way (else why would it be in the Spark project to begin with?). These assumptions are all wrong. I have regretted using the "Microsoft.Data.Analysis" version of DataFrame every time I took that path.

Final note: I think Microsoft.Data.Analysis must remain part of the .NET worker for convenience (and to avoid breaking legacy projects).

dbeavon commented 7 months ago

@luisquintanilla Are you ok with this?

I think that having another library (Microsoft.Spark.Miscellaneous.DataAnalysis) is a good compromise and avoids losing the investments that were made on behalf of the ML community. It would only need to be referenced in the driver/runner project. Microsoft.Data.Analysis would remain in the .NET worker, next to the executors.