Open dbeavon opened 7 months ago
@luisquintanilla Are you ok with this?
I think that having another library (Microsoft.Spark.Miscellaneous.DataAnalysis) is a good compromise and avoids losing the investments that were made on behalf of the ML community. It would only need to be introduced in the driver/runner project. Microsoft.Data.Analysis would remain in the .NET worker next to the executors.
Is your feature request related to a problem? Please describe.
I'd like to deprecate Microsoft.Data.Analysis from this project, or at least move it out of Microsoft.Spark into a distinct assembly that must be added separately to the .NET driver projects.
It can remain in the .NET worker for now (Microsoft.Spark.Worker).
Describe the solution you'd like
I'm tired of doing the following:
This happens too frequently and it is nuts. This nonsense should not be necessary - especially not for a mission-critical class like DataFrame. The DataFrame class is perhaps the most fundamental component of Spark. The name should not be overloaded to mean two different things within the same Microsoft.Spark assembly.
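For readers who have not run into the clash, here is a minimal, self-contained sketch of the kind of workaround this usually means. It assumes the usual C# fix for two same-named types in scope, a `using` alias. The `Demo.*` namespaces are hypothetical stand-ins for Microsoft.Spark.Sql and Microsoft.Data.Analysis, so the sample compiles without either package. It is not taken from the original report.

```csharp
using System;
using Demo.Spark.Sql;
using Demo.Data.Analysis;
// Without this alias, the unqualified name "DataFrame" below is ambiguous
// (compiler error CS0104), because both imported namespaces define it.
using DataFrame = Demo.Spark.Sql.DataFrame;

namespace Demo.Spark.Sql
{
    // Hypothetical stand-in for Microsoft.Spark.Sql.DataFrame.
    public sealed class DataFrame
    {
        public string Describe() => "Spark DataFrame";
    }
}

namespace Demo.Data.Analysis
{
    // Hypothetical stand-in for Microsoft.Data.Analysis.DataFrame.
    public sealed class DataFrame
    {
        public string Describe() => "Data.Analysis DataFrame";
    }
}

namespace Demo.App
{
    public static class Program
    {
        public static void Main()
        {
            // The alias pins the short name to the Spark type.
            DataFrame df = new DataFrame();
            Console.WriteLine(df.Describe()); // prints "Spark DataFrame"
        }
    }
}
```

In a real project the alias line would presumably read `using DataFrame = Microsoft.Spark.Sql.DataFrame;`, and it has to be repeated in every source file that imports both namespaces.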
It is somewhat unconventional, but marking all the related UDFs as deprecated would still allow that version of DataFrame to be used while cautioning users to stay away.
To take it a step further, we would move the related UDF code to a distinct assembly (Microsoft.Spark.Miscellaneous.DataAnalysis). This miscellaneous assembly would be added to Spark projects only if developers really need to use the "other DataFrame".
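The deprecation step could be sketched roughly as follows, assuming the standard `[Obsolete]` attribute is the mechanism. `LegacyDataFrameUdfs` and its `VectorUdf` signature are hypothetical placeholders, not the actual Microsoft.Spark API. The message text is only an illustration of the proposal.

```csharp
using System;

namespace Demo.Spark.Sql
{
    // Hypothetical stand-in for the UDF helpers built on the "other" DataFrame.
    public static class LegacyDataFrameUdfs
    {
        // The message and the target assembly name follow the proposal in this
        // issue; neither is part of a released package.
        [Obsolete("Prefer Microsoft.Spark.Sql.DataFrame. These helpers may move to " +
                  "Microsoft.Spark.Miscellaneous.DataAnalysis.")]
        public static Func<long, long> VectorUdf(Func<long, long> body) => body;
    }
}

namespace Demo.App
{
    using Demo.Spark.Sql;

    public static class Program
    {
        public static void Main()
        {
            // Still compiles and runs, but the compiler reports warning CS0618 here.
            Func<long, long> addOne = LegacyDataFrameUdfs.VectorUdf(x => x + 1);
            Console.WriteLine(addOne(41)); // prints 42
        }
    }
}
```

Because the attribute is used without `error: true`, existing callers keep building. They only see the warning, which matches the "caution users to stay away" intent.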
Ultimately this change will avoid confusing new Spark engineers. They are often unable to determine which version of the "DataFrame" is the "right" one. That type of confusion is unnecessary. Unfortunately that confusion is encountered all too quickly because of the proximity between the two types of "DataFrames" that are supported by Microsoft.Spark.
Final note: I think Microsoft.Data.Analysis must remain part of the .NET worker for convenience (and to avoid breaking legacy projects).