DataFrame enhancements - Githubissues

GKrivosheev-rms commented 2 years ago

I see dozens of issues and enhancement suggestions for DataFrame in Microsoft.Data.Analysis namespace untouched for almost a year. Are there any resources allocated to address those? Is the project dead? Are there any plans to fund the work on those features in the future? Should we base any future development on these?

Specific enhancements desired:

Array/VBuffer column types
Sort by multiple columns
GroupBy by multiple columns
Parquet read/write (currently ParquetSharp.DataFrame has some limited support)

luisquintanilla commented 2 years ago

Hi @GKrivosheev-rms

Thanks for raising this issue. We're planning on evaluating the data preparation / data wrangling story in the coming months as outlined in the roadmap. We suspect the DataFrame API has a role to play there but until we have a clearer picture on common uses, asks, and pain points with the existing API, there is no active development on the DataFrame API at this time. That doesn't mean the project is dead or issues and feature requests like these aren't being taken into account. They are going to help frame our investigations and prioritize our efforts. Because the DataFrame API is currently in preview and we don't expect to add new features within the next couple of months, personally I would not take hard dependencies on it at this time for critical systems.

Let us know if you have additional questions or issues.

GKrivosheev-rms commented 2 years ago

Thanks, Luis!

GKrivosheev-rms commented 2 years ago

Luis, just to give you some context, we are considering the DataFrame and related code to build a natural disaster modeling framework for RMS / Moody's Analytics that underpins the trillion-dollar catastrophe (re)insurance industry. The columnar data type fits nicely for processing insurance losses while doing large-scale analytics and data processing. It's a very nice paradigm. However, in order for us to use it, it needs support and the basic enhancements listed above.

luisquintanilla commented 2 years ago

Tagging for visibility: @GKrivosheev-rms

Thanks Gleb for providing additional context around your scenario. To clarify, you're looking to use DataFrame for data processing and analytics, not exactly for building predictive analytics / machine learning models? If so, have you taken a look at .NET for Apache Spark?

It has its own implementation of DataFrames, which supports:

Array column types
Sort by multiple columns
GroupBy multiple columns
Parquet Read / Write support

Not sure if that would help solve your problem, but thought I'd mention it.

Here's an E2E example of .NET for Apache Spark and ML.NET as well as standalone examples from the .NET for Apache Spark repo.

GKrivosheev-rms commented 2 years ago

Thanks for the suggestion, @luisquintanilla. I'll take a look.

A few questions:

Why are there two very similar implementations of dataframes (Spark and Data Analytics)? Is there something in Analytics dataframes that Spark dataframes can't do? Are there plans to consolidate?
Do Spark dataframes or DF operations require a full Spark engine installed and running for single-machine operations?
For ML and data prep workloads for analytics DataFrames, can I apply ML.NET transforms, readers, writers and learners? If the answer is yes, then how can it work without supporting metadata and vector/VBuffer type columns?
What percent of ML.NET features are supported via dataframes? Do you have any samples?

Regards, Gleb

luisquintanilla commented 2 years ago

@GKrivosheev-rms great questions. I've tried to answer them below.

The DataAnalytics DataFrame can be thought of as similar to the Pandas DataFrame, whereas the .NET for Spark DataFrame is just the DataFrame implementation Spark uses. As a result, you can use the DataAnalytics DataFrame locally on your PC without any dependencies, while Spark DataFrames run on the Spark engine, which you could run on your own PC, though some setup and dependencies are required. At the moment there are no plans for consolidation, though there is some interop. Here are some examples of that:

You need the full Spark engine installed to run operations on Spark DataFrames. These are the setup instructions. You could also use a cloud service / product like Azure Synapse, HDInsight, Databricks, AWS EMR, etc. .NET for Spark runs anywhere that Spark runs, including a single machine like your PC. You can also use .NET notebooks locally on your PC if you prefer a more interactive way of working with Spark than spark-submit jobs on the command line.
The short answer is yes, though the interop between DataAnalytics DataFrames and ML.NET is limited at the moment. DataFrame implements IDataView so you can take a DataFrame and use it like you would an IDataView, but going from IDataView to DataFrame doesn't always work because some types like vector/VBuffer aren't supported. Here are some examples of using a DataFrame for training and inferencing:

Use DataAnalytics DataFrames for training (This is the example I mentioned previously).
Use DataAnalytics DataFrame to make predictions with an ML.NET model

I don't have an exact percentage of ML.NET features supported by DataFrames, but as the examples I've included above show, you can use DataFrames for training and inferencing, so long as the data types you're working with are supported.

Hope this helps. Happy to clarify anything.

aloneguid commented 1 year ago

To add here, Parquet.Net, which is already used in ML.NET, has full built-in support for DataFrame read and write.

There is a sample C# interactive notebook demonstrating basic use (it's a one-liner) as well. It just works.

dotnet / machinelearning

DataFrame enhancements #6088