Open GKrivosheev-rms opened 2 years ago
Hi @GKrivosheev-rms
Thanks for raising this issue. We're planning on evaluating the data preparation / data wrangling story in the coming months as outlined in the roadmap. We suspect the DataFrame API has a role to play there but until we have a clearer picture on common uses, asks, and pain points with the existing API, there is no active development on the DataFrame API at this time. That doesn't mean the project is dead or issues and feature requests like these aren't being taken into account. They are going to help frame our investigations and prioritize our efforts. Because the DataFrame API is currently in preview and we don't expect to add new features within the next couple of months, personally I would not take hard dependencies on it at this time for critical systems.
Let us know if you have additional questions or issues.
Thanks, Luis!
Luis, just to give you some context: we are considering DataFrame and related code to build a natural disaster modeling framework for RMS / Moody's Analytics that underpins the trillion-dollar catastrophe (re)insurance industry. The columnar data type fits nicely for processing insurance losses while doing large-scale analytics and data processing. It's a very nice paradigm. However, for us to use it, it needs support and the basic enhancements listed above.
Tagging for visibility: @GKrivosheev-rms
Thanks, Gleb, for providing additional context around your scenario. To clarify, you're looking to use DataFrame for data processing and analytics, not exactly for building predictive analytics / machine learning models? If so, have you taken a look at .NET for Apache Spark?
It has its own implementation of DataFrames, which supports:
Not sure if that would help solve your problem, but thought I'd mention it.
Here's an E2E example of .NET for Apache Spark and ML.NET as well as standalone examples from the .NET for Apache Spark repo.
Thanks for the suggestion, @luisquintanilla. I'll take a look.
A few questions:
Regards, Gleb
@GKrivosheev-rms great questions. I've tried to answer them below.
spark-submit jobs on the command line.
Hope this helps. Happy to clarify anything.
To add here: Parquet.Net, which is already used in ML.NET, has full built-in support for DataFrame read and write.
There is a sample C# interactive notebook demonstrating basic use (it's a one-liner) as well. It just works.
I see dozens of issues and enhancement suggestions for DataFrame in the Microsoft.Data.Analysis namespace that have been untouched for almost a year. Are there any resources allocated to address them? Is the project dead? Are there any plans to fund work on these features in the future? Should we base any future development on DataFrame?
Specific enhancements desired: