dotnet / spark

.NET for Apache® Spark™ makes Apache Spark™ easily accessible to .NET developers.
https://dot.net/spark
MIT License

Discussion: Idiomatic F# APIs #41

Open cartermp opened 5 years ago

cartermp commented 5 years ago

This user experience item describes idiomatic APIs for C# and F#: https://github.com/dotnet/spark/blob/master/ROADMAP.md#user-experience-1

I think this would be a good issue to discuss what idiomatic looks like for F# in the context of Spark.

Here's the (basic) sample from the .NET homepage:

// Create a Spark session
let spark =
    SparkSession.Builder()
        .AppName("word_count_sample")
        .GetOrCreate()

// Create a DataFrame
let df = spark.Read().Text("input.txt")

let words = df.Select(Split(df.["value"], " ").Alias("words"))

words.Select(Explode(words.["words"]).Alias("word"))
     .GroupBy("word")
     .Count()

Although this certainly isn't bad, a more idiomatic API could look something like this:

// Create a Spark session
let spark =
    SparkSession.initiate()
    |> SparkSession.appName "word_count_sample"
    |> SparkSession.getOrCreate

// Create a DataFrame
let df = spark |> Spark.readText "input.txt"

let words = df |> DataFrame.map (Split(df.["value"], " ").Alias("words"))

words
|> DataFrame.map (Explode(words.["words"]).Alias("word"))
|> DataFrame.groupBy "word"
|> DataFrame.count

The above is just a starting point for a conversation. It would assume a module of combinators for data frames (and potentially other collection-like structures). This wouldn't be difficult to implement or maintain (the effort would be proportional to maintaining the one-liners in the C# LINQ-style implementation), but I wonder what else could be done to make it feel more natural for F#, and what the best bang for our buck here is.
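As a rough sketch of what such a combinator module might look like, the wrappers below reuse the hypothetical names from the pipeline above (`SparkSession.initiate`, `Spark.readText`, `DataFrame.map`, and so on) and simply forward to the existing object-oriented API. It assumes the `Microsoft.Spark` NuGet package and F# 5 or later (for `#r "nuget: ..."` and `open type`). None of these modules exist in the library today.

```fsharp
// Sketch only: thin, hypothetical wrappers over the Microsoft.Spark API.
// Defining them needs only the NuGet package; actually running the
// pipeline also needs a configured Spark backend.
#r "nuget: Microsoft.Spark"

// Microsoft.Spark.Sql is deliberately not opened, so the module names
// below cannot clash with the SparkSession and DataFrame types.
open type Microsoft.Spark.Sql.Functions // brings Split and Explode into scope

module SparkSession =
    let initiate () = Microsoft.Spark.Sql.SparkSession.Builder()

    let appName (name: string) (builder: Microsoft.Spark.Sql.Builder) =
        builder.AppName(name)

    let getOrCreate (builder: Microsoft.Spark.Sql.Builder) =
        builder.GetOrCreate()

module Spark =
    let readText (path: string) (spark: Microsoft.Spark.Sql.SparkSession) =
        spark.Read().Text(path)

module DataFrame =
    // Named `map` to mirror the proposal above; it forwards to Select.
    let map (column: Microsoft.Spark.Sql.Column) (df: Microsoft.Spark.Sql.DataFrame) =
        df.Select(column)

    let groupBy (column: string) (df: Microsoft.Spark.Sql.DataFrame) =
        df.GroupBy(column)

    // Operates on the grouped result, as in the pipeline above.
    let count (grouped: Microsoft.Spark.Sql.RelationalGroupedDataset) =
        grouped.Count()

// The word-count sample, wrapped in a function so that loading this
// script does not try to start a Spark session.
let wordCounts (inputPath: string) =
    let spark =
        SparkSession.initiate ()
        |> SparkSession.appName "word_count_sample"
        |> SparkSession.getOrCreate

    let df = spark |> Spark.readText inputPath
    let words = df |> DataFrame.map (Split(df.["value"], " ").Alias("words"))

    words
    |> DataFrame.map (Explode(words.["words"]).Alias("word"))
    |> DataFrame.groupBy "word"
    |> DataFrame.count
```

Each wrapper is a one-liner that takes the "subject" as its last argument, which is what makes it usable with `|>`. The cost is mostly in keeping the wrappers in sync with the underlying API surface.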

In other words, I'd love to solicit feedback on the kinds of things that matter most to F# developers interested in using Spark, so that it's possible to stack these up relative to their implementation and maintenance costs.

Also including @isaacabraham, as he tends to be a lot more creative than I am when it comes to these things 😄

isaacabraham commented 5 years ago

Thanks for tagging me :-) A mapping from the C# API, which already looks pipelined and somewhat stateless, to an F# one would definitely be nice, although I imagine most of it would simply be mapping each extension method to a function that supports partial application: moving the first argument to the last position and renaming the function.

Weighing up the cost / benefit of that, I'm not convinced it's worth embarking on that immediately (compared to something like ML.NET, which IMHO needs much more work to be considered F#-friendly).

Other ideas / points might include:

"Quick wins" / "must haves"

Value adds

Data Exploration

The Value Adds and Data Exploration ones could really start to show some of the benefits of working with F# and Spark. Things like compile-time safety over data sets inferred from samples (with IntelliSense) and use of FSI and the REPL could be big wins on the .NET side.

zpodlovics commented 5 years ago

@dsyme, @7sharp9 The most unique value proposition would be F# metaprogramming (staging) that could allow us to implement similar functionality for F# (FlareData/TensorFlare implemented in Scala LMS). Code specialization (~collapsing abstractions) could provide orders of magnitude performance improvement.

Flare: Optimizing Apache Spark with Native Compilation for Scale-Up Architectures and Medium-Size Data

"We present Flare, an accelerator module for Spark that delivers order of magnitude speedups on scale-up architectures for a large class of applications. Inspired by query compilation techniques from main-memory database systems, Flare incorporates a code generation strategy designed to match the unique aspects of Spark and the characteristics of scale-up architectures, in particular processing data directly from optimized file formats and combining SQL-style relational processing with external frameworks such as TensorFlow."

https://www.usenix.org/conference/osdi18/presentation/essertel
https://github.com/Microsoft/visualfsharp/pull/3662#issuecomment-333332298
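
To make "staging" a little more concrete, here is a minimal sketch using the F# quotations that exist today. It is the textbook power-function example, not anything taken from Flare or from Spark .NET, and the function names are made up for illustration. The exponent is known when the expression is built, so the recursion disappears and only a chain of multiplications remains in the generated expression tree.

```fsharp
open Microsoft.FSharp.Quotations

// Builds the expression x * (x * (... * 1)) with n factors of x.
// The recursion runs while the quotation is constructed, not when the
// resulting code would run.
let rec power (n: int) (x: Expr<int>) : Expr<int> =
    if n <= 0 then <@ 1 @>
    else <@ %x * %(power (n - 1) x) @>

// Wraps the specialized body in a lambda: fun x -> x * (x * (... * 1)).
// The variable is created by hand because a splice cannot refer to a
// variable bound inside the enclosing quotation.
let specializedPower (n: int) : Expr<int -> int> =
    let x = Var("x", typeof<int>)
    let body = power n (Expr.Cast<int>(Expr.Var(x)))
    Expr.Cast<int -> int>(Expr.Lambda(x, body))

// Prints the expression tree for the specialized cube function.
printfn "%A" (specializedPower 3)
```

This only constructs and prints the specialized expression tree. Turning such a tree into efficient executable code is the step that, as far as I can tell, would need better language or library support.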

cartermp commented 5 years ago

Relevant language suggestion on improved staging of quotations is here: https://github.com/fsharp/fslang-suggestions/issues/584

I think it's a wonderful idea, and given Spark .NET's existence I can see it being given higher priority than it was given in the past.