Open etrain opened 7 years ago
Thinking through this change today, I'm not so sure it's necessary at the moment. SparkSession is part of the SparkSQL namespace and primarily designed to support Dataset access. We need it in the Amazon pipeline because we're using SparkSQL's json decoding to load up json files, but then immediately convert the result to an RDD.
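The load-then-convert pattern described above can be sketched roughly as follows. This is a minimal illustration, not the actual Amazon loader: the object name `JsonToRddSketch`, the helper `loadJsonAsRdd`, and its `path` parameter are hypothetical, and the only Spark calls used are `SparkSession.read.json` and `Dataset.rdd`.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{Row, SparkSession}

// Hypothetical sketch; names are illustrative, not taken from the codebase.
object JsonToRddSketch {
  // Decode JSON files with SparkSQL's reader, then immediately drop
  // down to an RDD of generic Rows for the rest of the pipeline.
  def loadJsonAsRdd(spark: SparkSession, path: String): RDD[Row] = {
    val df = spark.read.json(path) // DataFrame, i.e. Dataset[Row]
    df.rdd                         // leaves the Dataset API right away
  }
}
```

The SparkSession is only needed for the `read.json` call here; everything downstream works on the returned RDD.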
To really jump on the Spark 2.0 train, I would recommend the following:
SparkSession and return a Dataset.
Dataset[T] as well as RDD[T]
and do so in a way that takes advantage of the codegen features of Spark 2.
For the sake of consistency, it would be nice to have the Amazon Loader/Pipeline deal with SparkContexts rather than SparkSessions. Unfortunately, this can't easily happen internally to the loader because there is no public interface for creating a SparkSession given a SparkContext.
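For illustration, the closest thing in the public API is the session builder, which does not accept a SparkContext as an argument but reuses one that is already running in the JVM. A rough sketch, assuming `sc` is the active context; the object and method names are hypothetical:

```scala
import org.apache.spark.SparkContext
import org.apache.spark.sql.SparkSession

// Hypothetical helper; not part of the loader discussed in this issue.
object SessionFromContextSketch {
  // Assumption: `sc` is the active SparkContext in this JVM. The builder
  // cannot be handed a context directly, so we pass a copy of its conf and
  // rely on getOrCreate() picking up the already-running context.
  def sessionFor(sc: SparkContext): SparkSession =
    SparkSession.builder().config(sc.getConf).getOrCreate()
}
```

This depends on global state (the active context) rather than on the `sc` value itself, which is part of why it is awkward to hide inside a loader.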
I'm happy to leave this issue open, but will probably assign a 0.5.0 milestone to it, since I'd rather see 2 and 3 get handled along with it.
Let me know what you think @tomerk @shivaram
Yeah I think that sounds reasonable.
We are using SparkContext throughout loaders and example pipelines. It makes sense to move these to using SparkSession given that we're relying on Spark 2.0.