NLeSC / team-atlas


scala spark vs. py-spark #72

Closed: meiertgrootes closed this 4 years ago

meiertgrootes commented 4 years ago

As a member of team atlas, looking to compare Dask and Spark, I would like to briefly conduct a 'literature review' of the Scala implementation vs. the PySpark API to Spark. This will help inform how best to proceed with the Spark vs. Dask comparison.

rogerkuou commented 4 years ago

After discussion with Sonja, we consider this done. A potential improvement: instead of parallelizing across nodes, we could use a single node with parallel processes. This saves CPU hours, since usage is charged per node (24 cores).

meiertgrootes commented 4 years ago

@rogerkuou can you please elaborate on this? Did you perhaps close the wrong issue?

rogerkuou commented 4 years ago

Sorry, I commented on the wrong issue.

rogerkuou commented 4 years ago

@meiertgrootes Here is a summary I made for the comparison. Would you like to review it?

PySpark vs Scala Spark

API support

There are three basic APIs in Spark: Resilient Distributed Datasets (RDD), DataFrame, and Dataset. RDD is the fundamental data structure of Spark, while DataFrame and Dataset are used more in Spark's high-level tools, such as SparkSQL and MLlib.

  1. RDD: Supported by both Scala and Python. In general, when operating on RDDs, Scala is considered more efficient than Python, because practically all data passed to and from the Python executor has to go through a socket and a JVM worker. This can cause headaches in both debugging and performance. In most cases, though, the actual performance is workload dependent (a sketch contrasting the RDD and DataFrame APIs follows this list).

  2. DataFrame: A DataFrame is conceptually equivalent to a table in a relational database. It is available in both Scala and Python. In most cases, since Python code is mostly limited to high-level logical operations on the driver, there should be no performance difference between Python and Scala.

  3. Datasets: A Dataset is an extension of DataFrame, which is claimed to provide the benefits of RDDs (strong typing, the ability to use powerful lambda functions) together with the benefits of Spark SQL's optimized execution engine. The Dataset API is only available in Scala and Java, not Python. There are also arguments that, even in Scala, its implementation is too simplistic and does not provide the same performance benefits as DataFrame.
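To make the RDD vs. DataFrame distinction concrete, here is a minimal PySpark sketch (names and data are illustrative, not from this thread) that computes the same per-key sums with both APIs. In the RDD version each lambda runs in a Python worker process; the DataFrame version compiles to a JVM execution plan, so Python only drives the computation:

```python
# Minimal sketch: the same aggregation via the RDD and DataFrame APIs.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("rdd-vs-dataframe").getOrCreate()
sc = spark.sparkContext

# RDD API: row-by-row Python lambdas; data crosses the JVM/Python boundary.
rdd = sc.parallelize([("a", 1), ("b", 2), ("a", 3)])
rdd_sums = rdd.reduceByKey(lambda x, y: x + y).collect()

# DataFrame API: declarative operations executed as JVM plans,
# so Python incurs little overhead beyond driver-side logic.
df = spark.createDataFrame([("a", 1), ("b", 2), ("a", 3)], ["key", "value"])
df_sums = df.groupBy("key").sum("value").collect()

print(rdd_sums, df_sums)
spark.stop()
```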

Tools

  1. MLlib: MLlib is Spark's machine learning (ML) library. Since Spark 2.0, the primary machine learning API for Spark is the DataFrame-based API, no longer the RDD-based one (the RDD-based API is still available, but in maintenance mode; the plan is to remove it in version 3.0). Therefore Python and Scala should be equally well supported.

  2. SparkSQL: SparkSQL is a Spark module for structured data processing. It supports both Python and Scala. In addition, it is mentioned that if one uses Python and wants to work with Pandas/NumPy data together with Spark SQL, PyArrow is recommended for efficiently transferring data between the JVM and Python processes (see the sketch after this list).

  3. GraphX: GraphX is a Spark library for graphs and graph-parallel computation. There is no Python-related documentation on the official doc site. The GraphFrames library provides an alternative graph-processing library with Python bindings.

  4. Spark Streaming: Spark Streaming is an extension of the core Spark API for data streaming. According to the docs from Databricks, it now supports both Python and Scala well.
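For the SparkSQL point above, a minimal sketch of the Arrow-accelerated pandas interchange (this assumes pyarrow and pandas are installed; the config key below is the Spark 3.x name, Spark 2.x used spark.sql.execution.arrow.enabled):

```python
# Minimal sketch: Arrow-accelerated conversion between Spark and pandas.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("arrow-demo").getOrCreate()
spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "true")

df = spark.range(0, 1000).withColumnRenamed("id", "value")

# With Arrow enabled, toPandas() transfers columnar batches instead of
# pickled rows, which is much faster for wide or large DataFrames.
pdf = df.toPandas()

# The reverse direction (pandas -> Spark) also benefits from Arrow.
df2 = spark.createDataFrame(pdf)
print(df2.count())
spark.stop()
```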

Performance

The performance differences mostly matter when operating on RDDs: Scala runs natively on the Java Virtual Machine (JVM), which gives it some speed advantage over Python in most cases. On the other hand, Python may offer stronger isolation than the JVM, because each Python executor runs in its own process, while Scala runs multiple threads within a single JVM. This process-per-executor model, however, may cause significantly higher memory usage.
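As an illustration of where that boundary cost shows up in practice, here is a short, illustrative sketch (not from the thread): a plain Python UDF ships row batches over a socket to a Python worker process, while the equivalent built-in column expression stays entirely inside the JVM.

```python
# Minimal sketch: Python UDF (crosses the JVM/Python boundary) vs.
# a built-in column expression (stays in the JVM).
from pyspark.sql import SparkSession
from pyspark.sql.functions import udf, col
from pyspark.sql.types import LongType

spark = SparkSession.builder.appName("udf-overhead").getOrCreate()
df = spark.range(0, 1_000_000)

# Every batch of rows is serialized to a Python worker and back.
double_py = udf(lambda x: x * 2, LongType())
df.select(double_py(col("id"))).count()

# Evaluated entirely inside the JVM; typically much faster.
df.select(col("id") * 2).count()
spark.stop()
```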

Conclusions & Recommendations

When purely considering Spark performance and API support, Scala is better: it has broader API support (mainly GraphX) and higher performance (mainly due to the native JVM implementation in Scala when operating on RDDs). However, these differences are not very significant. On the other hand, Python provides many other powerful libraries, and is a language we are more familiar with.

My recommendation is that, when starting from scratch, unless there is a significant demand for GraphX or a strong motivation to learn Scala, Python is still the more suitable choice for us.

However, taking into account the legacy script, the choice also depends on how much of a rewrite to Python we want to do. The good news is that, to some degree, Scala Spark and PySpark can talk to each other, using the DataFrame API as a bridge. Some examples are given here.
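A sketch of that bridge pattern, assuming a shared SparkSession (e.g. a mixed-language notebook; the view name shared_table is hypothetical): a DataFrame registered as a temp view in one language is visible from the other through the session catalog.

```python
# Minimal sketch: exchanging DataFrames between Scala and Python via
# temp views in a shared SparkSession. Assume a Scala cell already ran:
#   df.createOrReplaceTempView("shared_table")
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Read the Scala-registered view as a regular PySpark DataFrame.
shared = spark.table("shared_table")
shared.show()

# The reverse direction works the same way: register a view from Python ...
spark.range(0, 10).createOrReplaceTempView("from_python")
# ... and a Scala cell can read it with spark.table("from_python").
```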

meiertgrootes commented 4 years ago

Hi @Ou,

thanks for this. Looking at your summary, I would tend to agree with your assessment for our project, and given the available time, using a known language might be preferable over a potential speed-up (definitely for a broader user base). As the current Scala legacy code pretty much only uses the scheduler, this also shouldn't make that much difference. We might consider (if we have time) implementing something simple but compute-intensive in both to compare (although such comparisons can also be found online).