Closed meiertgrootes closed 4 years ago
After discussion with Sonja, we consider this done. A potential improvement could be to use a single node with parallel processes instead of parallelism across nodes. This would save CPU hours, since usage is charged per node (24 cores).
@rogerkuou can you please elaborate on this? Did you perhaps close the wrong issue?
Sorry commented on the wrong issue
@meiertgrootes Here is a summary I made for the comparison. Would you like to review it?
There are three basic APIs in Spark: Resilient Distributed Datasets (RDDs), DataFrames, and Datasets. RDD is the fundamental data structure of Spark, while DataFrames and Datasets are used more in Spark's high-level tools, such as Spark SQL and MLlib.
RDD: Supported by both Scala and Python. In general, when operating on RDDs, Scala is considered more efficient than Python, because practically all data that goes to and from the Python executor has to be passed through a socket and a JVM worker. This can cause headaches in both debugging and performance. In most cases, however, the actual performance difference is case dependent.
DataFrame: A DataFrame is conceptually equivalent to a table in a relational database. It is available in both Scala and Python. In most cases, since Python code is mostly limited to high-level logical operations on the driver, there should be no performance difference between Python and Scala.
Datasets: A Dataset is an extension of DataFrame, which is claimed to provide the benefits of RDDs (strong typing, the ability to use powerful lambda functions) together with the benefits of Spark SQL's optimized execution engine. The Dataset API is only available in Scala and Java, not Python. There are also arguments that, even in Scala, its implementation is too simplistic and doesn't provide the same performance benefits as DataFrame.
MLlib: MLlib is Spark's machine learning (ML) library. Since Spark 2.0, the primary machine learning API for Spark is DataFrame-based rather than RDD-based (the RDD-based API is still available but in maintenance mode; the plan is to remove it in version 3.0). Therefore Python and Scala should be equally well supported.
SparkSQL: Spark SQL is a Spark module for structured data processing. It supports both Python and Scala. In addition, it is mentioned that if one uses Python and wants to work with Pandas/NumPy data together with Spark SQL, then PyArrow is recommended for efficiently transferring data between JVM and Python processes.
GraphX: GraphX is a Spark library for graphs and graph-parallel computation. There is no Python-related documentation on the official doc site. The GraphFrames library provides an alternative graph processing library with Python bindings.
Spark Streaming: Spark Streaming is an extension of the core Spark API for data streaming. According to the docs from Databricks, it now supports both Python and Scala well.
The performance differences mostly matter when operating with RDDs: Scala runs on the Java Virtual Machine (JVM), which gives it some speed advantage over Python in most cases. On the other hand, Python may offer stronger isolation than the JVM, because each Python executor runs in its own process, while Scala runs a single JVM with multiple threads. This can, however, cause significantly higher memory usage.
When purely considering Spark performance and API support, Scala is better: it has better API support (mainly GraphX) and higher performance (mainly due to the native JVM implementation, when operating with RDDs). However, these differences are not very significant. On the other hand, Python provides many other powerful libraries, and is a language we are more familiar with.
My recommendation is: when starting from scratch, unless there is a significant demand for GraphX, or a strong motivation to learn Scala, Python is the more suitable choice for us.
However, taking into account the legacy script, the choice also depends on how much of a rewrite to Python we want to do. The good news is that, to some degree, Scala Spark and PySpark can talk to each other, using the DataFrame API as a bridge. Some examples are given here.
Hi @Ou,
thanks for this. Looking at your summary, I would tend to agree with your assessment for our project, and given the available time, using a known language might be preferable over a potential speed-up (definitely for a broader user base). As the current Scala legacy code pretty much only uses the scheduler, this also shouldn't make that much difference. We might consider (if we have time) implementing sth. simple but compute-intensive in both to compare (although such comparisons can also be found online).
As a member of team atlas, looking to compare Dask and Spark, I would like to briefly conduct a 'literature review' of the Scala implementation vs. the PySpark API to Spark. This will help inform how best to proceed with the Spark-Dask comparison.