Dask vs. PySpark

MarcSpeckmann commented 1 year ago

Create a pros and cons list for each of these packages. The list should help us decide which of the two is better suited to scaling the project.

MarcSpeckmann commented 1 year ago

Attribute	Dask	PySpark
Reputation (GitHub stars)	10.3k	33.8k
Language	Python	Scala (also supports Python and R)
Ecosystem	Part of the Python ecosystem. Couples with and enhances NumPy, pandas, etc.	Own ecosystem. Integrates with other Apache projects.
Created	2014	2010
Scale	Single node to thousand-node clusters	Single node to thousand-node clusters
Scope	Business intelligence, scientific, and custom use cases	Business intelligence (e.g. SQL and lightweight machine learning)
Machine learning	scikit-learn, XGBoost, dask-ml	Spark MLlib (covers mainly algorithms that are easy to implement on Spark); for anything else, another JVM-based library is preferable
DataFrame	Reuses the pandas API and memory model. No SQL and no query optimizer. Supports random access and pandas index operations.	Own API and memory model. Implements a large subset of SQL and has a high-level query optimizer.
Graphs	Not supported	GraphX library
Arrays	Full support for scalable multi-dimensional arrays following the NumPy model	No native multi-dimensional arrays. 2D arrays in MLlib.
Custom parallelism	Arbitrary task graphs for workloads that do not fit the standard collections	Composes computations out of its high-level primitives (map, reduce, groupby, join, …)
	Lightweight, interoperates with other technologies	All-in-one solution
Cluster	Native Slurm support	Can run on Slurm

Sources: Dask documentation
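
To make the "Custom parallelism" row more concrete, here is a minimal sketch of an arbitrary task graph built with `dask.delayed`. The table names and the `load_table`, `count_rows`, and `total` helpers are hypothetical stand-ins and not part of our code base; the sketch only shows how plain Python functions can be chained into a graph that Dask schedules.

```python
from dask import delayed


@delayed
def load_table(name):
    # Hypothetical stand-in for reading a table: returns one dummy row
    # per character of the table name.
    return list(range(len(name)))


@delayed
def count_rows(rows):
    # Count the rows of a single loaded table.
    return len(rows)


@delayed
def total(counts):
    # Combine the per-table counts into one number.
    return sum(counts)


# Hypothetical table names, used only for illustration.
table_names = ["orders", "customers", "items"]

# Nothing is executed here: each call only adds a node to the task graph.
counts = [count_rows(load_table(name)) for name in table_names]
graph = total(counts)

# compute() runs the graph; the threaded scheduler works on a single machine.
n_rows = graph.compute(scheduler="threads")
print(n_rows)
```

The per-table branches are independent, so the scheduler is free to run them in parallel. In PySpark the same computation would have to be expressed through its high-level primitives instead.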

FatemehAhmadi94 commented 1 year ago

@XxHalbfettxX Thanks Marc, can you please provide a table with the best option for each of our modules?

MarcSpeckmann commented 1 year ago

WIP

Module	Choice	Reason (Dask)	Reason (PySpark)
app_logger	Doesn't matter	Python logging module	Log4j from the PySpark context
check_results	Doesn't matter
cols_grouping	Dask	DBSCAN is available natively via scikit-learn	Not included in MLlib; an external implementation is needed
dataset_clustering	Dask	DBSCAN is available natively via scikit-learn	Not included in MLlib; an external implementation is needed
ed_twolevel_rahas_features
end_to_end_eds
extract_labels
generate_raha_features
saving_results
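
For the two DBSCAN rows, a minimal sketch of the scikit-learn call that the Dask side could reuse directly. The feature vectors and the `eps` and `min_samples` values are made-up toy values, not the ones used in `cols_grouping` or `dataset_clustering`, and the sketch does not show how the call would be distributed with Dask.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Toy 2D feature vectors (assumed for illustration): two tight groups
# and one isolated point.
X = np.array(
    [
        [0.0, 0.0],
        [0.1, 0.0],
        [0.0, 0.1],
        [5.0, 5.0],
        [5.1, 5.0],
        [5.0, 5.1],
        [20.0, 20.0],
    ]
)

# eps and min_samples are arbitrary choices for this toy data.
labels = DBSCAN(eps=0.5, min_samples=2).fit_predict(X)

# Points labelled -1 are treated as noise by DBSCAN.
print(labels.tolist())
```

With PySpark, an equivalent step would need an external DBSCAN implementation, as noted in the table.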

MarcSpeckmann commented 1 year ago

I find it a bit difficult to fill in the table for every module. For some modules there are obvious differences, but for most of them I can't really write anything, because many of the tasks can be solved with PySpark as well as with Dask. At least I think they can be solved with both. I lack the experience to say whether the Spark DataFrame is better or more efficient than Dask's implementation.

FatemehAhmadi94 commented 1 year ago

@XxHalbfettxX Thanks Marc. Let's look at it from other directions. Think about each module and what we keep in memory. For example, when generating features before classification, we keep all features of all tables in memory. When we talk about a whole data lake, which one makes sense to use: PySpark DataFrames, Dask DataFrames, or even pandas? Or we could even avoid storing the data in memory at all... For your calculations, you can take GitTables as a reference and look at the statistics reported there, such as the number of tables, rows, and columns. We don't need a very accurate estimate; we just need to know which one performs better.
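
A back-of-the-envelope version of that memory estimate could look like the sketch below. The function names, the 8 bytes per cell, and all the numbers in the example call are placeholder assumptions and not actual GitTables statistics; the real counts would need to be taken from the GitTables documentation.

```python
def estimate_lake_bytes(n_tables, avg_rows, avg_cols, bytes_per_cell=8):
    # Rough in-memory size if every cell holds one fixed-size value
    # (for example a single float64 feature per cell).
    return n_tables * avg_rows * avg_cols * bytes_per_cell


def fits_in_memory(n_bytes, ram_gib):
    # Compare the estimate with the RAM of a single machine, given in GiB.
    return n_bytes <= ram_gib * 1024**3


# Placeholder numbers, NOT real GitTables statistics.
size = estimate_lake_bytes(n_tables=1_000_000, avg_rows=100, avg_cols=10)

print(size)                              # 8000000000 bytes
print(round(size / 1024**3, 2), "GiB")
print(fits_in_memory(size, ram_gib=16))  # single 16 GiB machine
print(fits_in_memory(size, ram_gib=4))   # single 4 GiB machine
```

If the estimate fits comfortably on one machine, plain pandas may be enough; if it does not, a partitioned DataFrame (Dask or PySpark) or not holding everything in memory becomes the relevant option.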

Another direction is computation speed. Which one is faster when we have a massive lake, and which when we have a small one? You do not need to run tests; you can just report what you find in your reading.

I'm adding some resources here. You can also share your references, as you did for the Dask documentation.

https://www.dominodatalab.com/blog/spark-dask-ray-choosing-the-right-framework

https://medium.com/geekculture/dask-or-spark-a-comparison-for-data-scientists-d4cba8ba9ef7

https://stackoverflow.com/questions/38882660/at-what-situation-i-can-use-dask-instead-of-apache-spark

https://coiled.io/blog/dask-as-a-spark-replacement/

LUH-DBS / Matelda

Dask vs. Pyspark #9
