LUH-DBS / Matelda


Dask vs. Pyspark #9

Closed MarcSpeckmann closed 5 months ago

MarcSpeckmann commented 1 year ago

Creating a pros and cons list for each of these packages. The list should help to decide which of the two is better suited to scale the project.

MarcSpeckmann commented 1 year ago

| Attribute | Dask | PySpark |
| --- | --- | --- |
| Reputation (GitHub stars) | 10.3k | 33.8k |
| Language | Python | Scala (supports Python and R) |
| Ecosystem | Component of the Python ecosystem; couples with and enhances NumPy, pandas, etc. | Own ecosystem; integrates with other Apache projects |
| Created | 2014 | 2010 |
| Scale | Single node to thousand-node clusters | Single node to thousand-node clusters |
| Scope | Business intelligence, scientific and custom situations | Business intelligence (e.g. SQL and lightweight machine learning) |
| Machine learning | Scikit-learn, XGBoost, dask-ml | Spark MLlib (only algorithms that are easy to implement at scale); often preferable to use another JVM-based library |
| DataFrame | Reuses the pandas API and memory model; no SQL or query optimizer; random access and pandas index operations (see sketch below) | Own API and memory model; implements a large SQL subset; high-level query optimizer |
| Graphs | Not supported | GraphX library |
| Arrays | Fully supports NumPy-style scalable multi-dimensional arrays | No native multi-dimensional arrays; 2D arrays in MLlib |
| Custom parallelism | Arbitrary task graphs for non-standard collections | Composes computations out of its high-level primitives (map, reduce, groupby, join, …) |
| Design | Lightweight, interoperates with other technologies | All-in-one solution |
| Cluster | Native Slurm support | Can run on Slurm |

Sources: Dask documentation
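
To make the DataFrame row above more concrete, here is a minimal sketch of the same aggregation in both APIs (the file name and the column names `column_type` / `num_rows` are hypothetical placeholders): Dask mirrors the pandas syntax and stays lazy until `.compute()`, while PySpark uses its own DataFrame API and runs queries through its optimizer.

```python
# Hypothetical comparison: the same group-by/mean in Dask and PySpark.
# "tables.csv", "column_type" and "num_rows" are placeholder names.
import dask.dataframe as dd
from pyspark.sql import SparkSession, functions as F

# Dask: pandas-style API, lazy task graph, materialized with .compute()
ddf = dd.read_csv("tables.csv")
dask_result = ddf.groupby("column_type")["num_rows"].mean().compute()

# PySpark: its own DataFrame API, optimized before execution
spark = SparkSession.builder.appName("dask-vs-pyspark").getOrCreate()
sdf = spark.read.csv("tables.csv", header=True, inferSchema=True)
spark_result = sdf.groupBy("column_type").agg(F.avg("num_rows")).collect()
```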

FatemehAhmadi94 commented 1 year ago

@XxHalbfettxX Thanks Marc, can you please provide a table with the best option for each of our modules?

MarcSpeckmann commented 1 year ago

WIP

| Module | Choice | Reason Dask | Reason PySpark |
| --- | --- | --- | --- |
| app_logger | Doesn't matter | Python logging module | Log4j from the PySpark context |
| check_results | Doesn't matter | | |
| cols_grouping | Dask | DBSCAN available natively via scikit-learn (see sketch below) | Not included in MLlib, external implementation needed |
| dataset_clustering | Dask | DBSCAN available natively via scikit-learn | Not included in MLlib, external implementation needed |
| ed_twolevel_rahas_features | | | |
| end_to_end_eds | | | |
| extract_labels | | | |
| generate_raha_features | | | |
| saving_results | | | |
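
For the two clustering modules, a rough sketch of how the Dask route could look, assuming the column features fit in memory once materialized (the array shape, `eps`, and `min_samples` are placeholders): scikit-learn's DBSCAN is not distributed, so the Dask array is only used to build the feature matrix out of core before fitting.

```python
# Hypothetical sketch for cols_grouping / dataset_clustering with Dask + scikit-learn.
import dask.array as da
from sklearn.cluster import DBSCAN

# Placeholder feature matrix: one row per column, 10 features each.
features = da.random.random((100_000, 10), chunks=(10_000, 10))

# scikit-learn's DBSCAN expects in-memory data, so materialize once before fitting;
# eps and min_samples are placeholder values, not tuned for our data.
labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(features.compute())
```
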
MarcSpeckmann commented 1 year ago

I find it a bit difficult to fill in the table for each module. For some modules there are obvious differences, but for most of them I can't really write anything, because most tasks can be solved with PySpark as well as with Dask; at least I think they can be solved with both. I lack the experience to say whether the Spark DataFrame is more efficient than Dask's implementation.

FatemehAhmadi94 commented 1 year ago

@XxHalbfettxX Thanks Marc. Let's look at it from another direction. Think about each module and what we keep in memory. For example, for generating features before classification, we keep all features of all tables in memory. When we are talking about a whole data lake, which one makes sense to use: PySpark DataFrames, Dask DataFrames, or even pandas? Or could we avoid storing the data in memory altogether? For your calculations you can think of GitTables and the statistics we can see there, such as the number of tables, rows, and columns. We don't need a very accurate estimate; we just need to know which one performs better.

Another direction is the speed of calculations. Which one is better when we have a massive or a small lake? You do not need to run tests; you can just report what you found in your reading.
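
As a rough illustration of the memory question (the path and CSV layout are hypothetical, and for simplicity the tables are assumed to share a schema): Dask can describe all tables of a lake lazily and only materialize small aggregates, whereas pandas would load every table eagerly.

```python
# Hypothetical sketch: estimating lake-wide statistics without holding all tables in memory.
import dask.dataframe as dd

# Lazily reference every table in a (placeholder) data-lake directory; nothing is read yet.
lake = dd.read_csv("datalake/*.csv", dtype=str)

# Only per-partition row counts are computed; the full tables never sit in memory at once.
total_rows = lake.map_partitions(len).sum().compute()
print(f"Rows across the lake: {total_rows}")
```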

I'm adding some resources here. You can also share your references, as you did for the Dask website.

https://www.dominodatalab.com/blog/spark-dask-ray-choosing-the-right-framework

https://medium.com/geekculture/dask-or-spark-a-comparison-for-data-scientists-d4cba8ba9ef7

https://stackoverflow.com/questions/38882660/at-what-situation-i-can-use-dask-instead-of-apache-spark

https://coiled.io/blog/dask-as-a-spark-replacement/