Closed MarcSpeckmann closed 5 months ago
Attribute | Dask | PySpark |
---|---|---|
Reputation (GitHub stars) | 10,3k | 33,8k |
Language | Python | Scala (Supports Python and R ) |
Ecosystem | Component of python. Couples and enhances numpy, pandas etc. | Own ecosystem. Integrates with other Apache projects. |
Created | 2014 | 2010 |
Scale | Single node to thousand-node clusters | Single node to thousand-node clusters |
Scope | Business intelligence, scientific and custom situations | Business intelligence (e.g. SQL and lightweight machine learning) |
Machine learning | Scikit-learn, XGBoost, dask-ml | Spark MLLib (only easy to implement algorithm), preferable to use other JVM based library |
DataFrame | Reuses the Pandas API and memory model. No SQL and query optimizer. Random access, pandas index operations | Own API and memory model, implements big SQL subset, high level optimizer for queries |
Graphs | Not supported | GraphX library |
Arrays | Fully supports numpy scalable multi-dimensional arrays | No native multi-dimensional arrays. 2D arrays in MLLib |
Custom parallelism | Arbitrary task graphs for non standard collections | Composes computations out of their high-level primitives (map, reduce, groupby, join, …) |
Lightweight, interoperates with other technologies | All-in-one solution | |
Cluster | Native Slum support | Can run on Slurm |
Sources: Dask documentation
@XxHalbfettxX Thanks Marc, can you please provide a table with the best option for each of our modules?
Module | Choice | Reason Dask | Reason PySpark |
---|---|---|---|
app_logger | Doesn't matter | Python logging module | Log4J from PySpark context |
check_results | Doesn't matter | ||
cols_grouping | Dask | DBSCAN native sklearn | Not included in MLLib, extern implementation needed |
dataset_clustering | Dask | DBSCAN native sklearn | Not included in MLLib, extern implementation needed |
ed_twolevel_rahas_features | |||
end_to_end_eds | |||
extract_labels | |||
generate_raha_features | |||
saving_results |
I find it a bit difficult to create the table for each module. For some modules there are obvious differences, but for most of them I can't really write anything because a lot of tasks can be solved with pyspark as well as with dask. At least I think it can be solved with both. There I lack the experience to say if the spark dataframe is better/efficient than the implementation of dask.
@XxHalbfettxX Thanks Marc. Let's look at it in other directions. Think about each module, and what we keep in the memory. For example, for generating features before classification, we keep all features of all tables into memory. When we want to talk about a whole data lake, which one makes sense to use? PySpark DataFrames? or Dask DataFrames? or even Pandas? Or even we can omit storing data in memory... For your computations, you can think about GitTables, and think about the statistics that we can see there, like the number of tables, rows, and columns. We don't need a very accurate estimate, we just need to know which one performs better.
Also, another direction is the speed of calculations. Which one is better than the other when we have a massive or small lake? You do not need to do tests. You can just report the results of your reading process.
I add some resources here. You can also share your references as you did for Dask website.
https://www.dominodatalab.com/blog/spark-dask-ray-choosing-the-right-framework
https://medium.com/geekculture/dask-or-spark-a-comparison-for-data-scientists-d4cba8ba9ef7
Creating a pros and cons list for each of these packages. The list should help to decide which of the two is better suited to scale the project.