fugue-project / tutorials

Tutorials for Fugue - A unified interface for distributed computing. Fugue executes SQL, Python, and Pandas code on Spark and Dask without any rewrites.
https://fugue-tutorials.readthedocs.io/
Apache License 2.0

What's the difference between processor & transformer? #75

Closed rdmolony closed 2 years ago

rdmolony commented 3 years ago

My attempt at finding the differences from the docs:

A processor runs on the driver side and a transformer on the workers. Driver side means that the processor is aware of its execution engine, while the worker-side transformer is not. Using a processor explicitly specifies in the DAG that this step is not ExecutionEngine-agnostic.

kvnkho commented 3 years ago

This is certainly one of the harder concepts to immediately grasp (and this can be made clearer), but there are a couple of key differences.

  1. Using the execution engine directly is what allows you to use Spark or Dask commands. Because the transformer operates on partitions in worker machines, it works with local DataFrames whereas Spark and Dask commands work on distributed DataFrames.

  2. It is very tempting in Fugue to use something like sklearn's MinMaxScaler, which normalizes a column based on its minimum and maximum values. The behavior differs depending on whether the normalizing logic runs on the driver or on the workers. On the workers, it runs locally without access to the global dataset, so the min and max used for scaling are computed at the partition level. The Spark MinMaxScaler, on the other hand, obtains the global min and max values for scaling.

You need to think in terms of map operations and aggregate operations. Let's say map is row-wise and aggregate is column-wise. For map operations, the difference in behavior between transformer and processor will be minimal. For aggregate operations, you can get different values.
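To make the map versus aggregate point concrete, here is a minimal sketch in plain Python. It does not use Fugue, Spark, or Dask; the two lists standing in for worker partitions and the helper `min_max_scale` are hypothetical, chosen only to show why partition-level scaling and global scaling can disagree while a row-wise operation does not.

```python
def min_max_scale(values):
    """Scale values to [0, 1] using the min and max of the values passed in."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Avoid dividing by zero when every value is identical.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Hypothetical column split across two worker partitions (assumed data).
partitions = [[0.0, 5.0, 10.0], [10.0, 55.0, 100.0]]

# Worker-side view (transformer-like): each partition only sees its own rows,
# so the min and max are computed per partition.
per_partition = [min_max_scale(p) for p in partitions]

# Driver-side view (processor-like): the min and max come from the whole column.
all_values = [v for p in partitions for v in p]
global_scaled = min_max_scale(all_values)

# A map-style (row-wise) operation does not depend on how the rows are split.
doubled_per_partition = [v * 2 for p in partitions for v in p]
doubled_global = [v * 2 for v in all_values]

print(per_partition)
print(global_scaled)
print(doubled_per_partition == doubled_global)
```

Here the value 10.0 scales to 1.0 in the first partition and 0.0 in the second when scaled locally, but to 0.1 in both places when the global min and max are used. The doubled values match either way.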

rdmolony commented 3 years ago

Thanks Kevin, your aggregate caveat makes it a bit clearer for me

rdmolony commented 3 years ago

Perhaps this could be added to beginner_extensions.ipynb, somewhere in extensions/ or an FAQ?

kvnkho commented 3 years ago

This should ideally be made clear in the Extensions section. It took me a long time (a couple of months) to understand it myself.