This is certainly one of the harder concepts to grasp immediately (and it can be made clearer), but there are a couple of key differences.

Using the execution engine directly is what allows you to use Spark or Dask commands. Because a `transformer` operates on partitions on the worker machines, it works with local DataFrames, whereas Spark and Dask commands work on distributed DataFrames.
Something very tempting in Fugue is using something like `sklearn.preprocessing.MinMaxScaler`, which normalizes a column based on its minimum and maximum values. The behavior differs depending on whether your normalizing logic happens on the driver or on the workers. On the workers, it happens locally, without access to the global dataset, so the min and max used for scaling are obtained at the partition level. On the other hand, using the Spark `MinMaxScaler` obtains the global min and max values for scaling.
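To make this concrete, here is a minimal sketch of a transformer applying the sklearn scaler per partition (the column names `group` and `value` and the sample data are made up for illustration; this assumes `fugue`, `pandas`, and `scikit-learn` are installed):

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler
from fugue import transform

def scale(df: pd.DataFrame) -> pd.DataFrame:
    # Runs independently on each partition: the min and max come from
    # the partition's rows only, not from the global dataset.
    df = df.copy()
    df["value"] = MinMaxScaler().fit_transform(df[["value"]])
    return df

df = pd.DataFrame({"group": ["a", "a", "b", "b"], "value": [1.0, 2.0, 3.0, 4.0]})

# Partitioned by "group", each group is scaled against its own min/max,
# so both groups end up spanning the full [0, 1] range.
print(transform(df, scale, schema="*", partition={"by": "group"}))
```

Running the same logic through the Spark `MinMaxScaler` instead would scale every row against the global min of 1.0 and max of 4.0.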
You need to think in terms of `map` operations and `aggregate` operations. Let's say `map` is row-wise and `aggregate` is column-wise. For `map` operations, the difference in behavior between `transformer` and `processor` will be minimal. For `aggregate` operations, you can get different values.
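A hedged sketch of that distinction, using made-up functions on an assumed `value` column (either one could be passed to `transform` exactly as above):

```python
import pandas as pd

def double(df: pd.DataFrame) -> pd.DataFrame:
    # Map-like: each row is handled on its own, so the result is the
    # same no matter how the data is partitioned.
    df = df.copy()
    df["value"] = df["value"] * 2
    return df

def demean(df: pd.DataFrame) -> pd.DataFrame:
    # Aggregate-like: the mean is computed from the rows this function
    # sees, so the result depends on the partitioning.
    df = df.copy()
    df["value"] = df["value"] - df["value"].mean()
    return df
```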
Thanks Kevin, your `aggregate` caveat makes it a bit clearer for me.

Perhaps this could be added to `beginner_extensions.ipynb`, somewhere in `extensions/`, or an FAQ?
This should ideally be made clear in the Extensions section of the docs. It took me a long time (a couple of months) to understand this myself.
Another difference: `processor` being able to take >1 input? My attempt at finding the differences from the docs:

- `processor` is on the driver side and `transformer` on the worker side.
- Driver side means that `processor` is aware of its execution engine, while the worker-side `transformer` is not.
- Using `processor` explicitly specifies in the DAG that this step is not ExecutionEngine-agnostic.
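A rough, non-authoritative sketch of those points (the function and variable names are made up, and the union is just one example of an engine-level operation):

```python
from fugue import DataFrame, ExecutionEngine, FugueWorkflow

def union_all(e: ExecutionEngine, df1: DataFrame, df2: DataFrame) -> DataFrame:
    # A processor runs on the driver: it can receive the execution
    # engine plus more than one (possibly distributed) DataFrame,
    # and call engine-level operations directly.
    return e.union(df1, df2, distinct=False)

dag = FugueWorkflow()
a = dag.df([[0]], "x:int")
b = dag.df([[1]], "x:int")
dag.process(a, b, using=union_all).show()
dag.run()  # default local engine; pass a Spark/Dask engine to distribute
```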