AlexIoannides / pyspark-example-project

Implementing best practices for PySpark ETL jobs and applications.

Best practice around passing DF to multiple functions #18

Closed raxit65535 closed 3 years ago

raxit65535 commented 3 years ago

Hi Alex, this is quite an informative and helpful project. However, I have a question about the ETL job.

In extract_data(spark), we build a DataFrame from different sources (e.g. S3, CSV files or a database),

and suppose that in transform_data(df, steps_per_floor_) we have multiple other methods to transform the data,

e.g. cleaning(df), setup_different_conditional_edits(df), windowing_function_transformations(df), other_statistical_transformations(df).

Is it efficient to pass a DataFrame as a Python method argument, or is using a UDF more performant? And what if the DataFrame is around 10-100 GB in size?

Can you point me in the right direction? Thanks for this helpful project :)

AlexIoannides commented 3 years ago

In the example, transform_data represents one step. In practice, there can be many such steps, as you allude to. In this instance, best practice, as far as I'm concerned, is to 'chain' the steps together - e.g.,

df_1 = cleaned_data(df)
df_2 = other_transformations(df_1)
...

What you do inside each transformation function, and how you choose to do it (e.g. with UDFs or otherwise), is up to you. What we're trying to achieve here is the ability to compose ETL jobs from various sub-tasks and to make those sub-tasks easy to test.
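
On the performance question: passing a DataFrame between Python functions only passes a reference to its lazy query plan, not the data itself, so the cost of the function call does not depend on whether the underlying data is 10 GB or 100 GB. As a minimal sketch of what the chaining pattern might look like end to end - the function and column names here (cleaning, add_floor_counts, steps, floors) are hypothetical and not part of this repo:

# Minimal sketch (assumed names, not from this repo): each sub-step takes and
# returns a DataFrame, and transform_data simply composes them.
from pyspark.sql import DataFrame, SparkSession
from pyspark.sql import functions as F

def cleaning(df: DataFrame) -> DataFrame:
    # Drop rows with a missing id and trim whitespace from the name column.
    return df.dropna(subset=["id"]).withColumn("name", F.trim(F.col("name")))

def add_floor_counts(df: DataFrame, steps_per_floor: int) -> DataFrame:
    # Derive a floors column from a (hypothetical) steps column.
    return df.withColumn("floors", F.col("steps") / steps_per_floor)

def transform_data(df: DataFrame, steps_per_floor: int) -> DataFrame:
    # Chain the sub-steps; nothing is computed until an action is triggered.
    df_1 = cleaning(df)
    df_2 = add_floor_counts(df_1, steps_per_floor)
    return df_2

if __name__ == "__main__":
    spark = SparkSession.builder.appName("chaining_example").getOrCreate()
    df = spark.createDataFrame(
        [(1, " alice ", 210), (2, "bob", 42), (None, "eve", 7)],
        ["id", "name", "steps"],
    )
    transform_data(df, steps_per_floor=21).show()
    spark.stop()

Because each sub-step is a plain function from DataFrame to DataFrame, it can be unit tested in isolation against a small local SparkSession, which is the point of structuring the job this way.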