AlexIoannides / pyspark-example-project

Implementing best practices for PySpark ETL jobs and applications.

Best practice around passing DF to multiple functions #18

Closed raxit65535 closed 3 years ago

raxit65535 commented 3 years ago

Hi Alex, this is quite an informative and helpful project. However, I have a question about the ETL job.

In extract_data(spark), we build a DataFrame from different sources (e.g. S3, CSV files or a database),

and suppose that in transform_data(df, steps_per_floor_) we have multiple other methods to transform the data,

e.g. cleaning(df), setup_different_conditional_edits(df), windowing_function_transformations(df), other_statistical_transformations(df).

Is it efficient to pass a DataFrame as a Python method argument, or is using a UDF more performant? And what if the DataFrame is around 10-100 GB in size?

Can you point me in the right direction? Thanks for this helpful project :)

AlexIoannides commented 3 years ago

In the example, transform_data represents one step. In practice, there can be many such steps, as you allude to. In this instance, best practice, as far as I'm concerned, is to 'chain' the steps together - e.g.,

df_1 = cleaned_data(df)
df_2 = other_transformations(df_1)
...

What you do inside each transformation function, and how you choose to do it (e.g. with UDFs or otherwise), is up to you. What we're trying to achieve here is the ability to compose ETL jobs from various sub-tasks and to make those sub-tasks easy to test.
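
On the performance question: passing a DataFrame between Python functions only passes a reference to its lazy query plan, not the data itself, so the cost of the function call does not depend on whether the underlying data is 10 GB or 100 GB. As a minimal sketch of what the chaining pattern might look like end to end - the function and column names here (cleaning, add_floor_counts, steps, floors) are hypothetical and not part of this repo:

# Minimal sketch (assumed names, not from this repo): each sub-step takes and
# returns a DataFrame, and transform_data simply composes them.
from pyspark.sql import DataFrame, SparkSession
from pyspark.sql import functions as F

def cleaning(df: DataFrame) -> DataFrame:
    # Drop rows with a missing id and trim whitespace from the name column.
    return df.dropna(subset=["id"]).withColumn("name", F.trim(F.col("name")))

def add_floor_counts(df: DataFrame, steps_per_floor: int) -> DataFrame:
    # Derive a floors column from a (hypothetical) steps column.
    return df.withColumn("floors", F.col("steps") / steps_per_floor)

def transform_data(df: DataFrame, steps_per_floor: int) -> DataFrame:
    # Chain the sub-steps; nothing is computed until an action is triggered.
    df_1 = cleaning(df)
    df_2 = add_floor_counts(df_1, steps_per_floor)
    return df_2

if __name__ == "__main__":
    spark = SparkSession.builder.appName("chaining_example").getOrCreate()
    df = spark.createDataFrame(
        [(1, " alice ", 210), (2, "bob", 42), (None, "eve", 7)],
        ["id", "name", "steps"],
    )
    transform_data(df, steps_per_floor=21).show()
    spark.stop()

Because each sub-step is a plain function from DataFrame to DataFrame, it can be unit tested in isolation against a small local SparkSession, which is the point of structuring the job this way.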