Closed: raxit65535 closed this issue 3 years ago
In the example, `transform_data` represents one step. In practice, there can be many such steps, as you allude to. In this instance, best practice, as far as I'm concerned, is to 'chain' the steps together - e.g.,
```python
df_1 = cleaned_data(df)
df_2 = other_transformations(df_1)
...
```
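Here is a minimal runnable sketch of that pattern. The sub-task names (`clean_data`, `add_floor_elevation`) and column names are hypothetical stand-ins for your own steps, not part of the project:

```python
# Sketch of chained transformation steps, assuming PySpark is installed.
# Function and column names here are illustrative only.
from pyspark.sql import DataFrame
from pyspark.sql import functions as F


def clean_data(df: DataFrame) -> DataFrame:
    """Drop rows with a null id and trim whitespace from the name column."""
    return df.dropna(subset=["id"]).withColumn("name", F.trim(F.col("name")))


def add_floor_elevation(df: DataFrame, steps_per_floor: int) -> DataFrame:
    """Derive an elevation column from a floor number."""
    return df.withColumn("elevation", F.col("floor") * steps_per_floor)


def transform_data(df: DataFrame, steps_per_floor: int) -> DataFrame:
    """Compose the sub-tasks into the full transformation step."""
    df_1 = clean_data(df)
    df_2 = add_floor_elevation(df_1, steps_per_floor)
    return df_2
```

Each function takes a DataFrame and returns a new DataFrame, so the steps compose cleanly and each one can be tested on its own.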
What you do inside each transformation function, and how you choose to do it (e.g. with UDFs or otherwise), is up to you. What we're trying to achieve here is the ability to compose ETL jobs from various sub-tasks and to make those sub-tasks easy to test.
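For instance, a sub-task written this way can be unit-tested against a local SparkSession. This is a pytest-style sketch, assuming the hypothetical `clean_data` function above:

```python
from pyspark.sql import SparkSession


def test_clean_data():
    # A local session is enough for unit tests; no cluster required.
    spark = SparkSession.builder.master("local[1]").getOrCreate()
    df = spark.createDataFrame([(1, " anna "), (None, "bob")], ["id", "name"])
    result = clean_data(df)
    assert result.count() == 1               # null-id row dropped
    assert result.first()["name"] == "anna"  # whitespace trimmed
```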
Hi Alex, this is quite an informative and helpful project. However, I have a question about the ETL job.
In `extract_data(spark)`, we build a DataFrame from different sources (e.g. S3, CSV, or any database). Suppose that in `transform_data(df, steps_per_floor_)` we have multiple other methods to transform the data, i.e. `cleaning(df)`, `setup_different_conditional_edits(df)`, `windowing_function_transformations(df)`, `other_statestical_tranformations(df)`.
Is it performance-efficient to pass a DataFrame as a Python method argument, or is using a UDF more efficient? What if the DataFrame size is ~10-100 GB?
Can you point me in the right direction? Thanks for this helpful project :)